
Querying wiki/dbpedia for presidents' ages at inauguration

Easier than Jon Udell had thought.

In an August 19th interview with Jon Udell, David Huynh of Freebase (and formerly of MIT's Project Simile) introduced his Freebase demo by describing a hypothetical database query that asks for presidents' ages at inauguration and whether there is a trend toward younger presidents. Jon replies:

If it were possible to issue a database query over Wikipedia, then you could ask a question like that; you could say give me—well first of all, it would presume that you could identify US presidents, and it would further presume that you could find a field within those documents that would say the ages of those people, and that's not really part of the structure of Wikipedia. This information can be explicitly made available in Freebase. It hasn't in all cases, and that's part of the social process. So it ultimately relies on people to refine this raw information that came from Wikipedia and elsewhere so that it is more fielded and structured.

They then go on to do such a query with Freebase... but they could have done it with Wikipedia, with a little help from SPARQL and DBpedia.

Wikipedia has plenty of fielded information in its infoboxes, and DBpedia lets you access this collection of data via a SPARQL endpoint. While Wikipedia (and hence DBpedia) has no field for a president's age at inauguration, it does store each president's birth date and the year his first term began, so calculating their ages when they became president is pretty easy.

You could see a list of US presidents by going to Wikipedia's List of Presidents of the United States page, but let's do it programmatically with this SPARQL query so that we can build from there to get their ages at inauguration. We ask for the things in the database that have a subject of "Presidents of the United States":

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?presName WHERE {
  ?presName skos:subject <http://dbpedia.org/resource/Category:Presidents_of_the_United_States>.
}

To see the query in action, click this executable URL version.
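
For anyone curious how a query becomes an "executable URL", here is a minimal Python sketch. It assumes the endpoint accepts the query text in a `query` parameter of a GET request, as the SPARQL protocol describes; the URL actually linked above may carry additional parameters, and the function name `executable_url` is made up for this illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint address as used in the curl command later in this post.
ENDPOINT = "http://dbpedia.org/sparql"

QUERY = """PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?presName WHERE {
  ?presName skos:subject <http://dbpedia.org/resource/Category:Presidents_of_the_United_States>.
}"""

def executable_url(query, endpoint=ENDPOINT):
    """Return a GET URL that carries the SPARQL text in a `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})

url = executable_url(QUERY)
print(url)

# Decoding the URL again recovers the original query text unchanged.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == QUERY)
```

Percent-encoding is what lets the angle brackets, spaces, and line breaks of the query survive inside a single clickable link.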

A slightly more complex query lists the name, birth date, and beginning of the first term of each one:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dbpedia2: <http://dbpedia.org/property/>
SELECT ?presName ?birthday ?startDate WHERE {
  ?presName skos:subject <http://dbpedia.org/resource/Category:Presidents_of_the_United_States>;
            dbpedia2:birth ?birthday;
            dbpedia2:presidentStart ?startDate.
}

Click here to see it in action. Because the fielded information for the various presidents is not consistent, only 19 of them have both dbpedia2:birth and dbpedia2:presidentStart fields, so you'll only see those presidents returned for this query. Wikipedia pages for all US presidents do have this information, but the fields are not always named the same way. For some examples, compare the DBpedia pages for Zachary Taylor and Lyndon Johnson; Johnson doesn't show up in the list returned by that last query. As Jon said, filling out that data is part of the social process.
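
To see why a president drops out of the results, it may help to model the matching rule in a few lines of Python. This is only a toy sketch: the second record's property names (`dateOfBirth`, `termStart`) are invented stand-ins for "named differently", not the properties DBpedia actually uses for Johnson.

```python
# The two properties that the query's graph pattern requires.
REQUIRED = ("birth", "presidentStart")

resources = {
    "Zachary_Taylor": {"birth": "1784-11-24", "presidentStart": 1849},
    # Hypothetical alternative property names, invented for this sketch.
    "Lyndon_B._Johnson": {"dateOfBirth": "1908-08-27", "termStart": 1963},
}

def matches(properties, required=REQUIRED):
    """A resource matches only if every required property is present."""
    return all(name in properties for name in required)

matched = sorted(name for name, props in resources.items() if matches(props))
print(matched)  # ['Zachary_Taylor']
```

The query behaves the same way: a resource that lacks either property under exactly that name produces no row at all, rather than a row with blanks.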

The real promise of Linked Data is the ability to write a program or script that grabs the data and does something with it, so I wrote a two-line batch file that:

  1. uses curl to send that URL to DBpedia and store the results in an XML file
  2. runs a short XSLT script to calculate the presidents' ages at inauguration

It doesn't look much like a two-line batch file here, so before running it, replace the first six carriage returns with spaces to turn the first seven lines into one:

curl -o presidentAges.xml -F "query=PREFIX dbpedia2: 
  <http://dbpedia.org/property/> PREFIX skos: 
  <http://www.w3.org/2004/02/skos/core#> SELECT ?presName,?birthday, 
  ?startDate WHERE { ?presName skos:subject 
  <http://dbpedia.org/resource/Category:Presidents_of_the_United_States>; 
  dbpedia2:birth ?birthday; dbpedia2:presidentStart ?startDate.}" 
  http://dbpedia.org/sparql 
xsltproc presidentAges.xsl presidentAges.xml

Here is the XSLT stylesheet, presidentAges.xsl:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:s="http://www.w3.org/2005/sparql-results#"
                version="1.0">

  <xsl:strip-space elements="*"/>
  <xsl:output method="text"/>


  <xsl:template match="s:result">

    <xsl:variable name="birthYear"
                  select="substring(
                          s:binding[@name='birthday']/s:literal,1,4)"/>
    <xsl:variable name="presidentName"
                  select="substring(s:binding[@name='presName']/s:uri,29)"/>

    <xsl:value-of select="translate($presidentName,'_',' ')"/>
    <xsl:text> </xsl:text>
    <xsl:value-of select="s:binding[@name='startDate']/
                          s:literal - $birthYear - 1"/>
<xsl:text>
</xsl:text>
  </xsl:template>
</xsl:stylesheet>

I subtracted the birth year and then another 1 from the startDate because, with inaugurations being in January (at least in modern times), I assumed that each president had not yet reached his birthday that year. Here is the result:

Abraham Lincoln 51
Andrew Johnson 56
Bill Clinton 46
Chester A. Arthur 50
Franklin Pierce 48
George H. W. Bush 64
George Washington 56
Harry S. Truman 60
James K. Polk 49
James Monroe 58
John Adams 61
John Quincy Adams 57
Martin Van Buren 54
Millard Fillmore 49
Richard Nixon 55
Rutherford B. Hayes 54
Thomas Jefferson 57
Ulysses S. Grant 46
Zachary Taylor 64
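
If xsltproc isn't handy, the stylesheet's logic can also be sketched in Python with the standard library. The sample document below is hand-written for this illustration in the shape of the SPARQL Query Results XML Format, not a captured DBpedia response, and the function name is my own.

```python
import xml.etree.ElementTree as ET

NS = {"s": "http://www.w3.org/2005/sparql-results#"}
RESOURCE_PREFIX = "http://dbpedia.org/resource/"

# Hand-written sample in the SPARQL results XML shape (an assumption,
# not real endpoint output).
SAMPLE = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="presName"/>
    <variable name="birthday"/>
    <variable name="startDate"/>
  </head>
  <results>
    <result>
      <binding name="presName"><uri>http://dbpedia.org/resource/Abraham_Lincoln</uri></binding>
      <binding name="birthday"><literal>1809-02-12</literal></binding>
      <binding name="startDate"><literal>1861</literal></binding>
    </result>
  </results>
</sparql>"""

def ages_at_inauguration(xml_text):
    """Yield (name, approximate age) pairs, mirroring the stylesheet's arithmetic."""
    root = ET.fromstring(xml_text)
    for result in root.iterfind(".//s:result", NS):
        uri = result.find("s:binding[@name='presName']/s:uri", NS).text
        birthday = result.find("s:binding[@name='birthday']/s:literal", NS).text
        start = result.find("s:binding[@name='startDate']/s:literal", NS).text
        # Strip the resource prefix and turn underscores into spaces.
        name = uri[len(RESOURCE_PREFIX):].replace("_", " ")
        # Start year minus birth year minus one, as in the XSLT.
        yield name, int(start) - int(birthday[:4]) - 1

for name, age in ages_at_inauguration(SAMPLE):
    print(name, age)  # Abraham Lincoln 51
```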

I never realized that Grant was the same age as Clinton when he started—a year younger than Obama is now—but having led the army that won the US Civil War, I guess he had reasons to look a bit older at the start of his term.
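
A caveat on those numbers: the minus-one shortcut is only exact when the birthday falls later in the year than the inauguration. The sketch below compares the shortcut with a full date calculation. The inauguration dates are supplied by me for illustration (the query returns only a start year), and both function names are made up.

```python
from datetime import date

def approx_age(birth, start_year):
    """The shortcut used above: year difference minus one."""
    return start_year - birth.year - 1

def exact_age(birth, on):
    """Completed years between `birth` and the date `on`."""
    had_birthday = (on.month, on.day) >= (birth.month, birth.day)
    return on.year - birth.year - (0 if had_birthday else 1)

# Birth and inauguration dates supplied for this sketch, not query output.
lincoln_birth = date(1809, 2, 12)
clinton_birth = date(1946, 8, 19)

# Lincoln's February birthday precedes his March 4 inauguration.
print(approx_age(lincoln_birth, 1861), exact_age(lincoln_birth, date(1861, 3, 4)))   # 51 52
# Clinton's August birthday follows his January 20 inauguration.
print(approx_age(clinton_birth, 1993), exact_age(clinton_birth, date(1993, 1, 20)))  # 46 46
```

So the shortcut can run a year low for presidents with birthdays early in the year, which is worth keeping in mind before reading too much into one-year differences.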

DBpedia + SPARQL is my new favorite toy, and I'm getting more and more ideas lately about useful (or at least fun) things to do with the combination.

Comments

(Note: I usually close comments for an entry a few weeks after posting it to avoid comment spam.)

Bob,

Nice demo :-)

Your juxtaposition of the Udell comments re. parallax provides much-needed additional insight re. the utility of DBpedia and SPARQL.

Kingsley

> and whether there's a trend that we're getting younger presidents.

I think for that part we still need a time against age graph with a poly-fit, preferably in SVG.

Kingsley has sent me a version of this query that uses an OpenLink extension to do the age calculation as part of the SPARQL query, without the need for the XSLT part. Click http://tinyurl.com/48l6c6 to see the query and execute it. Very cool.

Hi Bob,

fantastic stuff! The issue you were having with different RDF properties for the same relation will be solved shortly. I'll release a new version of the infobox dataset (based on a new extraction approach) in the next few days.

Cheers, Georgi

Due to a change in the underlying engine and an update of the DBpedia instance, here is the revised query:

SELECT ?presName, ?birthday, ?startDate, (bif:datediff("year", ?birthday, xsd:date(bif:sprintf("%d-01-20", ?startDate)))) as ?age_at_innaguration
WHERE {
  ?presName skos:subject <http://dbpedia.org/resource/Category:Presidents_of_the_United_States>;
            dbpedia2:birth ?birthday;
            dbpedia2:presidentStart ?startDate.
  filter (datatype(?startDate) = xsd:integer)
}

Live Link: http://tinyurl.com/4edjzl