Querying wiki/dbpedia for presidents' ages at inauguration
Easier than Jon Udell had thought.
In an August 19th interview with Jon Udell, David Huynh of Freebase (and formerly of MIT's Project Simile) introduced his Freebase demo by describing a hypothetical query to a database asking for presidents' ages when they are inaugurated and whether there's a trend that we're getting younger presidents. Jon replies:
If it were possible to issue a database query over Wikipedia, then you could ask a question like that; you could say give me—well first of all, it would presume that you could identify US presidents, and it would further presume that you could find a field within those documents that would say the ages of those people, and that's not really part of the structure of Wikipedia. This information can be explicitly made available in Freebase. It hasn't in all cases, and that's part of the social process. So it ultimately relies on people to refine this raw information that came from Wikipedia and elsewhere so that it is more fielded and structured.
They then go on to do such a query with Freebase... but they could have done it with Wikipedia, with a little help from SPARQL and DBpedia.
Wikipedia has plenty of fielded information in infoboxes. DBpedia lets you access this collection of data via a SPARQL endpoint. While Wikipedia (and hence DBpedia) has no field for a president's age at inauguration, it does store each president's birth date and the year his first term began, so calculating each one's age on becoming president is pretty easy.
You could see a list of US Presidents by going to Wikipedia's List of Presidents of the United States page, but let's do it programmatically with this SPARQL query so that we can build from there to get their ages at inauguration. We ask for the things in the database that have a subject of "Presidents of the United States":
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?presName WHERE {
  ?presName skos:subject <http://dbpedia.org/resource/Category:Presidents_of_the_United_States>.
}
To see the query in action, click this executable URL version.
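For readers who want to build such an executable URL themselves, here is a minimal Python sketch. It assumes the endpoint accepts the query text in a `query` parameter on a GET request, which is the same parameter name the curl command later in this post uses; the function name `executable_url` is made up for illustration.

```python
from urllib.parse import urlencode

# Endpoint address as used elsewhere in this post.
ENDPOINT = "http://dbpedia.org/sparql"

QUERY = """PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?presName WHERE {
  ?presName skos:subject <http://dbpedia.org/resource/Category:Presidents_of_the_United_States>.
}"""


def executable_url(query, endpoint=ENDPOINT):
    """Return a GET URL carrying the percent-encoded query (hypothetical helper)."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    # Paste the printed URL into a browser to run the query.
    print(executable_url(QUERY))
```

The encoding step matters because the raw query contains spaces, angle brackets, and a question mark, none of which can appear unescaped in a URL's query string.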
A slightly more complex query lists the name, birth date, and beginning of the first term of each one:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dbpedia2: <http://dbpedia.org/property/>
SELECT ?presName,?birthday, ?startDate WHERE {
  ?presName skos:subject <http://dbpedia.org/resource/Category:Presidents_of_the_United_States>;
            dbpedia2:birth ?birthday;
            dbpedia2:presidentStart ?startDate.
}
Click here to see it in action. Because the fielded information for the various presidents is not consistent, only 19 of them have dbpedia2:birth and dbpedia2:presidentStart fields, so you'll only see those presidents returned for this query. Wikipedia pages for all US presidents do have this information, but it's not always named the same way. For some examples, compare the DBpedia pages for Zachary Taylor and Lyndon Johnson, who doesn't show up on the list returned by that last query. As Jon said, filling out that data is part of the social process.
The real promise of Linked Data is the ability to write a program or script that grabs the data and does something with it, so I wrote a two-line batch file that:
- uses curl to send that URL to DBpedia and store the results in an XML file
- runs a short XSLT script to calculate the presidents' ages at inauguration
It doesn't look much like a two-line batch file here, so before running it, replace the first six carriage returns with spaces to turn the first seven lines into one:
curl -o presidentAges.xml -F "query=PREFIX dbpedia2: <http://dbpedia.org/property/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?presName,?birthday, ?startDate WHERE {
?presName skos:subject <http://dbpedia.org/resource/Category:Presidents_of_the_United_States>;
dbpedia2:birth ?birthday;
dbpedia2:presidentStart ?startDate.}"
http://dbpedia.org/sparql
xsltproc presidentAges.xsl presidentAges.xml
Here is the XSLT stylesheet, presidentAges.xsl:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:s="http://www.w3.org/2005/sparql-results#"
                version="1.0">

  <xsl:strip-space elements="*"/>
  <xsl:output method="text"/>

  <!-- One line of output per query result: name, space, age. -->
  <xsl:template match="s:result">
    <!-- First four characters of the birth date are the year. -->
    <xsl:variable name="birthYear"
                  select="substring(s:binding[@name='birthday']/s:literal,1,4)"/>
    <!-- Skip the 28-character http://dbpedia.org/resource/ prefix. -->
    <xsl:variable name="presidentName"
                  select="substring(s:binding[@name='presName']/s:uri,29)"/>
    <xsl:value-of select="translate($presidentName,'_',' ')"/>
    <xsl:text> </xsl:text>
    <xsl:value-of select="s:binding[@name='startDate']/s:literal - $birthYear - 1"/>
    <!-- Line break after each president. -->
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
I subtracted the birth year and then another 1 from the startDate because with inaugurations being in January (at least in modern times) I assumed that each president hadn't reached his birthday yet. Here is the result:
Abraham Lincoln 51
Andrew Johnson 56
Bill Clinton 46
Chester A. Arthur 50
Franklin Pierce 48
George H. W. Bush 64
George Washington 56
Harry S. Truman 60
James K. Polk 49
James Monroe 58
John Adams 61
John Quincy Adams 57
Martin Van Buren 54
Millard Fillmore 49
Richard Nixon 55
Rutherford B. Hayes 54
Thomas Jefferson 57
Ulysses S. Grant 46
Zachary Taylor 64
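To make the subtraction concrete, here is a small Python sketch that compares the year-only shortcut with an age computed from full dates. The function names are invented for illustration, and the birth and inauguration dates in the usage lines come from general knowledge rather than from the DBpedia result, so treat them as assumptions.

```python
from datetime import date


def shortcut_age(birth_year, start_year):
    """Year-only estimate: assumes the birthday has not yet come around."""
    return start_year - birth_year - 1


def exact_age(born, on):
    """Age in whole years on the date `on`."""
    had_birthday = (on.month, on.day) >= (born.month, born.day)
    return on.year - born.year - (0 if had_birthday else 1)


if __name__ == "__main__":
    # Birthday in August, inauguration in January: the two methods agree.
    print(shortcut_age(1946, 1993), exact_age(date(1946, 8, 19), date(1993, 1, 20)))
    # Birthday in February, inauguration in March: the shortcut is one year low.
    print(shortcut_age(1809, 1861), exact_age(date(1809, 2, 12), date(1861, 3, 4)))
```

The shortcut is exact whenever the birthday falls later in the calendar year than the inauguration, and one year low otherwise.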
I never realized that Grant was the same age as Clinton when he started—a year younger than Obama is now—but having led the army that won the US Civil War, I guess he had reasons to look a bit older at the start of his term.
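If you would rather not use XSLT, the same calculation can be sketched in Python with the standard library's XML parser. This is only a sketch under stated assumptions: the sample document is hand-written in the SPARQL Query Results XML Format with two rows, the literal values are assumed to look like the ones DBpedia returned, and the helper name `ages_at_inauguration` is made up.

```python
import xml.etree.ElementTree as ET

NS = {"s": "http://www.w3.org/2005/sparql-results#"}
RESOURCE_PREFIX = "http://dbpedia.org/resource/"

# Hand-written sample standing in for presidentAges.xml (assumed shape).
SAMPLE = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="presName"/>
    <variable name="birthday"/>
    <variable name="startDate"/>
  </head>
  <results>
    <result>
      <binding name="presName"><uri>http://dbpedia.org/resource/Abraham_Lincoln</uri></binding>
      <binding name="birthday"><literal>1809-02-12</literal></binding>
      <binding name="startDate"><literal>1861</literal></binding>
    </result>
    <result>
      <binding name="presName"><uri>http://dbpedia.org/resource/Ulysses_S._Grant</uri></binding>
      <binding name="birthday"><literal>1822-04-27</literal></binding>
      <binding name="startDate"><literal>1869</literal></binding>
    </result>
  </results>
</sparql>"""


def ages_at_inauguration(xml_text):
    """Return (name, age) pairs using the same arithmetic as the stylesheet."""
    rows = []
    for result in ET.fromstring(xml_text).iterfind(".//s:result", NS):
        # Map each variable name to the text of its uri or literal child.
        values = {b.get("name"): "".join(b.itertext()).strip()
                  for b in result.findall("s:binding", NS)}
        name = values["presName"][len(RESOURCE_PREFIX):].replace("_", " ")
        birth_year = int(values["birthday"][:4])
        rows.append((name, int(values["startDate"]) - birth_year - 1))
    return rows


if __name__ == "__main__":
    for name, age in ages_at_inauguration(SAMPLE):
        print(name, age)
```

To run it against real data, read the file that curl saved and pass its contents to `ages_at_inauguration` instead of `SAMPLE`.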
DBpedia + SPARQL is my new favorite toy, and I'm getting more and more ideas lately about useful (or at least fun) things to do with the combination.
Comments
(Note: I usually close comments for an entry a few weeks after posting it to avoid comment spam.)
Bob,
Nice demo :-)
Your juxtaposition of the Udell comments re. Parallax provides much-needed additional insight re. the utility of DBpedia and SPARQL.
Kingsley
Posted by: Kingsley Idehen | September 30, 2008 12:01 PM
> and whether there's a trend that we're getting younger presidents.
I think for that part we still need a time against age graph with a poly-fit, preferably in SVG.
Posted by: stelt | September 30, 2008 2:54 PM
Kingsley has sent me a version of this query that uses an OpenLink extension to do the age calculation as part of the SPARQL query, without the need for the XSLT part. Click http://tinyurl.com/48l6c6 to see the query and execute it. Very cool.
Posted by: Bob DuCharme | October 1, 2008 10:48 AM
Hi Bob,
Fantastic stuff! The issue you were having with different RDF properties for the same relation will be solved shortly. I'll release a new version of the infobox dataset (based on a new extraction approach) in the next few days.
Cheers, Georgi
Posted by: Georgi Kobilarov | October 5, 2008 7:22 AM
Due to an underlying engine change and a DBpedia instance update, here is the revised query:
SELECT ?presName, ?birthday, ?startDate, (bif:datediff("year", ?birthday, xsd:date(bif:sprintf("%d-01-20", ?startDate)))) as ?age_at_innaguration
WHERE {?presName skos:subject <http://dbpedia.org/resource/Category:Presidents_of_the_United_States> ;
dbpedia2:birth ?birthday;
dbpedia2:presidentStart ?startDate.
filter (datatype(?startDate) = xsd:integer)
}
Live Link: http://tinyurl.com/4edjzl
Posted by: Kingsley Idehen | October 5, 2008 10:36 AM