Mapping relational data to RDF with D2RQ
Getting more URIs into your triples' objects, and why this is good.
Last week I mentioned the role that D2RQ played in a project I was working on, and I wanted to write a little more about this RDBMS/RDF interface if it's any help to people who may use it. D2RQ is free, and it's easy to use in its default setup, but I'm finding that the further you stray from the default setup, the more you can do with it.
The main mistake I made at the beginning was that when I created a small dummy database for my first mapping attempt, I didn't declare a key value for the database's only table. SQL doesn't require this, but D2RQ does, and for a very good reason: the key value becomes the subject of that triples that express the data values in each row of the table. The D2RQ mapping file lets you configure the URIs used in the triples, so a row from a book publishing database table where the ISBN column is the key and has a value of "0553213113" and the title column has a value of "Moby Dick" can generate a (subject, predicate, object) triple of (<http://foo-url/isbn/0553213113>, <http://bar-url/title>,"Moby Dick").
While RDF subjects and predicates must be URIs, objects can be string values like "Moby Dick", but they can also be URIs. A URI as an object can serve as the subject of other triples, letting you link up information to get more value out of it. (More on this in an article I did for Dr. Dobb's.)
D2RQ turns most database values into literal strings when it plugs them into the "object" slot of an RDF triple statement, but it does know one kind of value in a relational database that's better off being represented as a URI when it's the object of a triple: foreign key values. For example, let's say that in a sales database you set the custID column of an orders table to be a foreign key referencing the custID column of the customers table, because you don't want any orders entered with custID values that aren't in your list of customers. When D2RQ asks your database package about that database, upon finding out that orders.custID is a foreign key, it sets up the mapping file so that RDF triples showing an order's custID value will represent it in a triple's object as as a URI, not a as string like the "Moby Dick" object of the triple shown above. With custID as the object of triples from the orders table and the subject of triples from the customers table (and similarly, with the orders table's second foreign key column itemID as the object of other triples from that table and the subject of additional triples from the items table), the right SPARQL query can show us that order o003, in which customer c002 ordered item i004, really means that customer John Lennon bought an Epiphone Casino guitar. (When I create a customer database of four names, I pick four obvious ones.)
Of course, this is even easier in SQL—it's what relational databases are designed for—but the point of my exercise is to set up an RDF version of relational data so that I can see if OWL lets us do useful things with the data that I couldn't do when the data was strictly relational. Which leads to the next D2RQ objects-as-URIs trick: the mapping file's d2rq:uriPattern property.
In a comment on my last posting, Richard Cyganiak suggested that setting an e-mail address field in my address book databases to be an inverse functional property would add metadata that gave more value to the database. To cut to the chase, it worked; because of this property, Pellet can now tell that Bobby Fisher of 2304 Eighth Lane and Robert L. Fisher of 2304 8th Ln. are the same person. (Despite my use of Beatle names in four-person databases, I swear that my random personal data generator just happened to come up with the name of the wacky former chess champion.) The identification of potentially redundant names is a key feature of some expensive products out there—you don't like getting two copies of the same catalog with your name spelled slightly differently, and the retailer doesn't want to pay for sending you two. So, the case for the value of owl:InverseFunctionalProperty is pretty clear.
When you're using OWL DL and you want to say that a property is an inverse functional property, the value must be a URI and not a literal string value. (Explanations of the differences between OWL Lite, OWL DL, and OWL Full have too much knowledge representation jargon for me to remember those differences, but since OWL DL is more powerful than Lite and Full is apparently difficult to develop software for, the OWL software developers' Goldilocks approach of going with the one in the middle has me working with OWL DL.) While D2RQ can tell on its own that foreign key values should be represented as URIs, it needs to be told explicitly if you want e-mail addresses to be represented as URIs, and this means a few simple changes to the mapping file generated by D2RQ's generate-mapping utility.
The following shows the map file entry for the email1 field of the entries table in my MySQL eudora database. The mapping files use the N3 dialect of RDF. The first four lines were generated by D2RQ's generate-mapping utility. I commented out the fourth line, which declares email.entries to be a data property, and added the two new lines below it. The d2rq:uriPattern line says that entries.email1 should be treated as a URI, and it plugs that value into a string beginning with "mailto:" so that it really is a URI. The line after that says to only bother with this mapping if the entries.email1 value is not equal to an empty string. (the "<>" here means "not equal to", so don't confuse it with more common XML and N3 use of the pointy brackets.) Without this d2rq:condition property, D2RQ creates a URI of just "mailto:" for blank email1 values. Pellet, with good reason, doesn't like these.
emap:entries_email1 a d2rq:PropertyBridge; d2rq:belongsToClassMap emap:entries; d2rq:property eud:entries_email1; # d2rq:column "entries.email1"; d2rq:uriPattern "mailto:@@entries.email1@@"; d2rq:condition "entries.email1 <> ''"; .
Once the email1 values are treated as URIs, SWOOP (one of the Goldilocks software packages I mentioned earlier) lets me set email1 to be an inverse functional property. As handy as the SWOOP interface is, it's not that difficult to add the bolded line below to your OWL ontology using a text editor:
<owl:ObjectProperty rdf:about="http://localhost:2020/resource/eudora/entries_email1"> <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#InverseFunctionalProperty" /> <rdfs:subPropertyOf rdf:resource="http://localhost:2020/resource/entries/email" /> </owl:ObjectProperty>
(By the way, tools like SWOOP and Protégé are great for automating the writing and editing of RDF/OWL code, but only because the code is so verbose and redundant, not because it's particularly difficult to understand. When I hear W3C Schema and XBRL advocates say "sure, the syntax is convoluted, but don't worry about it, because the tools will take care of it" little alarms go off in my head—if I can't understand the syntax used to model some information, I don't want to have to take it on faith that the model is good. There's no reason for RDF/OWL syntax to set off such alarms, and using these tools to generate the syntax and then reviewing the syntax is a great way to learn that syntax, but ultimately, you'll get more work done more quickly if you use the tools.)
To cut back to the chase I mentioned earlier, following these steps to designate email1 as an inverse functional property made it possible for Pellet to know that because the Bobby Fisher of 2304 Eighth Lane and the Robert L. Fisher of 2304 8th Ln in my MySQL database have the same email address, they're the same person—giving me one more example of how RDF/OWL can help me to get more out of a traditional relational database. I'd love to hear more suggestions for things I can add to this ontology that let SPARQL queries find out things that straight SQL queries could not get from the same data.