Integrating relational databases with RDF/OWL
Done, if on a fairly small scale.
I recently asked about the use of RDF/OWL to integrate databases, especially relational databases. The posting received many good comments, but no pointers to the kind of simple example I was hoping to find, so I've managed to create one myself.
I loaded two different single-table address book databases into MySQL, one based on Outlook's structure and field names and the other based on Eudora's. (I've already written here about how I generated the data.) I used D2RQ to treat the two databases as RDF, and I used SWOOP to generate an ontology with equivalence rules declaring, for example, that the eud:entries_workState field from one database is equivalent to the ol:entries_businessState field from the other.
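For illustration, that equivalence rule boils down to a single triple. Here it is in Turtle, with placeholder namespace URIs standing in for whatever D2RQ actually generated for my two databases:

    @prefix owl: <http://www.w3.org/2002/07/owl#> .
    @prefix eud: <http://example.com/eudora/vocab#> .    # placeholder namespace
    @prefix ol:  <http://example.com/outlook/vocab#> .   # placeholder namespace

    # The Eudora database's work-state field and the Outlook database's
    # business-state field mean the same thing.
    eud:entries_workState owl:equivalentProperty ol:entries_businessState .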
To find everyone in the address book with a business address in New York state, a SPARQL query run with Pellet can ask for all entries for which eud:entries_workState="NY" and also get the entries from the other database for which ol:entries_businessState="NY". By defining all the different phone number properties (home, work, mobile, fax, etc.) as subproperties of "phone", I can also query for someone's phone numbers without knowing exactly which kinds of phone numbers the database has for that person, and see them all listed. To me, both cases demonstrate how metadata can add value to data: they let me get answers to practical questions about my data that are more complete than they would have been without the metadata.
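As a sketch of the second case, assuming the generic superproperty lives in a placeholder ab: namespace as ab:phone, a query like this one, run with a reasoner such as Pellet, returns every phone number regardless of which database column it came from:

    PREFIX ab: <http://example.com/addressbook#>   # placeholder namespace for the generic property

    # Because home, work, mobile, and fax numbers are all declared as
    # subproperties of ab:phone, inference makes each of them match here.
    SELECT ?entry ?phone
    WHERE {
      ?entry ab:phone ?phone .
    }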
To use D2RQ as an interface to a MySQL database, you first run its utility that reads the database catalog and generates a mapping file. Then you start the D2RQ server with the mapping file as a parameter, so that when you issue SPARQL queries against the server, it can map them to the appropriate SQL queries to pull the data out. (I did this all on a Windows machine, by the way, and have no reason to believe that any of it would be different on a Linux machine.) To integrate two databases, I generated a mapping file for each and then combined the two mapping files into one, which I used when I started the D2RQ server before making my Pellet queries. I had to rename some namespace prefixes, but the mapping file syntax was pretty easy to understand, as is the spec that describes it. I'd like to especially thank Richard Cyganiak, who patiently answered my questions on the d2rq-map-devel mailing list. (I'm looking forward to checking out the new features of the D2RQ upgrade, which I just learned about this morning.)
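To give a flavor of that syntax, here is roughly what one property bridge from a generated mapping file looks like after renaming a prefix. The column name and namespace URI are placeholders, and the exact output will vary with the D2RQ version, so treat this as a sketch rather than a copy of my file:

    @prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .
    @prefix map:  <#> .
    @prefix eud:  <http://example.com/eudora/vocab#> .   # renamed prefix, placeholder URI

    # Tells the D2RQ server to expose the workState column of the
    # entries table as the eud:entries_workState property.
    map:entries_workState a d2rq:PropertyBridge ;
        d2rq:belongsToClassMap map:entries ;
        d2rq:property eud:entries_workState ;
        d2rq:column "entries.workState" .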
To query the data, I have a script pull the latest data from the D2RQ server into a file, add the ontology information (which I created with SWOOP, as described below), and then query the combined file with Pellet. Integrating additional RDF-based sources would be easy, whether they come via D2RQ or not; just add them to the same file before querying. This probably won't scale very far, and some digging into the Pellet and D2RQ APIs should make it possible to integrate them more closely for this kind of querying. My next step is to use this same routine to integrate some multi-table databases and to get some non-string data in there. At some point I'll write up how I did all this in more detail and make all the files available so that someone can reproduce it on their own. Meanwhile, I'll probably propose it as a conference topic someplace.
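One simple way to do that pull, which isn't necessarily how any given script has to do it, is a catch-all CONSTRUCT query sent to the D2R server's SPARQL endpoint:

    # Sent to the D2R server's SPARQL endpoint, this dumps every triple;
    # the result goes into a file, the ontology definitions are appended,
    # and Pellet queries the combined file.
    CONSTRUCT { ?s ?p ?o }
    WHERE     { ?s ?p ?o }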
I love how, when you pick "Load" and then "Ontology" from SWOOP's File menu and load an RDF file with no ontology information defined, SWOOP declares all the predicates it finds as properties. You can save the file without doing anything else, open it in a text editor, and see the properties all defined there in RDF/OWL syntax. Using SWOOP, you can then specify equivalence and property/subproperty relationships just by pointing and clicking, and then save again. After doing this with the combined address book databases, I pulled the ontology definitions out into a separate file so that when the relational data was updated, I could use D2RQ to pull updated RDF and apply the same ontology definitions when querying the new data.
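Shown here in Turtle for readability, and with a hypothetical field name and a placeholder ab: namespace, that separate ontology file ends up containing little more than declarations and assertions along these lines:

    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix ab:   <http://example.com/addressbook#> .    # placeholder namespace
    @prefix eud:  <http://example.com/eudora/vocab#> .   # placeholder namespace

    # SWOOP's first pass just declares each predicate it finds:
    eud:entries_homePhone a owl:DatatypeProperty .       # hypothetical field name

    # The pointing and clicking adds assertions like this one:
    eud:entries_homePhone rdfs:subPropertyOf ab:phone .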
I'd love to hear suggestions about additional OWL constructs that can let queries pull information from a database that they couldn't have found without that OWL metadata. This last qualification is important: for example, while I understand the domain and range concepts (at least with pizza; PDF), I'd like to know how defining the domain or range of a property could let a query do more than it could before those were defined.
For now, I'm pointing and clicking with a free tool to define an ontology that adds value to existing data, and I'm psyched.
Comments
(Note: I usually close comments for an entry a few weeks after posting it to avoid comment spam.)
Hello, good article (and I'm glad I stumbled across it on planet RDF). My question is, consider you have a client-side application that sends a large number of queries to the D2RQ server using SPARQL. Have you given any thought to there being some caching layer in between the relational database and the D2RQ server? Or does this already exist? Alternatively, is the response speed good enough such that a cache is not overly necessary?
I was previously considering a solution where a Jena OWL ontology was populated from a database, which would act as a cache. It would still handle queries using SPARQL however. This blog post suggests a slightly different approach that I would be interested in learning more about.
Cheers,
Jonathan Giles.
Posted by: Jonathan Giles | October 30, 2006 4:10 PM
Cool stuff – so this is what you've been up to. We've been looking at integrating D2RQ instances through federated SPARQL queries, and it's nice to see an inference-based approach working as well.
Things you could do with OWL, related to data integration: with IFPs (inverse functional properties), you can infer that people with the same email address from two different DBs are the same person. With subclass relationships, you can infer that a tech report from DB1 and a press release from DB2 are both documents, and therefore queries for documents will return both.
Jonathan: If you have enough memory, then loading everything into an in-memory model before doing inference will always be much faster. Databases are OK for doing SPARQL queries, but are no good for doing OWL inference.
Posted by: Richard Cyganiak | October 30, 2006 4:46 PM
Thanks Richard! Jonathan--my plan is to continue my research more horizontally than vertically, i.e. to investigate the potential role of other tools before I try to add a lot of scale and efficiency to the system created with this particular set of tools. Mostly, I want to push the possible role that OWL can play in this system.
Posted by: Bob DuCharme | October 30, 2006 5:14 PM