Pulling RDF out of MySQL

With a command line option and a very short stylesheet.

When I wrote the blog posting My SQL quick reference last month, I showed how you can pass an SQL query to MySQL from the operating system command line when starting up MySQL, and also how adding a -B switch requests a tab-separated version of the data. I did not mention that -X requests it in XML, and that this XML is simple enough that a fifteen-line XSLT 1.0 stylesheet can convert any such output to RDF.

I've written before about how tools like the open source D2RQ and Capsenta's Ultrawrap provide middleware layers that let you send SPARQL queries to relational databases--and to combinations of relational databases from different vendors, which is where the real fun begins. This command line stylesheet trick gives you a simpler, more lightweight way to pull the relational data you want into an RDF file where you can use it with SPARQL or any other RDF tool.

If you have MySQL and xsltproc installed, you can do it all with a single command at the operating system prompt:

mysql -u someuser --password=someuserpw -X -e 'USE employees; SELECT * FROM employees LIMIT 5' | xsltproc mysql2ttl.xsl -

(Two notes about that command line: 1. Don't miss the hyphen at the very end, which tells xsltproc to read from standard input. 2. I added the LIMIT part for faster testing because the employees table has 30,024 rows. To come up with that number, I had to look at my last blog entry to remember how to count the table's rows, so writing out that quick reference has already paid off for me.) The XML returned by MySQL looks like this, with data from subsequent rows following a similar pattern:

  <resultset statement="SELECT * FROM employees LIMIT 5"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <row>
	<field name="emp_no">10001</field>
	<field name="first_name">Georgi</field>
	<field name="last_name">Facello</field>
	<field name="birth_date">1953-09-02</field>
	<field name="gender">M</field>
	<field name="hire_date">1986-06-26</field>
	<field name="department">Development</field>
  </row>

I thought the inclusion of the query as an attribute of the resultset element was a nice touch. The following XSLT stylesheet converts any such XML to Turtle RDF; you'll want to adjust the prefix declarations to use URIs more appropriate to your data:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text"/>

<xsl:template match="resultset">
  @prefix v: &lt;http://learningsparql.com/ns/myVocabURI/> . 
  @prefix d: &lt;http://learningsparql.com/ns/myDataURI/> . 
      <xsl:apply-templates/>
    </xsl:template>

        <xsl:template match="row">
d:<xsl:value-of select="count(preceding-sibling::row) + 1"/> 
          <xsl:apply-templates/> . 
        </xsl:template>

    <xsl:template match="field">
      v:<xsl:value-of select="@name"/> "<xsl:value-of select="."/>" ;
    </xsl:template>

</xsl:stylesheet>

The result includes some extra blank lines that I could suppress with xsl:text elements wrapping certain bits of the stylesheet, but a Turtle parser doesn't care, so neither do I:

  d:1
	
      v:emp_no "10001" ;
    
	
      v:first_name "Georgi" ;
    
	
      v:last_name "Facello" ;
    
	
      v:birth_date "1953-09-02" ;
    
	
      v:gender "M" ;
    
	
      v:hire_date "1986-06-26" ;
    
	
      v:department "Development" ;
    
   .
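
If you did want tidier output, here is a rough sketch of that xsl:text approach; nothing in the conversion requires it. It assumes that you add the xsl:strip-space element as a child of xsl:stylesheet, so that the whitespace between the input's row and field elements gets dropped, and that these two template rules replace the row and field rules shown earlier. The two-space indent and the placement of the line breaks are just one possible layout:

```xslt
<!-- Top-level element: discard whitespace-only text nodes from the input XML. -->
<xsl:strip-space elements="*"/>

<!-- Same logic as the original row rule, with all literal text inside xsl:text. -->
<xsl:template match="row">
  <xsl:text>d:</xsl:text>
  <xsl:value-of select="count(preceding-sibling::row) + 1"/>
  <xsl:text>&#10;</xsl:text>
  <xsl:apply-templates/>
  <xsl:text>.&#10;</xsl:text>
</xsl:template>

<!-- One predicate-object pair per line, indented two spaces. -->
<xsl:template match="field">
  <xsl:text>  v:</xsl:text>
  <xsl:value-of select="@name"/>
  <xsl:text> "</xsl:text>
  <xsl:value-of select="."/>
  <xsl:text>" ;&#10;</xsl:text>
</xsl:template>
```

Because an XSLT processor ignores whitespace-only text in the stylesheet itself unless it is inside xsl:text, the line breaks between the instructions above add nothing to the output; only the explicit &#10; references do.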

You can customize the stylesheet for specific input data. For example, the URIs in your triple subjects could build on an ID value selected from the data instead of building on the position of the XML row element, as I did. As another customization, instead of outputting all triple objects as strings, you could insert this template rule into the XSLT stylesheet to output the two date fields typed as actual dates, as long as you remembered to also add an xsd prefix declaration at the top of the stylesheet:

    <xsl:template match="field[@name='birth_date' or @name='hire_date']">
      v:<xsl:value-of select="@name"/> "<xsl:value-of select="."/>"^^xsd:date ;
    </xsl:template>
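
The other two changes mentioned above, the xsd prefix declaration and subject URIs built from an ID value, might look something like the following sketch. It assumes that emp_no uniquely identifies each row, and the "emp" string in front of the number is purely illustrative. These two rules would replace the generic resultset and row rules:

```xslt
<xsl:template match="resultset">
  @prefix v: &lt;http://learningsparql.com/ns/myVocabURI/> .
  @prefix d: &lt;http://learningsparql.com/ns/myDataURI/> .
  @prefix xsd: &lt;http://www.w3.org/2001/XMLSchema#> .
  <xsl:apply-templates/>
</xsl:template>

<!-- Subject built from the emp_no value instead of the row's position.
     The "emp" part of the local name is a made-up convention. -->
<xsl:template match="row">
d:emp<xsl:value-of select="field[@name='emp_no']"/>
  <xsl:apply-templates/> .
</xsl:template>
```

The date rule needs no extra help to win out over the generic field rule: under XSLT 1.0's default priorities, a match pattern with a predicate gets a priority of 0.5 and a plain element name gets 0.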

Or, you could leave the XSLT stylesheet in its generic form and convert the data types using a SPARQL query further down your processing pipeline with something like this:

PREFIX v: <http://learningsparql.com/ns/myVocabURI/> 
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

CONSTRUCT {
  ?row v:birth_date ?bdate ;
       v:hire_date ?hdate . 
}
WHERE {
  ?row v:birth_date ?bdateString ;
       v:hire_date ?hdateString . 
  BIND(xsd:date(?bdateString) AS ?bdate)
  BIND(xsd:date(?hdateString) AS ?hdate)
}

However you choose to do it, the nice thing is that you have lots of options for grabbing the massive amounts of data stored in the many MySQL databases out there and then using that data as triples with a variety of lightweight, open source software.

bobdc.blog

Bob DuCharme's weblog, mostly on technology for representing and linking information.
