Pipelining SPARQL queries in memory with the rdflib Python library

Using retrieved data to make more queries.

Last month in Dividing and conquering SPARQL endpoint retrieval I described how you can avoid timeouts for certain kinds of SPARQL endpoint queries by first querying for the resources that you want to know about and then querying for more data about those resources a subset at a time using the VALUES keyword. (The example query retrieved data, including the latitude and longitude, about points within a specified city.) I built my demo with some shell scripts, some Perl scripts, and a bit of spit and glue.
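
As a concrete (and purely hypothetical) illustration of that divide-and-conquer step, here is roughly what one of those follow-up queries could look like, with made-up ex: URIs standing in for one batch of the resources that the first query retrieved:

# Hypothetical example of one batched follow-up query; the ex: URIs are
# placeholders for a subset of the resources found by the first query.
queryOneBatch = """
PREFIX ex: <http://example.com/>
CONSTRUCT { ?point ?p ?o }
WHERE {
  VALUES ?point { ex:point1 ex:point2 ex:point3 }
  ?point ?p ?o .
}
"""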

I started playing with RDFLib's SPARQL capabilities a few years ago as I put together the demo for Driving Hadoop data integration with standards-based models instead of code. I was pleasantly surprised at how easily it could run a CONSTRUCT query on triples stored in memory and then pass the result on to one or more additional queries, letting you pipeline a series of such queries with no disk I/O. Applying this approach to replace last month's shell and Perl scripts showed me that the same techniques could drive all kinds of RDF applications.
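
As a minimal sketch of that pipelining idea (made-up data, not the script described below), you can run a CONSTRUCT query on one in-memory graph, add the constructed triples to a second graph, and then query the second graph, all without touching the disk:

# A minimal pipelining sketch with made-up data: a CONSTRUCT query on one
# in-memory graph feeds a second graph, which a SELECT query then reads.
import rdflib

g = rdflib.Graph()
g.parse(data="""
@prefix ex: <http://example.com/> .
ex:a ex:knows ex:b .
ex:b ex:knows ex:c .
""", format="turtle")

constructQuery = """
PREFIX ex: <http://example.com/>
CONSTRUCT { ?x ex:indirectlyKnows ?z }
WHERE { ?x ex:knows ?y . ?y ex:knows ?z }
"""

stage2 = rdflib.Graph()
for triple in g.query(constructQuery):   # a CONSTRUCT result iterates as triples
    stage2.add(triple)

selectQuery = """
PREFIX ex: <http://example.com/>
SELECT ?x ?z WHERE { ?x ex:indirectlyKnows ?z }
"""
for row in stage2.query(selectQuery):
    print("%s indirectly knows %s" % (row.x, row.z))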

When I was at TopQuadrant I got to know SPARQLMotion, their (proprietary) drag-and-drop system for pipelining components that can do this sort of thing. RDFLib offers several graph manipulation methods that can extend what I've done here to do many additional SPARQLMotion-ish things. When I recently asked about other pipeline component-based RDF development tools out there, I learned of Linked Pipes ETL, Karma, ld-pipeline, VIVO Harvester, Silk, UnifiedViews, and a PoolParty framework around UnifiedViews. I hope to check out as many of them as I can in the future, but with the functions I've written for my new Python script, I can now accomplish so much with so little Python code that my motivation to look further is diminishing--especially because, when doing it this way, I have all of Python's abilities to manipulate strings and data structures standing by in case I need them.

For me, the two most basic RDF tasks that augment Python's general capabilities are retrieving triples from a remote endpoint for local storage and querying those locally stored triples. RDFLib makes the latter easy. For the former I went looking for a library, but Jindřich Mynarz pointed out that no specialized library was necessary; he even showed me the basic code to make it happen. (I swear I had tried a few times before posting the question on Twitter, so the brevity and elegance of his example were a bit embarrassing for me.)
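
The retrieval pattern, as used in the script below, boils down to URL-encoding the query onto the endpoint's address and letting rdflib parse whatever triples come back. Here is a minimal sketch with a placeholder endpoint URL and query:

# Minimal retrieval sketch; the endpoint URL and query are placeholders.
import urllib   # Python 2; with Python 3, use urllib.parse.urlencode instead
import rdflib

endpoint = "http://example.com/sparql"
query = "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } LIMIT 10"

g = rdflib.Graph()
url = endpoint + "?" + urllib.urlencode({"query": query})
g.parse(url)   # requests the URL and parses the returned triples into g
print(str(len(g)) + " triples retrieved")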

You can find my new Python script, which replaces last month's work, on GitHub. More than half of it consists of the actual SPARQL queries stored in variables. This is a good thing, because it means that the Python instructions (to retrieve triples from the endpoint, to load the local graph with the retrieved triples, to query that graph, and to build and then run new queries based on those query results) together take up less than half of the script. In other words, the script is more about the queries than about the code that executes them.

The main part of the script isn't very long:

# 1. Get the qnames for the geotagged entities within the city and store in graph g. 

queryRetrieveGeoPoints = queryRetrieveGeoPoints.replace("CITY-QNAME",cityQname)
url = endpoint + "?" + urllib.urlencode({"query": queryRetrieveGeoPoints})
g.parse(url)
logging.info('Triples in graph g after queryRetrieveGeoPoints: ' + str(len(g)))

# 2. Take the subjects in graph g and create queries with a VALUES clause 
#    of up to maxValues of the subjects. 

subjectQueryResults = g.query(queryListSubjects)
splitAndRunRemoteQuery("querySubjectData",subjectQueryResults,
                       entityDataQueryHeader,entityDataQueryFooter)

# 3. See what classes are used and get their names and those of their superclasses.
classList = g.query(listClassesQuery)
splitAndRunRemoteQuery("queryGetClassInfo",classList,
                       queryGetClassesHeader,queryGetClassesFooter)

# 4. See what objects need labels and get them.
objectsThatNeedLabel = g.query(queryObjectsThatNeedLabel)
splitAndRunRemoteQuery("queryObjectsThatNeedLabel",objectsThatNeedLabel,
                       queryGetObjectLabelsHeader,queryGetObjectLabelsFooter)

print(g.serialize(format = "n3"))   # (Actually Turtle, which is what we want, not n3.)
         

The splitAndRunRemoteQuery function was one I wrote based on my prototype from last month.
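
The real thing is in the script on GitHub; as a rough sketch of the idea (not the actual implementation, and leaning on the script's module-level endpoint, maxValues, g, urllib, and logging names), it does something like this:

# Rough sketch only: batch the URIs returned by a local query into VALUES
# clauses of at most maxValues entries, wrap each batch in the supplied query
# header and footer, run it against the remote endpoint, and parse the
# returned triples into graph g.
def splitAndRunRemoteQuery(queryName, localResults, queryHeader, queryFooter):
    values = ["<" + str(row[0]) + ">" for row in localResults]
    for start in range(0, len(values), maxValues):
        remoteQuery = queryHeader + " ".join(values[start:start + maxValues]) + queryFooter
        url = endpoint + "?" + urllib.urlencode({"query": remoteQuery})
        g.parse(url)
        logging.info(queryName + ": graph g now has " + str(len(g)) + " triples")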

I first used RDFLib over 15 years ago, when SPARQL hadn't even been invented yet. Hardcore RDFLib fans will prefer the greater efficiency of its native functions to the use of SPARQL queries, but my goal here was to have SPARQL 1.1 queries drive all the action, and RDFLib supports this very nicely. Its native functions also offer additional capabilities that bring it closer to some of the pipelining things I remember from SPARQLMotion. For example, its set operations let you take the union, intersection, difference, or XOR of graphs, which can be handy when mixing and matching data from multiple sources to massage it into a single cleaned-up dataset--just the kind of thing that makes RDF so great in the first place.
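
For instance, here is a small standalone illustration (not part of the script above) of those set operations; rdflib's Graph class overloads +, -, *, and ^ for union, difference, intersection, and XOR:

# Standalone illustration of rdflib's set operations on graphs.
import rdflib

turtle1 = "@prefix ex: <http://example.com/> . ex:a ex:p ex:b . ex:c ex:p ex:d ."
turtle2 = "@prefix ex: <http://example.com/> . ex:a ex:p ex:b . ex:e ex:p ex:f ."

g1 = rdflib.Graph()
g1.parse(data=turtle1, format="turtle")
g2 = rdflib.Graph()
g2.parse(data=turtle2, format="turtle")

print(len(g1 + g2))   # union: 3 triples
print(len(g1 - g2))   # difference: the 1 triple only in g1
print(len(g1 * g2))   # intersection: the 1 triple in both
print(len(g1 ^ g2))   # XOR: the 2 triples in exactly one graph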

Picture by Michael Coghlan on Flickr (CC BY-SA 2.0)

