Some use cases to implement using SPARQL graphs

Or not; I'm open to suggestions.

As I wrote in my last entry, I've recently figured out how to assign metadata to RDF graphs and to perform SPARQL queries on sets of those graphs. I'm working a bit backwards here, because I'm now moving on to the use cases that got me thinking about this in the first place. It's easier to think about them now that I know that I can implement them using standard syntax and multiple open source implementations of that standard. I wanted to outline my ideas about how to implement these use cases to see if they sound particularly good or bad to others. They're general enough that they'll apply to other situations.

Simple aggregation of distributed data

Let's say I have a collection of RDF data that mirrors several sets of data on the Internet. I want to query the aggregate set without retrieving every set from its original source with every query. The data isn't very time-sensitive, so updating the central collection once every 24 hours is fine. "Updating" is the key operation here: if someone deletes a triple from one of the satellite collections, I want to be confident that it won't be in my aggregate collection the next day. Here's what I would do.

I name each graph in my internal collection after the source of its triples. To update the data from source http://www.greatdata.org/latest.rdf, a cron job does the following at 3:14 AM each morning (a rough SPARQL version of the three steps appears after the list):

  1. Delete the triples in the http://www.greatdata.org/latest.rdf graph in my collection.

  2. Load the latest data from http://www.greatdata.org/latest.rdf into the graph with that name in my collection.

  3. Add some triples like the following to a graph dedicated to tracking such downloads:

    <http://www.greatdata.org/latest.rdf>
      <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#fileLastAccessed>
      "2009-03-15T03:14:52-0500" .

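If my triplestore supported something like the proposed SPARQL Update language, the three steps might look roughly like the following sketch. The tracking graph name http://www.snee.com/ns/downloads is just something I made up for the example, and I'm assuming the store lets me DROP, LOAD, and INSERT DATA against named graphs:

    # 1. Delete the old copy of this source's data.
    DROP SILENT GRAPH <http://www.greatdata.org/latest.rdf> ;

    # 2. Load the latest data into the graph named after its source.
    LOAD <http://www.greatdata.org/latest.rdf>
      INTO GRAPH <http://www.greatdata.org/latest.rdf> ;

    # 3. Record the download in a graph dedicated to tracking such downloads
    #    (the tracking graph name here is a made-up example).
    INSERT DATA {
      GRAPH <http://www.snee.com/ns/downloads> {
        <http://www.greatdata.org/latest.rdf>
          <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#fileLastAccessed>
          "2009-03-15T03:14:52-0500" .
      }
    }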
Wiping out a set of data and completely replacing it will only scale up to a certain point, and a SPARQL Update capability will be a better way to implement certain variations on this. If the total aggregate size is just a few dozen megabytes, though, the general approach above makes sense to me. Does it look horribly wrong to anyone else?

Identifying a triple's provenance

This time, instead of replacing each graph with a more up-to-date version, I want to aggregate all the downloaded data as it accumulates. I assign each downloaded batch its own graph URL and attach metadata to this new graph such as the source, date, and time of the retrieval. I could also assign it rdfg:subGraphOf values, depending on which sets of graphs I define for querying, updating, and access control purposes.
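For example, the metadata recorded for one downloaded batch might look like the following sketch. The batch's graph name and the graph set URI in the rdfg:subGraphOf value are made up for the example; I'm also assuming that dc: is the Dublin Core element set and that the graph metadata itself lives in the default graph:

    PREFIX dc:   <http://purl.org/dc/elements/1.1/>
    PREFIX rdfg: <http://www.w3.org/2004/03/trix/rdfg-1/>

    # One downloaded batch: where it came from, when it was retrieved,
    # and which larger graph set it belongs to.
    INSERT DATA {
      <http://www.snee.com/ns/graphids#i18C40B>
        dc:source <http://www.greatdata.org/latest.rdf> ;
        dc:date   "2009-03-15T03:14:52-0500" ;
        rdfg:subGraphOf <http://www.snee.com/ns/graphsets#dailyDownloads> .
    }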

To move on to a usage scenario, let's say that Kendall Clark queries a service on snee.com and finds this triple:

    <http://clarkparsia.com/weblog/2008/10/31/we-won/> dc:creator "Bijan Parsia" .

He contacts me and says "Bijan didn't write that! I did! Where are you getting this data?" I check and see that this triple is part of the named graph http://www.snee.com/ns/graphids#i23F2A9, so I query the metadata associated with this named graph and find this:

    <http://www.snee.com/ns/graphids#i23F2A9>
      dc:date "2008-11-01T17:37:00" ;
      dc:source <http://planetrdf.com/index.rdf> .

I tell Kendall that I got that triple from the Planet RDF RSS feed at 5:37 PM GMT on November 1st.
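The check itself can be a single SPARQL query that finds whichever named graph holds the disputed triple and pulls back that graph's metadata in one pass. As above, I'm assuming that dc: is the Dublin Core element set and that the graph metadata lives in the default graph:

    PREFIX dc: <http://purl.org/dc/elements/1.1/>

    # Which named graph holds the disputed triple, and where and when
    # were that graph's contents downloaded?
    SELECT ?graph ?source ?date
    WHERE {
      GRAPH ?graph {
        <http://clarkparsia.com/weblog/2008/10/31/we-won/>
          dc:creator "Bijan Parsia" .
      }
      ?graph dc:source ?source ;
             dc:date   ?date .
    }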

Again, does the general outline of what I describe here make sense, or would there be a better way to approach it?

3 Comments

Hi Bob,

Jeni wrote an interesting piece on this the other day. Lots of relevant comments, too.
Not speaking of my comment, of course! ;-) The reason I'm referring to it, though, is that I put my ideas on using the HTTP vocabulary in there, which I think are relevant to your second use case. It's restricted to cases where you dereference HTTP URIs, but in those cases it gives you very detailed control.
Other relevant vocabularies: http://tw.rpi.edu/2008/sw/archive.owl# http://web.resource.org/rss/1.0/modules/syndication/ http://wiki.foaf-project.org/ScutterVocab
Hmm, lots of links. Usually that's the point where my comment ends up in the spam box. ;-)

Simon


Thanks Simon! I've certainly been following Jeni's work there. The HTTP vocabulary looks very useful, although a namespace prefix of http is bound to cause confusion: an end tag with /http:httpVersion between the <> could look pretty strange to people who aren't hardcore markup geeks.

The other vocabularies also look useful, but I have to wonder if some spokes of the Library of Congress MARC-based metadata wheels (e.g. METS, EAD) got reinvented in there. If so, they'll make great demos for OWL equivalency predicates...

Bob


Your description here of provenance and named graphs is good... but I don't think I like named graphs as the solution to provenance, because any given triple could flow through many graphs. Tracking down the genesis of a triple is more or less archeology, or even worse, like trying to track the ownership of a penny.

What if each triple were really a quadruple, the 4th item being a URI for the authority that gave birth to the statement? That URI might also point to a thing that is an instance of a "provenance node," with more stuff like the time, the person who asserted the fact, or whether it was inferred.

