Querying my own MP3, image, and other file metadata with SPARQL

And a standard part of Ubuntu.

Ubuntu has a utility called Tracker that makes it easy to search your hard disk, a bit like the old Google Desktop with a few extra features. One extra feature ranks among the coolest SPARQL applications I've ever seen: the ability to execute SPARQL queries against data extracted from files on your hard disk.

[Image: anarchy paper lantern]

To install it, I ran sudo apt-get install tracker-gui to get the base parts of Tracker and then installed tracker-utils the same way to get the SPARQL query utility. Next, I added the Ubuntu applications "Desktop search" and "search and indexing" and used the latter to index 94 GB of MP3s and some image files. The indexing took a few hours. (tracker-control -S was a handy command for checking on the indexing progress.) The worldofgnome.org page Indexing preferences in GNOME 3.8 was helpful for understanding the indexing options.

Once the file metadata is indexed, the tracker-sparql command-line utility lets you query it. For example, the following runs the query stored in bea.spq against the metadata:

tracker-sparql -f bea.spq

(The tracker-sparql help said that I was also supposed to include -q to show that it was a SPARQL query, but it seemed to work fine without this command line switch.) The following shows bea.spq, a query for artist names that begin with "Bea", allowing for an optional "The " before that:

PREFIX nmm: <http://www.tracker-project.org/temp/nmm#>
SELECT DISTINCT ?artistName WHERE {
  ?artist a nmm:Artist .
  ?artist nmm:artistName ?artistName .
  FILTER(regex(?artistName, "^(The )?Bea"))
}

Here is the output:

Results:
  Beachwood Sparks
  Beastie Boys/Beck/Dust Brothers
  Beastie Boys/Dust Brothers
  Beatles
  The Beach Boys
  The Beastie Boys
  The Beatles
  The Beatniks
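
To see how the pattern in that FILTER behaves, here is a minimal sketch that applies the same expression with grep -E. It assumes that grep's extended regular expressions treat this simple pattern the same way SPARQL's regex() does. The file name artists.txt and the two non-matching names are made up for illustration; the other three names come from the results above.

```shell
# Hypothetical sample: three names from the results above plus two made-up
# non-matches ("Bee Gees", "A Beautiful Noise") to show what gets rejected.
printf '%s\n' 'Beatles' 'The Beatles' 'The Beach Boys' \
  'Bee Gees' 'A Beautiful Noise' > artists.txt

# ^ anchors the match at the start of the name, (The )? makes the article
# optional, and the literal "Bea" must come next.
grep -E '^(The )?Bea' artists.txt
```

Only the first three names are printed: "Bee Gees" fails on its third letter, and "A Beautiful Noise" fails because the anchor only allows an optional "The " before "Bea".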

One frustrating thing about tracker-sparql is that it rejects certain queries because, as it tells us, "Unrestricted predicate variables not supported." In my experience, this meant that you couldn't have a variable in a triple pattern's predicate position if there was another one in the subject position. For example, I know that the Dust Brothers have worked with the Beastie Boys and with Beck separately, but I've never heard of all of them working together. Without a predicate variable, I couldn't simply ask which work was connected to an artist with an nmm:artistName value of "Beastie Boys/Beck/Dust Brothers". I did try dc:contributor, nmm:performer, and some other properties that were used to connect an artist to a work, but with no luck. (My guess: it was some sort of remix that combined a few Dust Brothers works.)

This was a fun query, asking what values of "genre" were stored in my MP3s:

SELECT DISTINCT ?genre WHERE
{
  ?work nfo:genre ?genre
}

The results:

Results:
  Jazz
  Rock
  Classical
  New Wave
  Avantgarde
  Pop
  Salsa
  Blues
  Soundtrack
  RETRO SWING
  Swing
  Country
  Other
  Sound Clip
  jazz
  Latin
  Lo-Fi
  Rock & Roll
  Hip-Hop
  Techno-Industrial
  Euro-Techno
  Booty Bass
  Alternative
  Reggae
  Indian
  Podcast
  Electronic

This can lead to a real rabbit hole of additional queries as I wonder "what do I have in that category?" but I'll spare you that part.
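
For anyone who does want to follow that rabbit hole, a follow-up query might look like the sketch below. It assumes that each work with an nfo:genre value also has a nie:title value, which is not shown above, and that the standard Nepomuk namespace URIs apply. The file name salsa.spq is hypothetical.

```shell
# Write a follow-up query to a file (salsa.spq is a made-up name).
# Assumption: works that have an nfo:genre value also have a nie:title.
cat > salsa.spq <<'EOF'
PREFIX nfo: <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#>
PREFIX nie: <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#>
SELECT ?title WHERE {
  ?work nfo:genre "Salsa" .
  ?work nie:title ?title .
}
EOF

# Run it the same way as bea.spq, but only if the utility is installed.
if command -v tracker-sparql >/dev/null 2>&1; then
  tracker-sparql -f salsa.spq
else
  echo "tracker-sparql not found; query saved in salsa.spq" >&2
fi
```

Both triple patterns use fixed predicates, so this should stay clear of the "Unrestricted predicate variables" restriction.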

tracker-sparql has a few command line options that are shortcuts to common queries for exploring a dataset. For example, -c lists classes, and gave me a list of 230. A query for distinct rdf:type values showed only 67 being used in my file metadata, so I assume that -c refers to classes that are declared in an internal schema. The tracker-stats utility shows how many instances each class has. (The "SEE ALSO" section of the help page for tracker-store had the best list I could find of the various tracker utilities.)
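
The comparison described above can be sketched as a short script: one call lists the classes that -c reports, and a stored query asks for the rdf:type values actually in use. The file name types.spq is hypothetical, and the query is a guess at the kind of query involved rather than the exact one that was run.

```shell
# Query for the distinct rdf:type values used in the indexed data
# (types.spq is a made-up file name).
cat > types.spq <<'EOF'
SELECT DISTINCT ?class WHERE {
  ?resource a ?class .
}
EOF

if command -v tracker-sparql >/dev/null 2>&1; then
  tracker-sparql -c            # classes that tracker-sparql itself lists
  tracker-sparql -f types.spq  # classes with at least one instance
else
  echo "tracker-sparql not found; query saved in types.spq" >&2
fi
```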

The tracker indexer also pulls fairly typical metadata out of image files. Unfortunately, it doesn't pull latitude and longitude data out when present, but it does let you add and query tag values in images. I played with this using the image file above, which shows a paper lantern with the anarchy symbol that I saw in San Francisco's Chinatown during the 2010 Semantic Technologies conference. Using the tracker-tag utility, I added a tag to the image like this:

tracker-tag --add=anarchy /my/path/semtech/2010/pics/IMG_5257.jpg

This added the following triples to the dataset:

@prefix nao:  <http://www.semanticdesktop.org/ontologies/2007/08/15/nao#> . 
@prefix tr:   <http://www.tracker-project.org/ontologies/tracker#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> . 
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . 

<urn:uuid:5aa32bbc-7f08-da08-3bbd-8ae6650411fb> nao:hasTag  
  <urn:uuid:a49c693c-d439-529b-8e27-296d589e905c> . 

<urn:uuid:a49c693c-d439-529b-8e27-296d589e905c>
  tr:added "2014-01-18T22:31:44Z" ;
  tr:modified 7170 ;
  rdf:type rdfs:Resource ;
  rdf:type  nao:Tag ;
  nao:prefLabel "anarchy" . 

The first triple says that the image resource has a particular tag, and the remaining triples tell us about that tag. It was nice to see that the tag is a resource and not just a string, so it can be renamed without losing its relationships with tagged resources. It also means that the tag itself can have additional metadata assigned to it such as skos:broader values to create a taxonomy hierarchy. And of course, there are all kinds of possibilities for SPARQL queries about what is tagged with what. (It would be fun to pull a set of nao:Tag resource triples into TopBraid EVN and really turn them into a proper SKOS taxonomy.)
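
As one example of a "what is tagged with what" query, the sketch below asks for every resource carrying a tag whose label is "anarchy". It uses only the nao:hasTag and nao:prefLabel properties shown in the triples above. The file name tagged.spq is hypothetical.

```shell
# Find resources tagged "anarchy" (tagged.spq is a made-up file name).
# The tag is a resource, so the query goes through it to reach the label.
cat > tagged.spq <<'EOF'
PREFIX nao: <http://www.semanticdesktop.org/ontologies/2007/08/15/nao#>
SELECT ?resource WHERE {
  ?resource nao:hasTag ?tag .
  ?tag nao:prefLabel "anarchy" .
}
EOF

if command -v tracker-sparql >/dev/null 2>&1; then
  tracker-sparql -f tagged.spq
else
  echo "tracker-sparql not found; query saved in tagged.spq" >&2
fi
```

Because the query matches on the label rather than on the tag's URN, it would keep working for any other tag name substituted for "anarchy".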

A few random closing notes:

  • I tried a few SPARQL 1.1 features like BIND and contains() with no luck, but the tracker-sparql help page does show that the count() function and SPARQL UPDATE are supported. I tried adding a triple with an UPDATE request, but I didn't get it to work. If it were possible to add arbitrary triples about existing resources, we could store additional data about them, such as the skos:broader values mentioned above and triples about the latitude and longitude where a picture was taken, which ExifTool can extract from image files. Apache Tika, which I've written about here before, would also be great to throw into the mix.

  • It's interesting that the resources were identified with URNs instead of URLs.

  • The Adrian Perez blog post Some Tracker + SPARQL bits has some good tips, and it points to two blog entries by Adrien Bustany that describe some nice predicate functions built into Tracker's SPARQL engine.

  • It was nice to see the Nepomuk ontology used here. Talk about a semantic desktop! (Since writing the first draft of this, I have learned that the next generation of Nepomuk is not using RDF, which I was sorry to hear.) It would be nice to see a schema for the Tracker-specific classes and properties; the http://www.tracker-project.org/ontologies base URI used for some of the namespaces currently doesn't go anywhere. (If someone can point me to such a schema, I'd be happy to update this.)

  • The metadata that the indexer pulled from a PDF on my hard disk included the complete text of the PDF stored using the nie:plainTextContent property. That could be very useful for searches and text extraction.
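
The full-text point in the last note can be sketched as a query that looks for a word inside the stored text. It assumes the standard Nepomuk namespace URI for nie: and reuses the regex() function from the artist query. The search word "SPARQL" and the file name fulltext.spq are made up for illustration.

```shell
# Search the extracted text of indexed documents for a word
# (fulltext.spq and the search word are hypothetical).
cat > fulltext.spq <<'EOF'
PREFIX nie: <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#>
SELECT ?doc WHERE {
  ?doc nie:plainTextContent ?text .
  FILTER(regex(?text, "SPARQL"))
}
EOF

if command -v tracker-sparql >/dev/null 2>&1; then
  tracker-sparql -f fulltext.spq
else
  echo "tracker-sparql not found; query saved in fulltext.spq" >&2
fi
```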

Playing with this dataset, I could stay busy for hours even if I limited myself to SPARQL queries about my own MP3s. Assigning, querying, and curating tags would also be a lot of fun to play with; I assigned one to a JPEG file above, but tags can be assigned to any resource. For example, imagine running some text analytics on nie:plainTextContent values to come up with tag values to assign to that PDF. And, if music files have an artist property and PDFs have a plainTextContent property, there are probably plenty of other properties that are specific to certain file types and reveal interesting things about them, especially when queried with SPARQL to find patterns among the values of the files in your own collection.

