Simple semi-structured data entry

With RDF.

When most people want to take notes on a collection of things, and they know that the notes will have some structure but they're not sure about the nature of that structure just yet, they use a spreadsheet. For each thing that they take notes on, they add a new row; for each attribute of the things under review, they add a column. From an investment banker comparing potential investments to a scout leader planning a camping trip, the grid makes it easy for you to compare similar attributes of different things without forcing you to specify all of your attributes before starting your data entry like a more serious database application would.

In theory, RDF is ideal for this, because you can assign any attribute name/value pair to any resource that you can identify with no requirement to plan it all in advance, but in practice, it's rarely as easy as pouring names and numbers into a spreadsheet. I've often thought that it would be fun to build a freeform database program that lets people do data entry and make up new fields as they go along, all with RDF underneath. I even wrote some Python code for this a few years ago, but never followed through. Since joining TopQuadrant, I've wondered about assembling something like this with the company's application development tools, but then I realized that the Free Edition of TopBraid Composer pretty much already does this.

Here's a use case that's happened to most people in the modern workforce: you're told that you'll be joining a particular project, and to get you started someone emails a zip file of relevant files for you to review. For my notes on these files, I might create a text file or a spreadsheet, but I'd probably assemble an XML file where I made up element names as I went along. These elements would track the filename, document title, author, age, comments, and probably some project-specific fields. When the big picture started coming into focus, I'd write a little XSLT to convert this XML to presentable HTML to show to others if necessary.

A key reason that this would be easy for me is that the Emacs nxml mode automates much of the work of entering tags and keeping everything well-formed. How would doing it in RDF be better? I could do the same steps as above using RDF-Friendly XML and nxml's excellent handling of RDF/XML, but I'd rather use a form-based interface instead of Emacs. This is where the free edition of TopBraid Composer comes in.

The first step is creating an RDF data file with all the easily available file metadata: the name, size, and last modification date for each file. I wrote a simple Perl script called dir2rdf.pl to do this; it's simple because it declares a File class and all the properties for that class in the namespace declared for the file. (I also created a slightly more complex Perl script called dir2nfordf.pl which does the same thing but uses existing classes and properties from the NEPOMUK File Ontology. It's more complex because this ontology has properties based on properties from other vocabularies such as Dublin Core, so editing data with this ontology means pulling in a few layers of other ones.)

When you pipe the result of the Windows dir command into the simpler Perl script, it outputs the property and class definitions for the files and an entry like this for each file:

  <File rdf:ID='file11' sd:lastModified='2009-10-30T17:05:00'
        sd:fileName='teams.csv' sd:fileSize='164' rdfs:comment=''/>
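
For readers who want to try something similar, here is a minimal sketch of the same idea in Python. It is not the dir2rdf.pl script: it reads the directory with os.scandir instead of parsing dir output, it leaves out the class and property definitions, and the namespace URI and the function names (file_entry, directory_entries, to_rdfxml) are placeholders made up for this illustration.

```python
import os
from datetime import datetime
from xml.sax.saxutils import quoteattr

# Placeholder namespace: the URI that dir2rdf.pl actually declares is not
# given in the post, so this one is an assumption.
SD_NS = "http://example.org/files#"


def file_entry(index, name, size, modified):
    """Build one <File/> element; `modified` is a datetime object."""
    stamp = modified.strftime("%Y-%m-%dT%H:%M:%S")
    file_id = quoteattr(f"file{index}")
    # quoteattr escapes &, < and > and adds the surrounding quotes.
    return (
        f"<File rdf:ID={file_id} sd:lastModified={quoteattr(stamp)}\n"
        f"      sd:fileName={quoteattr(name)} sd:fileSize={quoteattr(str(size))}"
        ' rdfs:comment=""/>'
    )


def directory_entries(path="."):
    """Return one <File/> element per regular file in `path`, sorted by name."""
    files = sorted(
        (e for e in os.scandir(path) if e.is_file()), key=lambda e: e.name
    )
    return [
        file_entry(i, e.name, e.stat().st_size,
                   datetime.fromtimestamp(e.stat().st_mtime))
        for i, e in enumerate(files, start=1)
    ]


def to_rdfxml(entries):
    """Wrap the entries in an rdf:RDF element with the needed namespaces."""
    header = (
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"\n'
        '         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"\n'
        f'         xmlns:sd="{SD_NS}" xmlns="{SD_NS}">'
    )
    return "\n".join([header, *entries, "</rdf:RDF>"])


if __name__ == "__main__":
    print(to_rdfxml(directory_entries(".")))
```

The empty rdfs:comment attribute on each entry is there so that a comment field is ready to fill in once the data is loaded into an editor.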

Loaded into the free edition of TopBraid Composer, the editing of that "record" looks like this (I've rearranged the combination of screen sections a bit from the default TopBraid "perspective", to use the Eclipse parlance):

TopBraid Composer screen shot

I can edit the values on this form, although there's no reason to edit the file name, size, or last modified values. What I'm really going to do is add notes to the rdfs:comment property, as I've already done above, and perhaps add more comment properties for this resource. The really nice part is that I can define new properties in the Properties view on the right—for example, some project-specific subproperties of rdfs:comment—drag them onto the form for any of my File resources, and then add values to them, giving me the functional equivalent of adding new columns to a spreadsheet.

It's actually better than that, because if I wanted to add three contactWithQuestions names to one of these File resources on a spreadsheet grid, I'd have to either add three columns or string together three values in one spreadsheet cell as if they were one. With RDF, though, I can define a contactWithQuestions property and then add three separate values for this property to the same resource. Moving beyond the use of simple string data for the values here, I could create object properties (properties where the value is another resource—in this case, to define relationships between File objects such as mentionedIn or basedOn) by defining them in the Properties view on the right with a range of File. When I want to assign one of these properties to a particular File object, I would drag it from the property list on the right onto the Resource form for that File and then pick out the appropriate file it refers to from a drop-down list. For example, after creating a mentionedIn property, if teams.csv was mentioned in index.html and I wanted to record this in my notes on teams.csv, I'd drag the mentionedIn property onto the Resource Form for teams.csv and select index.html as the value for that property.
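
To make the difference from a spreadsheet grid concrete, here is a small Python sketch that treats the notes as a set of (resource, property, value) triples. It only illustrates the data model, not how TopBraid Composer stores anything; the helper functions and the contact names Alice, Bob, and Carol are made up for the example, while teams.csv, index.html, contactWithQuestions, and mentionedIn come from the scenario above.

```python
def add(graph, subject, prop, value):
    """Record one statement; a resource can have any number of values per property."""
    graph.add((subject, prop, value))


def values(graph, subject, prop):
    """All values of `prop` for `subject`, sorted for stable output."""
    return sorted(o for s, p, o in graph if s == subject and p == prop)


def subjects(graph, prop, value):
    """All resources whose `prop` has the given value."""
    return sorted(s for s, p, o in graph if p == prop and o == value)


notes = set()

# Three separate values for one property on one resource:
# no extra columns, no values strung together in one cell.
for name in ("Alice", "Bob", "Carol"):  # hypothetical contact names
    add(notes, "teams.csv", "contactWithQuestions", name)

# An object property: the value is another File resource.
add(notes, "teams.csv", "mentionedIn", "index.html")

print(values(notes, "teams.csv", "contactWithQuestions"))  # ['Alice', 'Bob', 'Carol']
print(subjects(notes, "mentionedIn", "index.html"))        # ['teams.csv']
```

Because every statement is its own triple, adding a fourth contact or a second mentionedIn value is just one more add call, with no change to any schema.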

Because this is a GUI editing interface, I can also add new File resources and delete existing ones (the equivalent of inserting and deleting rows on a spreadsheet) by clicking icons on the Instances view at the bottom. (Another nice bonus with TopBraid Composer is the SPARQL tab next to that, where you can enter and run SPARQL queries about the data.)

So, I've got my form-driven interface that I can use with any RDF data. I've kept my address book in RDF for a long time; maybe I should try maintaining it like this instead of with Emacs.
