I read Elliot Kimber's series on XML content management software as it came out, and I've been re-reading it lately for work project reasons. We work at the same company, where content management issues come up a lot. Content management systems are also one of those software categories where many products claim to do it all, but what exactly constitutes "it all" is very vague. Each vendor makes up its own features and puts its own spin on the au courant buzzwords, making it difficult to compare different products. Elliot's approach of treating a basic CMS as a source control system like Subversion plus x, y, and z, and then analyzing what x, y, and z should be, makes it easier to sort out expectations of a CMS.
I've also been getting to know Subversion. One nice thing that I learned when someone pointed it out to Elliot is that it can store arbitrary metadata, even passing my Arbitrary Metadata Test by letting me assign a goofinessFactor of 3.1416 to one file.
Unfortunately, Subversion doesn't let you search for files based on metadata values. I see the value of finding an object and then looking at its metadata values, but I want the ability to search the metadata values to find objects. There are ways to add this in, but first let's address the always important question: why bother?
Subversion + ? = (CMS | DAM)
I thought that I would learn a lot by adding whatever to Subversion to build a simplified CMS. (Subversion hook scripts make it easy to trigger Python scripts upon events such as check-in.) Elliot makes it clear that link management is important in a CMS if you want to dynamically create documents from stored pieces and track dependency relationships (and for typical CMS use, you definitely want to do the former and probably the latter), but I didn't want to add that much to Subversion, so I thought of some lower-hanging fruit: a Digital Asset Manager, a project that also gives me the benefit of a cheap pun to use. (Years ago, my future wife and I saw that upon finishing our group's tour of the Hoover Dam, the older gentleman leading the tour clearly enjoyed saying "Thanks for taking the Dam(n) tour!") Like so many people doing semantic web related development, I could start by creating Yet Another Photo Management System Using RDF. I could also store to-do lists, XML files of all persuasions, Microsoft Office and Open Office files, and other "digital assets," as I recently read about Jason Hunter and Joey Hess doing.
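As a rough idea of what such a hook script might look like, here is a minimal sketch of the Python side of a post-commit hook. Subversion passes the repository path and the new revision number to the hook, and `svnlook changed` lists what the commit touched. The function names (`parse_changed`, `needs_indexing`, `run_hook`) are my own inventions for illustration, and the step that actually stores the metadata is left as a comment.

```python
import subprocess


def parse_changed(output):
    """Parse `svnlook changed` output into (flags, path) pairs.

    Each line carries two status columns and then the path starting at
    column 5, e.g. "U   doc/intro.html" or "_U  img/logo.png".
    """
    changes = []
    for line in output.splitlines():
        if len(line) > 4:
            changes.append((line[:2], line[4:]))
    return changes


def needs_indexing(changes):
    """Return the paths that still exist after the commit (not deleted)."""
    return [path for flags, path in changes if flags[0] != "D"]


def run_hook(repos, rev):
    """Hypothetical hook body: list the paths whose metadata should be re-read."""
    result = subprocess.run(
        ["svnlook", "changed", "-r", str(rev), repos],
        check=True,
        capture_output=True,
        text=True,
    )
    for path in needs_indexing(parse_changed(result.stdout)):
        # A real hook would read the properties here (for example with
        # `svnlook propget`) and copy them into a searchable store.
        print(path)
```

A hook file at hooks/post-commit in the repository would then call something like `run_hook(sys.argv[1], sys.argv[2])`.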
If I can forget about linking, I only need to add better metadata management to Subversion's excellent storage and version control and add some of the x, y, and z features mentioned above. But do I really want to use RDF to store the metadata?
The case for storing the metadata in MySQL
A relational database offers obvious benefits for storing data that fits easily into rows and columns, and you can have as many columns as you need. One great thing about Subversion's metadata capabilities is that the metadata is versioned, like the files that you check in, so that you could say that at r9, the editor of a document was John Smith, but at r10 it was Jane Jones, and you could always go back and see who was the editor at r9. A simple relational table could store the document's pathname as an ID, the property name (e.g. "editor"), the property value, and the revision number. That's four pieces of information, and therefore not a great fit for an RDF triplestore.
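To make the four-column idea concrete, here is a minimal sketch of such a table and the kind of "find objects by metadata value" search that Subversion lacks. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere. The table name `svn_props`, its column names, and the `paths_with` helper are assumptions made up for this example.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the schema is only an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE svn_props (
           path       TEXT    NOT NULL,
           prop_name  TEXT    NOT NULL,
           prop_value TEXT    NOT NULL,
           revision   INTEGER NOT NULL,
           PRIMARY KEY (path, prop_name, revision)
       )"""
)
conn.executemany(
    "INSERT INTO svn_props VALUES (?, ?, ?, ?)",
    [
        ("doc/intro.html", "editor", "John Smith", 9),
        ("doc/intro.html", "editor", "Jane Jones", 10),
    ],
)


def paths_with(conn, name, value, revision):
    """Return paths whose property `name` equals `value` as of `revision`.

    For each path, only the most recent row at or before `revision` counts,
    so a later change to the property hides the earlier value.
    """
    query = """
        SELECT p.path FROM svn_props p
        WHERE p.prop_name = ? AND p.prop_value = ?
          AND p.revision = (
              SELECT MAX(q.revision) FROM svn_props q
              WHERE q.path = p.path
                AND q.prop_name = p.prop_name
                AND q.revision <= ?)
        ORDER BY p.path
    """
    return [row[0] for row in conn.execute(query, (name, value, revision))]
```

With the two sample rows, asking for documents edited by John Smith as of r9 finds doc/intro.html, while the same question as of r10 finds nothing, because Jane Jones had taken over by then.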
The case for storing the metadata in an RDF triplestore
It would be a bit kludgy to add the revision number as a suffix to the file ID (e.g., doc/intro.htmlr9) and have a regular expression peel it off before presenting it for output, but at least it would squeeze these four pieces of information into a triple. It wouldn't be that much trouble, and because it helps to distinguish between two versions of a file, we could consider a revision number to be part of the identifier anyway. (I'm sure others have thought harder about this than I have, so I'd appreciate any pointers.)
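The suffix kludge could be sketched like this, assuming the "r" plus digits convention from the example above. The helper names `make_id` and `split_id` are hypothetical.

```python
import re

# A trailing "r" followed by digits is treated as the revision suffix.
_REV_SUFFIX = re.compile(r"^(?P<path>.+)r(?P<rev>\d+)$")


def make_id(path, revision):
    """Build the triple subject by appending the revision to the path."""
    return f"{path}r{revision}"


def split_id(identifier):
    """Peel the revision suffix off again, returning (path, revision)."""
    match = _REV_SUFFIX.match(identifier)
    if match is None:
        raise ValueError(f"no revision suffix in {identifier!r}")
    return match.group("path"), int(match.group("rev"))
```

Calling `split_id(make_id("doc/intro.html", 9))` gives back the path doc/intro.html and the number 9, so the fourth piece of information survives the round trip through a triple's subject.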
By letting us store RDF versions of the metadata, what would this little kludge buy us? So far, I've thought of two things. First—and this is what gave me the idea for the whole thing in the first place—OWL reasoners could take advantage of the data. For example, if I want a picture of a logo, and I had declared that files with a type of JPG, JPEG, BMP, PNG, and TIF were image files, then I could easily search the metadata of just the image files without worrying about format. I'd love to hear more potential examples of useful, realistic OWL-based queries to do on such data, and probably won't start any coding until I find some.
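To illustrate the image-file example, here is a toy sketch in plain Python. It is not a real OWL reasoner or triplestore. It only mimics the one inference involved, namely that anything typed as a subclass counts as a member of the parent class. The class names, property names, and sample file paths are all invented for the illustration.

```python
# Hypothetical declarations: each format class is a subclass of ImageFile.
SUBCLASS_OF = {
    "JPG": "ImageFile",
    "JPEG": "ImageFile",
    "BMP": "ImageFile",
    "PNG": "ImageFile",
    "TIF": "ImageFile",
}

# Made-up metadata as (subject, predicate, object) triples.
TRIPLES = [
    ("img/logo.png", "type", "PNG"),
    ("img/logo.png", "subject", "logo"),
    ("img/team.jpg", "type", "JPG"),
    ("img/team.jpg", "subject", "staff photo"),
    ("doc/logo-guidelines.html", "type", "HTML"),
    ("doc/logo-guidelines.html", "subject", "logo"),
]


def superclasses(cls):
    """Yield cls and then each ancestor found by following SUBCLASS_OF."""
    seen = set()
    while cls is not None and cls not in seen:
        seen.add(cls)
        yield cls
        cls = SUBCLASS_OF.get(cls)


def instances_of(target):
    """Subjects whose declared type is `target` or one of its subclasses."""
    return {
        s for s, p, o in TRIPLES
        if p == "type" and target in superclasses(o)
    }


def find(target_class, predicate, value):
    """Search one property, restricted to inferred members of a class."""
    members = instances_of(target_class)
    return sorted(
        s for s, p, o in TRIPLES
        if s in members and p == predicate and o == value
    )
```

Here `find("ImageFile", "subject", "logo")` returns only img/logo.png: the HTML guidelines document also has a subject of "logo", but it is not an image, and the query never had to name PNG or any other format.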
The second advantage of storing it in RDF is that more metadata extraction tools are already out there for the taking. For example, there are free tools for pulling XMP-style RDF from JPEG and Adobe formats. More importantly, the GRDDL community is writing XSLT stylesheets to pull metadata from XML-based resources. I think that this community is a bit optimistic in hoping that movie theater and pizza shop web site owners will add processing instructions to their XHTML files that point to these stylesheets, but if the code is being written to do the extractions, there are all kinds of applications that can benefit. A Subversion-based DAM is one.
Any other suggestions or ideas?
Excellent write-up. It's great to see people being excited about RDF and XMP. We have been immersed in pretty much the same technologies for quite some time. We've also been big proponents of embedding and encrypting arbitrary amounts of data and business logic/forms in files, and we created a powerful library that can embed RDF/XMP data into any file type. :)
If you ever feel bored, give our office a buzz and somebody will give you a tour.
Alex M., MediaBeacon, Inc.