Danny Ayers recently emailed me about a posting by IBM's Lee Feigenbaum on the W3C's Semantic Web Education and Outreach Interest Group mailing list. Lee had written about a colleague's concerns about semantic web technologies, and Danny asked for my thoughts on the issue. I emailed him a few paragraphs, and I have since decided that I might as well post them here, with a bit of copy-editing and a few extra thoughts.
Lee's colleague
expressed concerns that SW technologies (and RDF / SPARQL in particular) may fall short in one prominent area in which XML / XQuery shines: dealing with content-oriented (often mixed content) documents. He was concerned about this given some of our claims about the value of RDF/SW technologies as a unifying environment for data and metadata.
He gave various examples ranging from insurance policies to resumes to rental agreements, with the basic idea being that XQuery can easily answer questions that involve searching within a document (or, more-so, searching for text in a particular paragraph of a document, perhaps with emphasis added) which uses XML markup. He wondered aloud and we discussed what the SW approach to this would be, and we agreed that it's lacking right now. He expressed worry that whereas XML can wrap data that might be best expressed as relational or RDF data (and then join that data in XQuery queries with document data), the RDF world may not have as nice a story.
Yes, RDF and related technologies fall short in areas where XML and XQuery shine, but XML and XQuery fall short in areas where RDF shines. (And they both fall short in areas where relational databases shine, and... etc.) RDF is a data model. Certain problem domains map very well to that data model, especially large collections of assignments of values to objects that don't normalize well into relational tables or even into a single XML schema. An add-on like OWL makes it easier to define relationships between seemingly unrelated classes of information, making it easier to use the aggregate sources together.
RDF can add a lot to a publishing system. For example, it can be used to store metadata about document components and associations as document files move through a workflow. (So can plain XML as retrieved by XQuery, but RDF-based data from documents in different formats can be aggregated and used with less custom coding.) Tracking the relationship between in-line elements and their containing block elements (that is, mixed content), however, is not something it can help much with.
For some perspective on what RDF can contribute to an XML-based system, it helps to forget one thing (RDF/XML; everything I describe here would work just fine with other RDF syntaxes) and to remember something else: RDF's ability to store metadata about anything with a URI means that it can be used to track information about any XML element with its own ID. In the case of block elements, this is useful for the publishing industry because if one block of a document stores a recipe, another a book excerpt, and another a picture, there will be separate metadata to store about each. (For this sort of thing, I think that RDFa will help to lure back people who were scared off by RDF/XML.) Even inline elements, treated as independent units to track, can have value added by RDF if they have an ID: a linking element may have a link type assigned, the date that the link's validity was last verified, and other metadata. To take advantage of an inline element's relationship to its text node siblings and their containing element, though, you'll need something that can parse and read the combination, such as an XSLT processor or, for sufficiently large XML, an XQuery processor.
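To make the "any XML element with its own ID" idea concrete, here is a minimal sketch in Python using only the standard library. The document, the base URI, and the `http://example.com/vocab#` property names are all invented for illustration, and plain tuples stand in for a real RDF store. It shows one way to derive a URI for each element that has an ID and then attach separate metadata to a block element and to an inline link.

```python
import xml.etree.ElementTree as ET

# Hypothetical document and document URI, made up for this sketch.
DOC_URI = "http://example.com/docs/issue42.xml"
DOC = """<article>
  <section id="recipe1"><title>Apple Pie</title></section>
  <section id="excerpt1"><title>Chapter One</title>
    <p>See <link id="link1" href="http://example.org/">this site</link>.</p>
  </section>
</article>"""


def element_uris(xml_text, doc_uri):
    """Map each element ID to a URI built from the document URI plus a fragment."""
    root = ET.fromstring(xml_text)
    return {
        el.get("id"): f"{doc_uri}#{el.get('id')}"
        for el in root.iter()
        if el.get("id") is not None
    }


uris = element_uris(DOC, DOC_URI)

# Triples as (subject, predicate, object) tuples. The vocabulary is assumed,
# not taken from any real schema.
EX = "http://example.com/vocab#"
triples = [
    (uris["recipe1"], EX + "contentType", "recipe"),
    (uris["excerpt1"], EX + "contentType", "book excerpt"),
    (uris["link1"], EX + "linkType", "citation"),
    (uris["link1"], EX + "lastVerified", "2006-12-01"),
]


def describe(subject, triples):
    """Collect every predicate/object pair recorded for one subject URI."""
    return {p: o for s, p, o in triples if s == subject}


link_metadata = describe(uris["link1"], triples)
```

Note what the sketch does not do: nothing in the triples says where `link1` sits among the text of its paragraph. That part still needs something that reads the XML itself.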
Searching within documents is certainly where XQuery shines, but unless you're using an XQuery engine for pure substring search (for example, "show me which documents have the string 'fireplace' in them"), the insurance policy and rental agreement examples would only work well with XQuery if all of the documents conformed to the same schema. The RDF/OWL strength that makes it popular for semantic web work is its ability to query collections of data in the same domain that aren't necessarily all of identical structure. A collection of insurance policies from different companies will have some fields in common, some different fields, some fields that look different but mean the same thing... treating them as a consistent collection will take a lot of XQuery custom coding, but with RDF + SPARQL, it will only take the application of an increasingly popular standard way of specifying the semantics of each company's forms (OWL) to treat the collection as a single aggregate to query. If you add a set of insurance forms from another insurance company to the set, you only need to add a little more to your OWL, and you can leave your SPARQL queries alone. Done the XQuery way, accounting for this new data will mean checking all your FLWOR expressions to see whether they need revision.
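The following toy sketch illustrates the shape of that argument in Python. It is not OWL or SPARQL: a dictionary stands in for equivalence statements of the kind OWL can express, a list of tuples stands in for the RDF data, and a small function stands in for the query. The company names, property names, and values are all hypothetical. The point is only that adding a third company's data means extending the mapping, not rewriting the query.

```python
# Toy data from two hypothetical insurers that name their fields differently.
triples = [
    ("policyA1", "acme:premium", 1200),
    ("policyA1", "acme:holder", "J. Smith"),
    ("policyB7", "bravo:annualCost", 950),
    ("policyB7", "bravo:insuredParty", "R. Jones"),
]

# Stand-in for equivalence declarations: each company-specific property
# is mapped to a shared one. A real OWL reasoner does far more than this.
equivalent = {
    "acme:premium": "common:premium",
    "bravo:annualCost": "common:premium",
    "acme:holder": "common:holder",
    "bravo:insuredParty": "common:holder",
}


def normalize(triples, equivalent):
    """Rewrite each predicate to its shared equivalent, if one is declared."""
    return [(s, equivalent.get(p, p), o) for s, p, o in triples]


def policies_with_premium_below(triples, limit):
    """The 'query': written once, against the shared vocabulary only."""
    return sorted(
        s for s, p, o in triples if p == "common:premium" and o < limit
    )


before = policies_with_premium_below(normalize(triples, equivalent), 1000)

# A third insurer arrives: add its data and one mapping entry.
# The query function above is left untouched.
triples.append(("policyC3", "charlie:yearlyFee", 800))
equivalent["charlie:yearlyFee"] = "common:premium"

after = policies_with_premium_below(normalize(triples, equivalent), 1000)
```

In the XQuery version of this scenario, the equivalent of `policies_with_premium_below` would typically have to name each company's element explicitly, which is why new data means revisiting the queries.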
My XML 2006 talk was unfortunately in the same time slot as another one on integration of different data sources using RDF/OWL, and this other one used XQuery as well. I'm looking forward to finding out more about what Ken and Ronald did and how they did it; more information is available at a page they did for the project, although I haven't had a chance to look closely at it yet.
Link to XML2006 project page for Ken and Ronald, http://www.rrecktek.com/xml2006/, is not working.
It looks like the whole site was down and is now back up.