My old friend Dale Waldt (I remember, immediately after the announcement of the existence of XML at SGML 1996, going up to my then-coworker Dale and asking "So what do we think?") recently posted an entry on the Gilbane XML blog titled Why Adding Semantics to Web Data is Difficult. A few days ago I posted a comment saying that the things that he saw as missing from semantic technologies are actually already there and working well, but my reply hasn't shown up yet, so after a bit of revision, I'm putting it here. For my blog entry categories, I've put this under "Publishing" because most of what I've written below is already familiar to people in the semantic web world, but not as widely known in the publishing world.
Dale wrote:
Consider though, that the schema in use can tell us the names of semantically defined elements, but not necessarily their meaning. I can tell you something about a piece of data by using the <income> tag, but how, in a schema can I tell you it is a net <income> calculated using the guidelines of US Internal Revenue Service, and therefore suitable for eFiling my tax return? For that matter, one system might use the element type name <net_income> while another might use <inc>.
This is why the semantic web is built around URLs, not just element names. If someone refers to a "title" and you don't know whether that person is an HR administrator who means "job title" or a realtor referring to the deed to a piece of property, you don't know what they mean. However, if I refer to a http://purl.org/dc/elements/1.1/title, you know that I mean the title of a work or resource, because the URL makes it clear that I'm referring to the Dublin Core sense of the term.
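To see the difference in practice, here is a minimal Python sketch using the standard library's XML parser. Only the Dublin Core URL comes from the text above; the http://example.com/hr/ namespace and the sample values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two elements share the local name "title" but live in different namespaces.
# The hr namespace URL is hypothetical; the dc one is the Dublin Core URL.
DOC = """<record xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:hr="http://example.com/hr/">
  <dc:title>Moby-Dick</dc:title>
  <hr:title>Senior Editor</hr:title>
</record>"""

DC_TITLE = "{http://purl.org/dc/elements/1.1/}title"

root = ET.fromstring(DOC)

# ElementTree expands each prefixed name to {namespace-URL}localname,
# so the two "title" elements can no longer be confused with each other.
expanded_names = [child.tag for child in root]

# Asking for the Dublin Core title finds only the title of the work.
work_title = root.find(DC_TITLE).text

print(expanded_names)
print(work_title)
```

The job title is still in the data, but a program looking for the Dublin Core sense of "title" never mistakes it for one.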
As I understand it, XBRL's goal was not so much to standardize the vocabularies of element type names as to standardize ways of identifying them. For example, in GE's XBRL financial statement, they chose to identify net income with the URL http://www.xbrl.org/us/fr/common/pte/2005-02-28#usfr-pte:NetIncome and have this declared in a filed document. Instead of encouraging everyone to create their own new vocabularies, though, the XBRL effort did create a set of US GAAP taxonomies, and these are forming a core set of documented, commonly understood terminology for U.S. accounting.
How will we know that elements labeled with <net_income> and <inc> are the same and should be handled as such?
Let's assume that company X uses the term "net_income" and company Y uses the term "inc". When they publicly define what they mean by these terms using OWL ontologies or XBRL taxonomies, they avoid the confusion you describe by defining them with URLs, just as the OCLC did for Dublin Core terms, so let's say the terms' full names are http://www.x.com/ns/xbrl/net_income and http://www.y.com/some/path/inc. (Of course, if an XML document includes the namespace declarations xmlns:x="http://www.x.com/ns/xbrl/" and xmlns:y="http://www.y.com/some/path/", the element names can use the abbreviations x:net_income and y:inc.)
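The abbreviation mechanism can be sketched in a few lines of Python. This is only an illustration of the RDF convention of joining a namespace URL and a local name into one full name; the `expand` helper is a hypothetical name, not part of any library.

```python
# Prefix-to-namespace bindings, as the xmlns declarations would set them up.
NAMESPACES = {
    "x": "http://www.x.com/ns/xbrl/",
    "y": "http://www.y.com/some/path/",
}


def expand(qname, namespaces=NAMESPACES):
    """Turn a prefixed name such as 'x:net_income' into its full URL."""
    prefix, local_name = qname.split(":", 1)
    # RDF-style expansion: namespace URL followed directly by the local name.
    return namespaces[prefix] + local_name


full_x = expand("x:net_income")
full_y = expand("y:inc")

print(full_x)
print(full_y)
```

The prefixes themselves carry no meaning; two documents could bind different prefixes to the same namespace URL and would still be talking about the same terms.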
The following bit of OWL asserts that they're both the same as GE's term for net income, and a SPARQL query that uses the GE URL to say "get me net income figures" will get the others as well:
    <owl:DatatypeProperty rdf:about="http://www.xbrl.org/us/fr/common/pte/2005-02-28#usfr-pte:NetIncome">
      <owl:equivalentProperty>
        <owl:DatatypeProperty rdf:about="http://www.x.com/ns/xbrl/net_income"/>
      </owl:equivalentProperty>
      <owl:equivalentProperty>
        <owl:DatatypeProperty rdf:about="http://www.y.com/some/path/inc"/>
      </owl:equivalentProperty>
    </owl:DatatypeProperty>
This nicely demonstrates the potential of OWL as metadata that adds value to existing bodies of data.
OWL has been a standard for four years, and there are several implementations available that let you do this. (Speaking of semantics, in addition to defining such equivalences, OWL can also encode semantics.)
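As a rough illustration of what such an implementation does with the equivalence assertions, here is a toy Python sketch. It is a stand-in for an OWL reasoner plus a SPARQL engine, not a real one: the company URLs and the figures are invented, and the function names are hypothetical.

```python
GE = "http://www.xbrl.org/us/fr/common/pte/2005-02-28#usfr-pte:NetIncome"
X = "http://www.x.com/ns/xbrl/net_income"
Y = "http://www.y.com/some/path/inc"

# Equivalence assertions, mirroring the owl:equivalentProperty statements.
EQUIVALENT = [(GE, X), (GE, Y)]

# (subject, property, value) triples; companies and figures are invented.
TRIPLES = [
    ("http://example.com/company/ge", GE, 100),
    ("http://example.com/company/x", X, 20),
    ("http://example.com/company/y", Y, 3),
    ("http://example.com/company/y", "http://www.y.com/some/path/employees", 50),
]


def equivalents_of(prop, pairs):
    """Return prop plus every property reachable through equivalence pairs."""
    found = {prop}
    changed = True
    while changed:
        changed = False
        for a, b in pairs:
            # Equivalence is symmetric and transitive: follow both directions.
            if a in found and b not in found:
                found.add(b)
                changed = True
            elif b in found and a not in found:
                found.add(a)
                changed = True
    return found


def query(prop, triples, pairs):
    """Return (subject, value) for prop and for all equivalent properties."""
    props = equivalents_of(prop, pairs)
    return [(s, v) for s, p, v in triples if p in props]


net_incomes = query(GE, TRIPLES, EQUIVALENT)
print(net_incomes)
```

Because the equivalence runs in both directions, asking with company X's URL instead of GE's returns the same three figures.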
The great thing about OWL's relationship to XBRL is that much of XBRL is about defining taxonomies and semantics, and OWL is about building on such definitions to get more value out of data.
Obviously an industry standard like XBRL (eXtensible Business Reporting Language) can help standardize vocabularies for element type names, but this cannot be the whole solution or XBRL use would be more widespread.
XBRL helps to standardize naming within the world of business reporting, but the need for vocabulary definition standards and tools goes well beyond that world. (The full set of XBRL specs is also a complex solution to a complex problem, which keeps its adoption from spreading very quickly.) The goal of RDFS was to help people define such vocabularies, but OWL provides a superset of RDFS and comes with slicker tools, so people sometimes build OWL ontologies when they only need an RDFS vocabulary.
I think the Semantic Web will require more than schemas and XML-aware search tools to reach its full potential in intelligent data and applications that process them. What is probably needed is a concerted effort to build semantic data and tools that can process these, including browsing, data storage, search, and classification tools.
For data storage and search, commercial and open source triplestore tools are available. (I recently mentioned that I've been blogging less because I've been looking into them.) For browsing, new semantic web Firefox plugins crop up all the time. I'll discuss classification next week, but as a hint, it turns around the question of what semantic web technology can bring to the publishing world: it's more about what semantic web developers can learn from the publishing world.
The Semantic Web ideals, while quite exciting, have always struck me as too much of an all-or-none proposition: either my data is part of this universal graph of knowledge or it isn't, based on whether I have encoded my data in triples (RDF, RDFa, etc.). But it is not always the consumer that needs help "understanding" my data's place in that graph -- I as the producer do as well. And semantic assertions (i.e., this tag equals dc:title) take time and understanding, which many/most do not have. What they DO have is domain knowledge. E.g., "Here's what the figures in this column of this spreadsheet I am publishing on the web as an HTML table mean."
I'd love to see tools that allow publishers to make their data "smarter" over time -- not as an all-or-none proposition. Yahoo's Search Monkey and perhaps GRDDL (?) are perhaps steps in the right direction. As another example, tagging seems to be a fairly easy-to-grok and easy-to-implement feature. How about more focus on something simple like tagging and tools that allow the publisher to then create equivalencies between a tag on their site and some domain-specific ontology if such exists (and probably best we don't use the word "ontology" ;-)). My take (influenced by working w/ folks in higher ed) is that folks are willing to do a bit of work to "rationalize" their data, esp. if they gain some benefit. But not a lot of work, and especially not if they need to understand a whole new world of knowledge representation.
Our approach has been to create a system (theoretically) as easy to use as FilemakerPro, Microsoft Access, Excel, etc. Users can create arbitrary sets of "attributes" for their collections of digital things (audio, images, video, documents, web pages) and then assign values as they wish. They may start with just a title and date, but when possible they may add much more detailed metadata. And commercial sets of, say, images+metadata are easy to incorporate as well.
Everything is stored in a backend with Atom/AtomPub interfaces in and out. The key-value pairs are simply held in atom:category elements -- one atom entry for every item in the system. Many of these collections do, in fact, map to existing metadata schemes, VRA Core4 for images, for example. But indeed, everything has a scheme, if only local to that one user's collection. Much is gained here in terms of interoperability (Google spreadsheets is becoming a favorite data creation tool, since it is so easy to "import" into our system), preservation, data portability, etc. And if/when a set of data needs to enter the cloud of linked data, asserting the equivalencies and serializing to RDF is quite easy.
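A rough Python sketch of that last step might look like the following. It is not the system described above: it assumes each atom:category uses its scheme attribute for the attribute's URL and its term attribute for the value, and the entry, the local attribute URLs, and the `entry_to_ntriples` helper are all hypothetical. The two Dublin Core URLs are real terms.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical entry: scheme holds the attribute URL, term holds the value.
ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <id>http://example.com/items/42</id>
  <title>Harbor at dusk</title>
  <category scheme="http://example.com/attrs/photographer" term="A. Smith"/>
  <category scheme="http://example.com/attrs/date" term="1999"/>
</entry>"""

# Equivalences the publisher asserts later, when the data joins linked data.
EQUIVALENT = {
    "http://example.com/attrs/photographer": "http://purl.org/dc/elements/1.1/creator",
    "http://example.com/attrs/date": "http://purl.org/dc/elements/1.1/date",
}


def entry_to_ntriples(entry_xml, equivalent):
    """Serialize an entry's categories as N-Triples lines with literal objects."""
    entry = ET.fromstring(entry_xml)
    subject = entry.find(ATOM + "id").text
    lines = []
    for category in entry.findall(ATOM + "category"):
        scheme = category.get("scheme")
        # Use the shared term when an equivalence exists, else the local one.
        predicate = equivalent.get(scheme, scheme)
        # Minimal literal escaping: backslashes and double quotes only.
        value = category.get("term").replace("\\", "\\\\").replace('"', '\\"')
        lines.append('<%s> <%s> "%s" .' % (subject, predicate, value))
    return lines


ntriples = entry_to_ntriples(ENTRY, EQUIVALENT)
print("\n".join(ntriples))
```

Attributes with no asserted equivalence keep their local URLs, so the data can get "smarter" one attribute at a time.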
I guess my point is that there is some low-hanging fruit on the way to the Semantic Web that does not require publishers to join up here and now. Simply thinking in terms of regularizing metadata schemes, data portability, simple XML-based formats (Atom +1) gets us a very significant way along a useful path. Not, certainly, the whole vision of the Semantic Web but quite useful nonetheless.
Bob:
Sorry your comment didn't show up. I just found it in the comment spam folder, published it, and sent Dale an email.
Frank