
The future of RDFa

Think big.

Since the beginning of RDFa's history, many of its advocates have stressed its value in adding machine-readable semantics to personal web pages. This example from the RDFa Primer is typical:

<p class="contactinfo" about="http://example.org/staff/jo">
  <span property="contact:fn">Jo Smith</span>.
  <span property="contact:title">Web hacker</span> at
  <a rel="contact:org" href="http://example.org">Example.org</a>.
  You can contact me
  <a rel="contact:email" href="mailto:jo@example.org">via email</a>.
</p>
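
To make "machine-readable" a little more concrete, here is a rough Python sketch of how a program might pull triples out of markup like this. It is a toy, not a conforming RDFa processor: the class and function names are made up for illustration, it only looks at the about, property, rel, and href attributes, it assumes well-formed XHTML-style markup, and it leaves prefixed names such as contact:fn unexpanded.

```python
from html.parser import HTMLParser


class TinyRDFaExtractor(HTMLParser):
    """Toy triple extractor (hypothetical helper, not a conforming RDFa parser).

    Handles only @about, @property, @rel and @href on well-formed,
    XHTML-style markup. Prefixed names such as contact:fn stay unexpanded.
    """

    def __init__(self):
        super().__init__()
        self.triples = []
        self._subjects = []   # one inherited subject per open element
        self._pending = None  # (depth, subject, predicate) while collecting text
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        inherited = self._subjects[-1] if self._subjects else ""
        subject = attrs.get("about", inherited)
        self._subjects.append(subject)
        if "rel" in attrs and "href" in attrs:
            # rel + href: the object of the triple is a resource
            self.triples.append((subject, attrs["rel"], attrs["href"]))
        if "property" in attrs:
            # property: the object is the element's text content
            self._pending = (len(self._subjects), subject, attrs["property"])
            self._text = []

    def handle_data(self, data):
        if self._pending:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Emit the literal when the element that carried @property closes.
        if self._pending and self._pending[0] == len(self._subjects):
            _, subject, predicate = self._pending
            self.triples.append((subject, predicate, "".join(self._text).strip()))
            self._pending = None
        if self._subjects:
            self._subjects.pop()


def extract_triples(markup):
    """Return (subject, predicate, object) tuples found in the markup."""
    parser = TinyRDFaExtractor()
    parser.feed(markup)
    parser.close()
    return parser.triples


SAMPLE = """
<p class="contactinfo" about="http://example.org/staff/jo">
  <span property="contact:fn">Jo Smith</span>.
  <span property="contact:title">Web hacker</span> at
  <a rel="contact:org" href="http://example.org">Example.org</a>.
  You can contact me
  <a rel="contact:email" href="mailto:jo@example.org">via email</a>.
</p>
"""

for triple in extract_triples(SAMPLE):
    print(triple)
```

The same sentence a person reads in the browser yields four statements about http://example.org/staff/jo for a program. A real RDFa tool does considerably more, such as resolving prefixes against their declared namespaces.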

An important principle has been the ability to make a web page's data readable by both eyeballs and automated processes. This is great, but there are two related issues that I feel need a higher profile. First, RDFa has great potential for storing non-eyeball information in web pages. Second, examples like the one above go after microformats on their own turf, where they're dug in pretty well. Being a more generalized, scalable solution, RDFa can do a lot more than microformats, and because many of those other applications have more commercial potential, I see them as the best growth area for the format.

First, the non-eyeballs part. When I speak about RDFa to people with a publishing background, they like its ability to store metadata such as workflow information. Some had heard of RDF in its RDF/XML incarnation, and it was just too complex for them. RDFa isn't. I submitted an example of this kind of workflow metadata usage to the RDFa Use Cases document, where it can provide a placeholder for future work. People often say that it's difficult to measure RDF adoption rates because so much of it is behind firewalls. Electronic publishing workflow metadata is a pretty classic case of this: publishers want to track various bits of information about documents as they work on them but don't want to include that information in the publicly available versions. So again, I think it's a great potential growth area for RDFa.


I wrote recently about how microformats, the semantic web, and the linked data movement are making more data available as HTTP-accessible resources. The linked data strategy is often to build a front end to a data source that lets you issue SPARQL queries against it (a "SPARQL endpoint") and/or to maintain an updated copy of valuable information to query against, as with DBpedia. Microformats and the semantic web efforts (or at least the RDFa aspect of this) compete more directly with each other, each offering ways to embed semantics and machine-readable data into web pages, so it's worth examining what each does well and what clues this offers about their future.

The microformats effort has settled on formats to represent vCard contact information and outlines in HTML, and there are various efforts to re-use existing bits of HTML markup for other domains, but there's a much longer list of failed (or rather, "moribund") microformats efforts. Microformats' hCard conventions for contact information look like a success, and the XOXO outline effort addresses a problem that RDF was never very good at anyway: imposing structure on the relationships among collections of data.

The list of moribund microformats efforts shows that the effort is moving slowly, if at all, into many new domains. My theory is that it's so slow because for each new domain a new set of things needs to be worked out: how to identify each piece of information and where to put it in the available HTML slots. They have a few design patterns to guide this process, but I know of no generalized microformats way to say that a given resource has a given field name/value pairing in a way that would work for all resources and fields. RDFa's use of actual specifications (as opposed to warm and fuzzy exhortations like "pave the cow paths" and "a way of thinking about data") makes the RDFa representation of any straightforward facts pretty simple, as long as a vocabulary exists to describe the resources and attributes. If it doesn't, you can make one up, and it can build on existing naming schemes such as SKU or ISBN numbers.

These two naming schemes in particular can cover a vast amount of machine-readable data that's worth embedding into web pages. For example, if the book with ISBN 1930220111 is for sale for $19.77, then it's pretty clear what's going on here:

<span about="http://site:www.isbn.org/1930220111" property="cbc:PriceAmount">19.77</span>

(I'm assuming for now that an application reading such data would only be interested in its developer's local currency, which leaves plenty of useful applications to write.) If you and I each have a million triples of pricing information, but you used something other than the UBL urn:oasis:names:tc:ubl:CommonBasicComponents:1:0 namespace to indicate your PriceAmount predicates, a simple OWL rule can tell a program reading these prices that you and I meant the same thing by the two different predicates we used.
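
As a rough illustration of that last point, here is a Python sketch that merges two sets of price triples once a single owl:equivalentProperty statement links their predicates. It is not an OWL reasoner, just the one substitution described above. The shop:price predicate, the urn:isbn: subject form, and the 18.50 price are made up for the example, and prefixed names stand in for full URIs.

```python
OWL_EQUIVALENT = "owl:equivalentProperty"


def merge_equivalent_predicates(triples):
    """Rewrite predicates declared equivalent so they share one name.

    A tiny union-find over predicate names. The alphabetically smallest
    name in each group is used as the representative. The
    owl:equivalentProperty statements themselves are dropped from the result.
    """
    parent = {}

    def find(name):
        parent.setdefault(name, name)
        while parent[name] != name:
            name = parent[name]
        return name

    for subject, predicate, obj in triples:
        if predicate == OWL_EQUIVALENT:
            a, b = find(subject), find(obj)
            if a != b:
                parent[max(a, b)] = min(a, b)

    return [(s, find(p), o) for s, p, o in triples if p != OWL_EQUIVALENT]


# My triples use the UBL predicate; yours use a hypothetical shop:price.
mine = [("urn:isbn:1930220111", "cbc:PriceAmount", "19.77")]
yours = [("urn:isbn:1930220111", "shop:price", "18.50")]
rule = [("shop:price", OWL_EQUIVALENT, "cbc:PriceAmount")]

for triple in merge_equivalent_predicates(mine + yours + rule):
    print(triple)
```

After the merge, a price-comparison program only has to look for one predicate, no matter how many publishers chose their own names.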

Pricing is a good example. It's a huge area where people would be happy to give away data in the form of extra embedded metadata in their web pages, because it can drive new paying customers to the source of that data (for example, to sell more copies of the book with the ISBN 1930220111). Scheduling is another example of how giving away data such as flight times or movie times can drive paying customers to an organization with something for sale. Microformats have made some progress (the German Depeche Mode party list?), but I think that RDFa can make a lot more progress here.

Let microformats do what they do best: shoehorning bits of personal data into leftover HTML attributes that no one was using (such as the abbr attribute for dates) and adding <div class="foo"></div> and <span class="foo"></span> elements in places where they wish HTML offered a foo element. That's not going to scale to more enterprise-oriented data, because there are no clear answers to questions about the relationships between the various bits of markup. For example, what does <div class="title"></div> mean? The title of an audio track or a job title? I suppose it depends whether the div element in question has a <div class="haudio"></div> ancestor or a <div class="vcard"></div> ancestor. So what role does a div element play in setting the context of its descendants? Hell if I know; a search for "div" at microformats.org just brought up "No page title matches" and "No page text matches". The documentation for the class design pattern tells us that "if an appropriate semantic element is not available, use span or div", with no clue about what might be special about div. The documentation for the elemental and compound design patterns doesn't offer any more help.

This is not a markup infrastructure that someone can take and run with to develop (or even augment) applications for arbitrary data domains. RDFa is way ahead of microformats in its ability to do this, so its best opportunities for traction are in domains with a lot of structured data that doesn't fit well into hCard format or the two or three other microformat success stories.

There are plenty of these. Those who would benefit most from giving away embedded machine-readable data are companies and other large organizations that now generate tables of HTML describing their products and services using PHP, Perl, Ruby on Rails, or other scripting languages. A few tweaks ([1], [2]) to those scripts can make a wide range of that data machine-readable RDFa in addition to being human-readable data. Let's find the people who can get those tweaks made and convince them of the value of doing so.
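
To give a sense of how small such a tweak can be, here is a sketch in Python rather than any of the languages named above. The function names, the placeholder title, the urn:isbn: subject form, and the choice of dc:title are assumptions for illustration, and the dc and cbc prefixes are assumed to be declared elsewhere on the page.

```python
from html import escape


def rdfa_row(subject_uri, title, price):
    """Return one table row whose cells double as RDFa triples."""
    return (
        f'<tr about="{escape(subject_uri)}">'
        f'<td property="dc:title">{escape(title)}</td>'
        f'<td property="cbc:PriceAmount">{escape(price)}</td>'
        "</tr>"
    )


def rdfa_table(products):
    """Build a product table from (subject_uri, title, price) tuples."""
    rows = "\n".join(rdfa_row(*product) for product in products)
    return f"<table>\n{rows}\n</table>"


# Placeholder catalog data; a real script would read this from a database.
PRODUCTS = [
    ("urn:isbn:1930220111", "Placeholder Title & Subtitle", "19.77"),
]

print(rdfa_table(PRODUCTS))
```

The visible table is what the script would have produced anyway. The only additions are the about attribute on each row and the property attribute on each cell.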

Comments


The problem I'm having with RDFa is that the semweb community avoids commitment to any common formats, whereas the microformats community shares the tools and stylesheets for parsing, which in the end enforce a loose standard. If your format shows up in operator it's ok; if not, you need to work on it. What I ended up doing was hAtom on the page, then style to Atom, then RDF.

I know you're thinking wait, we've got lots of standards in RDF. When there is a format, the XSD types are vague, sometimes even omitted. The cardinalities are often omitted. We need a place to declare "here is how we format ____ in RDFa and here is a tool that will tell you if your stuff is good or bad".

So let's just pretend we've found somebody who emits lists of products and services onto webpages...what format do we recommend they use?

Taylor

I think that rdfa.info would be a good candidate for a place to store documentation about such best practices.

I'm not sure what you mean by "shows up in operator"--I find that if I coded some RDFa correctly, any of the tools mentioned in "Getting Those Triples" at http://www.xml.com/pub/a/2007/02/14/introducing-rdfa.html?page=2 pull out the triples I expected, so I consider those to be shared, available, consistent tools. Several formats for coding the triples are possible, but I wouldn't recommend one over the other, as long as the triples extracted with the tools showed that the chosen format was used correctly.

I haven't played with typing of values in these triples much, so maybe those are less consistent; is that what you meant about XSD types?

Thanks for an interesting read, just some comments where there seems to be a bit of confusion:

The phrase "shows up in operator" refers to the Operator extension for Firefox, which is the basis for the built-in Microformats support in Firefox 3.0:

https://addons.mozilla.org/en-US/firefox/addon/4106

> with no clue about what might be special about div

Div is a block level element, use it if you need a block level element. Span is an inline element, use it if you need an inline element. I'm not sure what it is you think should be special about it? The class names can be applied to arbitrary elements as appropriate to your markup, the element is not usually significant.

Rob,

Bit of confusion is right! I was trying to learn more about div because I was wondering if there was a way to tell whether class="title" refers to the title of an audio track or to a job title, because I found both among microformats examples. (For example, does <span class='title'>Bell Boy</span> refer to someone who helps you with your luggage or the Who song from Quadrophenia?) I guessed that the criterion was whether the closest enclosing div element had a class value of haudio or vcard, but could find no confirmation of this, or anything else describing how the value of one div's class value could affect the interpretation of another one.

This is why I was trying to find some documentation about whether there's anything special about the div element. I suppose I shouldn't have assumed that there was something special about the relationship between the value in question and the enclosing div element that identified which microformat (haudio or vcard) was in use; perhaps a span element can identify which microformat is in use and indicate how to interpret what "title" means in a given context. I'm guessing, I'm making assumptions that may be incorrect, and I wish I had a place to look up the answer, but I couldn't find it. I believe that the need to make such guesses and assumptions prevents microformats from being a good format for many domains where RDFa would work well, because RDFa has a spec and it's built on a simple, documented data model.

"I haven't played with typing of values in these triples much, so maybe those are less consistent; is that what you meant about XSD types? "

Yes, that's what I meant. Looking back at your triples introduction article you used the Dublin Core element set, so I'll pick on that since we're both familiar with it. The schema for that is at http://purl.org/dc/elements/1.1/. I think dc:subject is similar to a tag set, but the schema doesn’t indicate if it’s a comma-separated string, or space, or a bag, or a set, or another datatype node. It doesn’t say anything at all, except that the property exists. dc:date is worse. Remember, we're talking machine readable. When a machine encounters a dc:date of 07/08/09, what does it think it means?

Technically RDFa has tremendous advantages over microformats, and we can be very specific with our vocabularies using RDFS and/or OWL. But the RDFa camp is lacking where it counts most...community building, and collaborative vocabulary definition.

From rdfa.info: "you choose which attributes to use, which to reuse from other sites, and how to evolve, over time, the meaning of these attributes."

That bothers me. Just sounds too loosey goosey. For meaning to have an impact it needs to be shared and agreed upon. Publishers are thinking "how do I get reach"...aggregators are thinking "how can I find semantically meaningful content". I'd like to participate in building these for my domain but I don't know where to engage, while it's pretty clear on the microformat side of things.

Taylor,

dc:subject is a property, and that's all. In my opinion, building data structures out of triples is what got RDF/XML into trouble, and I get along fine by just not doing that. If I want to say that the resource at http://www.snee.com/bobdc.blog/2007/08/automated_rdfa_output_from_dit.html has both RDF and DITA as subjects, I'll just do it as two triples,

<http://www.snee.com/bobdc.blog/2007/08/automated_rdfa_output_from_dit.html>
<http://purl.org/dc/elements/1.1/subject>
<http://www.snee.com/bobdc.blog/metadata/rdf/>.

<http://www.snee.com/bobdc.blog/2007/08/automated_rdfa_output_from_dit.html>
<http://purl.org/dc/elements/1.1/subject>
<http://www.snee.com/bobdc.blog/metadata/dita/>.

or, in RDFa,

<span about="http://www.snee.com/bobdc.blog/2007/08/automated_rdfa_output_from_dit.html"
rel="dc:subject" href="http://www.snee.com/bobdc.blog/metadata/rdf/"/>
<span about="http://www.snee.com/bobdc.blog/2007/08/automated_rdfa_output_from_dit.html"
rel="dc:subject" href="http://www.snee.com/bobdc.blog/metadata/dita/"/>

and I won't even worry about trying to treat dc:subject as a multi-valued thing. (There are other ways to do it in RDFa, especially from within the document serving as the subject, but RDFa tools will pull the same triples out of those as they will from the two span elements above.)

For machine-readable dates, ISO 8601 has no real competition; in fact, it's one of the canonical RDFa examples of using the content attribute, to do something like this:

<span property="dc:date" content="2007-03-15T15:32:00">March 15, 2007, at 3:32 PM</span>

RDF(a) aside, it says right at http://dublincore.org/documents/dces/#date that ISO 8601 format is the best practice for dc:date.
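
A small Python sketch shows the difference. The 07/08/09 value raised earlier in the thread parses to three different dates depending on which convention a program guesses, while an extended-format ISO 8601 timestamp has only one reading. The function names and the list of conventions are illustrative assumptions, not part of any RDFa tool.

```python
from datetime import datetime

# Three plausible conventions for a slash-separated, two-digit date.
CONVENTIONS = {
    "month/day/year": "%m/%d/%y",
    "day/month/year": "%d/%m/%y",
    "year/month/day": "%y/%m/%d",
}


def possible_readings(text):
    """Parse text under each convention and return ISO 8601 dates."""
    return {
        name: datetime.strptime(text, fmt).date().isoformat()
        for name, fmt in CONVENTIONS.items()
    }


def rdfa_date_span(moment, label):
    """Put the ISO 8601 value in content and the friendly text in the element."""
    return (
        f'<span property="dc:date" content="{moment.isoformat()}">'
        f"{label}</span>"
    )


for name, iso_date in possible_readings("07/08/09").items():
    print(name, "->", iso_date)

stamp = datetime.fromisoformat("2007-03-15T15:32:00")
print(rdfa_date_span(stamp, "March 15, 2007, at 3:32 PM"))
```

The reader still sees the friendly text, and the program reads the content attribute without having to guess.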

I think that microformats' lack of a spec makes it far more loosey-goosey than RDFa. There is clarity in microformats if you limit your domain to contact information or outlining, but there are many more kinds of data out there. See also my reply to Rob above.

The problems the abbr design patterns (in particular) create for users of assistive technologies make adoption of microformats pretty much impossible for government sites with accessibility requirements. So not only can RDFa do everything microformats can, and then some (as you point out), it does so without adversely impacting usability for people with disabilities.