What's wrong with undeclared classes and properties?

It's not like the RDF spec requires them.

OK, it's a rhetorical question. I know the answer: we can attach metadata to class and property declarations, so when we know that a given instance is a member of a particular class and has certain properties, if those are declared, we know more about the instance and can do more with it, not least of all aggregate it more easily with other data that uses the same or related classes and properties.

I learned from tweets by Paul Gearon and Tom Heath that section 2.3.2 of the "Weaving the Pedantic Web" paper (pdf) presented at the Linked Data on the Web conference in Raleigh bemoans the existence of undeclared classes and attributes. I agree that this is not a good thing, but we should be careful about attacking it.

The Pedantic Web paper does point out that "such practice is not prohibited", which many people seem to forget. This reminds me of the decision to qualify merely well-formed XML as legal, parsable markup, which was one of the big breaks that XML made from SGML, or Tim Berners-Lee's decision to accept the possibility of broken links in his hypertext system, unlike those of his predecessors. Serious XML-based applications still use DTDs or schemas and well-maintained web sites use some kind of link management, but the simpler, grass roots efforts don't necessarily, and that turned out to be a great thing. It let these technologies grow to a point where millions of people can see their benefits.

If I have a triple that says

<http://www.snee.com/d/r/s3/l9d> <http://www.snee.com/8r/xa/32e>  "true"

and my subject and predicate aren't declared anywhere, it doesn't tell you much. If I have one that says this with an undeclared subject and predicate,

<http://www.snee.com/d/r/invoice#l9d> <http://www.snee.com/8r/xa/paid>  "true"

you can get a general idea of what's going on even with no declarations, as you often can from element and attribute names in XML documents that have no corresponding schemas. Unlike the XML example, though, we can see a domain name associated with "invoice#l9d" and "paid" here, which gives some context and therefore a bit of semantics about them.
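To make that concrete, here is a minimal Python sketch of the hints a human reader picks up from those two URIs. The function name uri_hints is made up for illustration, and the approach assumes that the last path segment (plus any fragment) is where a readable name would live, which is only a convention and not something RDF promises.

```python
from urllib.parse import urlsplit


def uri_hints(uri):
    """Return (host, local_name): the parts of a URI a human reader leans on.

    Hypothetical helper for illustration only. RDF itself treats URIs as
    opaque identifiers; this just mimics what a person does when reading one.
    """
    parts = urlsplit(uri)
    # Take the last path segment, e.g. "invoice" or "paid".
    local = parts.path.rstrip("/").rsplit("/", 1)[-1]
    # Keep the fragment too, so "invoice#l9d" stays recognizable.
    if parts.fragment:
        local = local + "#" + parts.fragment
    return parts.netloc, local


subject = "http://www.snee.com/d/r/invoice#l9d"
predicate = "http://www.snee.com/8r/xa/paid"
opaque_predicate = "http://www.snee.com/8r/xa/32e"

for uri in (subject, predicate, opaque_predicate):
    print(uri_hints(uri))
```

All three URIs give up the same domain name, but only the first two give up a local name that means anything to a reader, which is the difference between the two triples above.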

One great thing about RDF is that you can add on metadata after the fact, as Jim Hendler's group at RPI is doing with a lot of the US government data. Third parties certainly can't fix broken web links, and while James Clark's wonderful trang can generate schemas from documents, that's more useful as a content analysis tool than as something that you'd use to create production schemas. Adding metadata such as declarations to triples after the fact is a perfectly normal thing to do, and it helps connect those triples to each other to form a, you know, web.
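As a rough sketch of what adding declarations after the fact can look like, the following Python models triples as plain tuples in a set and merges in declarations written later. The rdf: and rdfs: namespace URIs are the standard ones; the Invoice class URI, the label, and the helper names describe and infer_types are assumptions invented for this example, and treating literals and URIs alike as strings is a deliberate simplification.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

PAID = "http://www.snee.com/8r/xa/paid"
INVOICE_CLASS = "http://www.snee.com/8r/xa/Invoice"  # hypothetical class URI

# The data as first published: no declarations anywhere.
published = {
    ("http://www.snee.com/d/r/invoice#l9d", PAID, "true"),
}

# Declarations written later, possibly by a third party (illustrative values).
declarations = {
    (PAID, RDF + "type", RDF + "Property"),
    (PAID, RDFS + "label", "paid"),
    (PAID, RDFS + "domain", INVOICE_CLASS),
}

# Aggregating triples is just set union; nothing in the original data changes.
merged = published | declarations


def describe(graph, resource):
    """List the (predicate, object) pairs known about a resource."""
    return sorted((p, o) for s, p, o in graph if s == resource)


def infer_types(graph):
    """Apply the RDFS domain rule: if p has domain C and (s p o), then s is a C."""
    domains = {(s, o) for s, p, o in graph if p == RDFS + "domain"}
    return {
        (s, RDF + "type", cls)
        for s, p, o in graph
        for prop, cls in domains
        if p == prop
    }


print(describe(published, PAID))  # nothing is known about the predicate yet
print(describe(merged, PAID))     # the later declarations now apply
print(infer_types(merged))        # the invoice picks up a class membership
```

The point of the design is that the original triple is never touched: the declarations live in a separate set, and the extra knowledge shows up only once the two are combined.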

I certainly don't want to imply that the Pedantic Web effort is doing anything wrong; their efforts to educate people about the value of doing these things with more rigor are very valuable. In the name-calling that most discussions of new technology seem to devolve into these days (pedant! fanboy! standardista!), I worry that I fall into the standardista class because I think that using the word "semantic" in your marketing literature isn't enough to qualify your work as part of the semantic web. I want to see support for relevant W3C standards involved, a position that apparently can get me lumped into the class of unreasonably demanding geeks who don't appreciate the big picture, so I wanted to point out that the (spec-compliant) optional nature of class and property declarations can be a huge contributor to the growth of the semantic web.

XML and Tim Berners-Lee's hypertext system scaled up to the point that they did because of both carefully engineered efforts and the fast growth of unrigorous ones. Careful engineering of a system using semantic web technology can get a lot of value from class and property declarations, but we should remember that the other great thing about RDF, besides the ease of adding metadata to existing data, is that triples are simple and easy to aggregate and therefore share. Let's not discourage people from doing so if they don't happen to be doing it the way that we would.

3 Comments

I've always said dereferencing is a privilege not a right; there will certainly be RDF/OWL vocabs that aren't public, even while bits of data using those vocabs might leak out. This is fine and inevitable. The reason to describe your properties and classes, and make them dereferenceable, is just that it makes folk more likely (and more able) to use them. And by documenting the 'real' vocab it makes error detection easier, since a typo in the name results in different behaviour. There are other ways around that one of course (eg. stats from aggregators).

It's nothing fancier than - 'If you want lots of people to use your stuff, document it carefully'. I don't see any huge difference here between RDF, XML or general software documentation issues.

The classic undocumented properties in RDF are rdf:_12345 etc ... maybe someone should update that schema, building on the fantastic Linked Open Numbers work? :) http://km.aifb.kit.edu/projects/numbers/


You are right, but emphasizing this doesn't bring any advantages as far as usefulness is concerned. Let's not forget RDF is meant to be consumed by machines, not humans. Machines cannot see inside URIs, nor literals... So I wouldn't call this helping the growth of the Semantic Web but rather helping the growth of Linked Data. I expect knowledge using RDFS/OWL to be called Semantic Web, but this data I would be reluctant to call knowledge since it isn't really machine understandable at all (anyway, a lot of markup-oriented people are confused by this too).

Still it's better than the mess we are in now...


@Jira, ... even if people aren't reading the RDF directly, they're still often writing software that matches its patterns, or composing queries, or running analytics. And in practice this is often done in an example-driven manner. When developers encounter a new dataset, they're far more likely to seek out example instance data, than to go meta and read the schema. The schema is there for reference and checking, but commonly skipped over until things go wrong. Examples are much more important to real usage...