
Adding semantics to make data more valuable

The secret revealed.

Storing information about the meaning of terms—their "semantics"—can make data more valuable. Critics of semantic web technology consider such talk to be pie-in-the-sky AI talk; how can you encode the real meaning of words? More importantly, how can you do it in a way that programs can read and use to solve real data problems?

[Image: undoctored Lockhorns strip]

The answer is very simple: you don't have to encode all of a term's semantics to get value from the standards and software used to do so. Let's look at an example.

What are the semantics of the word "spouse"? What does it mean to a recently engaged nineteen-year-old girl? What does it mean to a fifty-year-old man who's been divorced three times? What does it mean in a court of law in California, Mississippi, Austria, or Thailand?

That's a lot of meaning to store, but we don't need to store much to make a simple, mundane database such as an address book more valuable. Let's say my address book includes the following facts, and I want Leroy's home phone number:

  • Leroy has a work phone number of 212-334-4323.

  • Leroy has an email address of leroy@ngcorp.com.

  • Loretta has an email address of loretta031@yahoo.com.

  • Loretta has a home phone number of 718-928-6621.

  • Loretta's spouse is Leroy.

The only information I have about Leroy is his work number and his email address. I don't have his home number or any information about his spouse.

The W3C's OWL Web Ontology Language lets us declare that a property is symmetric, or as the OWL overview puts it, "if the pair (x,y) is an instance of the symmetric property P, then the pair (y,x) is also an instance of P." Given software that understands an OWL expression stating that spouse is a symmetric property, along with a rule I define to say that spouses have the same home phone number, I can retrieve Leroy's home phone number from the little "database" above. (More likely, I would define a "roommate" property as symmetric and a rule saying that roommates have the same home phone number, and then declare spouse to be a subproperty of roommate, but you get the idea.) By doing this, I'd be using the OWL declarations and my rule to pull more information out of the data collection than I put into it, making the data collection more valuable.
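To make the inference concrete, here is a minimal sketch in plain Python rather than in OWL itself. It is not how any particular OWL reasoner works. The property names (workPhone, homePhone, roommate and so on) and the three small tables standing in for the symmetric declaration, the subproperty declaration, and the shared-phone rule are all made up for illustration.

```python
# The address book facts from the list above, as (subject, property, value) triples.
# Property names such as "homePhone" are hypothetical, chosen for this sketch.
FACTS = {
    ("Leroy", "workPhone", "212-334-4323"),
    ("Leroy", "email", "leroy@ngcorp.com"),
    ("Loretta", "email", "loretta031@yahoo.com"),
    ("Loretta", "homePhone", "718-928-6621"),
    ("Loretta", "spouse", "Leroy"),
}

# Stand-ins for the schema-level statements (assumptions, not real OWL syntax).
SYMMETRIC = {"spouse", "roommate"}       # plays the role of owl:SymmetricProperty
SUBPROPERTY_OF = {"spouse": "roommate"}  # plays the role of rdfs:subPropertyOf
SHARED_BY = {"roommate": "homePhone"}    # custom rule: roommates share a home phone


def infer(facts):
    """Apply the declarations repeatedly until no new triples appear."""
    known = set(facts)
    while True:
        new = set()
        for s, p, o in known:
            if p in SUBPROPERTY_OF:
                # Every spouse statement is also a roommate statement.
                new.add((s, SUBPROPERTY_OF[p], o))
            if p in SYMMETRIC:
                # (x, P, y) implies (y, P, x).
                new.add((o, p, s))
            if p in SHARED_BY:
                # Copy the subject's shared value over to the other party.
                shared = SHARED_BY[p]
                for s2, p2, o2 in known:
                    if s2 == s and p2 == shared:
                        new.add((o, shared, o2))
        if new <= known:
            return known
        known |= new


def lookup(facts, subject, prop):
    """Return all values of prop for subject, sorted for stable output."""
    return sorted(o for s, p, o in facts if s == subject and p == prop)


print(lookup(FACTS, "Leroy", "homePhone"))         # nothing stored directly
print(lookup(infer(FACTS), "Leroy", "homePhone"))  # found after inference
```

The first lookup comes back empty because no home number was ever entered for Leroy. The second finds Loretta's number, because the spouse statement is treated as a roommate statement, flipped around, and then fed to the shared-phone rule.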

Plenty of software claims to make this kind of thing possible, but what interests me in OWL and related standards is the fact that they're standards, so that if I use OWL syntax to say "spouse is a symmetric property," a range of commercial and free software can understand and use that little bit of semantics that I've stored to help me get more work done.

It's easiest to demonstrate this with data stored using an RDF syntax, because the RDF data model has the closest fit to the subject/attribute-name/attribute-value statements in my little database above. If you prefer, though, more and more tools can keep the RDF part under the covers; my XML 2006 paper "Relational database integration with RDF/OWL" describes a related demo using address book data stored in MySQL tables. It shows a few more realistic questions that get better answers from the database because of semantics added using OWL.

There's a lot we can do with this technology...

Comments


Indeed. You don't have to know the semantics of the term "spouse", only of the concept of spouse. There are a lot of subtle different meanings for the term "spouse", and a lot of different terms for the concept of spouse, but the concept of spouse is unique in the universe.

Perhaps that's why AI in the '80s didn't work, with its emphasis on natural language.

You're telling a one-sided story here. There's potential value from adding semantics, but also potential harm. Consider my friends Mindy and Mandy, recently married in Massachusetts, but who are currently working in separate cities and visiting on weekends. I have a new phone, but it has a weird bug: whenever I try to call Mindy it rings Mandy's home phone. What the hell?

Turns out my phone has a semantically-enriched "smart" addressbook that deduces the "fact" that Mindy and Mandy are roommates (that share a home phone) from the fact that they are spouses. Or, worse, it won't let me link Mindy and Mandy as spouses because of an ontology that requires spouses to be of the opposite sex.

Of course, I don't know anything about this, because the software engineers have hidden all this semantic sausage behind a slick interface, so all I know is that my phone doesn't work right, and go out to look for a new one. All because the engineers decided to impose their worldview on me via an addressbook.

Sjoerd Visscher:

I profoundly disagree. The only way you have to even articulate "the concept of 'spouse'" to me, who wishes to know what you mean by the term, is to use words, one or many, words which are themselves polysemous. Adding a layer of concepts to pin down what is meant by words leaves you with something far harder to understand than words themselves are.

Eric Raymond tells the story of having received a letter about one of his books, the Hacker's Dictionary. He got it because he edited the dictionary: that is, he assembled it out of the evidence provided by the many participants. However, the letter was really intended for his editor, the publisher's employee who was responsible for preparing the book for publication. He, and the letter's author, were tripped up by the polysemy of "editor".

Could that have been foreseen in advance and accounted for? I don't think so. It's something that arises in the special case of dictionaries (and perhaps anthologies).

I apologize if this is too simple-minded a perspective... but something which bugged the hell out of me during many years doing data conversions was the fact that none of the commonly used data-storage tools (e.g., Excel spreadsheets) have any provision for storing the units of measure. (Which I suppose is the most basic form of metadata.)

I picked on Excel spreadsheets, because those are extraordinarily messy when used wrong, as they are 99.999% of the time (even by me.) Though actually Excel does let you put the name of the unit in the column label (though not on a row below the column label) or in a note field, or you can even use named variables (not that anyone in the business world ever does any of those things.) But database products, even fancy ones, are just as bad or worse.

The school system, BTW, does a rotten job of teaching kids how to use units (and other metadata, I suppose.) I used to grade open-ended questions on math tests, and the biggest source of trouble was when kids combined the variables wrong in ways which would have been obvious if you paid attention to the units of measure. Students usually don't get any training in this sort of thing until they begin taking quantitative courses in college, and by then it is too late, especially if the student only takes the minimum Physics for Poets and Business Math to meet their degree requirements.


**rant mode off**

Here is some meta-metadata about this thread (i.e., data about the data about the data.) The post is illustrated with a Lockhorns cartoon. Our multitalented host, Bob DuCharme, used to play guitar with a band called the Lockhorns.

Some of their tunes, including Bob's song Hiawatha, can be found at:

http://www.philipshelley.com/words/?cat=10

Thanks Tim. I've blogged here about the Hunting Accident, with a pointer to the "canonical" version of Hiawatha, at http://www.snee.com/bobdc.blog/2006/05/me_as_80s_new_york_lead_guitar.html.

I wasn't really a full-time member of that band, but as the founding member's roommate, I filled in for various members as necessary--on guitar in the recording you point to and percussion in the session that Philip describes at http://www.philipshelley.com/words/?p=60.

@Sjoerd: What you do is exactly the reason why this debate keeps going in circles. You ignore which problem is supposed to be solved. If you want to know what Bob means by spouse, you have to ask him, and of course there is nothing better than words for him to explain it to you, because both of you are human beings. But that wasn't the problem Bob attempted to solve.

What he wanted to do is to run a particular query on address book data: "What is the home phone of Leroy?". And what he shows very well, I think, is that collecting a small amount of metadata sometimes allows us to ask questions about our data that we wouldn't be able to ask otherwise. We can ask more questions without much additional effort if we enable the dumb machine to do a few inferencing steps for us. That's all.

It's completely beyond me why the simple affair of deterministic inferencing presses some kind of "philosophy button" in some people and suddenly it's all about defending the power of natural language against some evil formalism in a race to define the world. There's no need for such a defense as there is no such race. It's like defending an artistic painting showing a bridge against an architect's CAD software that helps him design one. I've never heard painters and architects have such debates. Amazingly, we have them all the time in IT.

Sorry, my reply was directed at John Cowan, not at Sjoerd Visscher.

And just to hammer in my original point, "what Bob means by spouse" was that it's a symmetrical property, and that's all I meant, and that alone is useful.