All Your Google Base metadata taxonomy are belong to us

Google gives us a taxonomy.

When you upload data to Google Base, you can name your attributes whatever you like, but Google has given you a head start by providing a taxonomy of attribute names for the information that people are likely to upload, such as course schedules, jobs, and housing listings. Don't worry about whether to store your magazine name as PubName or PublicationName; the Google Base page documenting XML attributes shows that you're best off storing it as g:publication_name. (Note to XML geeks: they mean "attribute" in the database sense here, not the XML sense, despite the "XML attributes" title of the page documenting their taxonomy. See the Data Model Comparison Table mentioned in my last blog entry to compare Google Base data modeling terms with others.)

Hardcore markup people know that the "g" prefix isn't really part of the name. It's standing in for a URI that is the real identifier for a particular collection of names, and if a document declares "xxx" as the prefix that represents the same URI, then an application should treat xxx:publication_name the same as it would treat g:publication_name from one of Google's sample Google Base documents. (If you're not sure why, see Ron Bourret's XML Namespace FAQ, which I try to reread at least once a year.)

The Google Base - Provider Namespace documentation tells us that "The 'g:' prefix is reserved for the Google Base XML module and should not be used," which made me worry that someone had gotten sloppy in coding the Google Base system somewhere. To check, I took one of their sample documents, changed the declaration to xmlns:xxx="http://base.google.com/cns/1.0", changed all the g: prefixes to xxx:, and uploaded the document. Google Base did the right thing and recognized names from that namespace even with the xxx prefix. It still worries me a bit that it was much easier to find the use of the g prefix at base.google.com than it was to find the URI that it represents, because too many people still think that the prefix name is the namespace name, not a temporary stand-in for the full name to reduce markup bulk.
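Here's a minimal sketch of why the prefix swap is harmless to any namespace-aware parser. It uses Python's standard xml.etree.ElementTree module, not anything from Google Base itself. The namespace URI is the one quoted above. The item wrapper element, the "Example Weekly" value, and the expanded_names helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI quoted in the post; everything else in these documents is invented.
GOOGLE_BASE_NS = "http://base.google.com/cns/1.0"

DOC_WITH_G = f"""<item xmlns:g="{GOOGLE_BASE_NS}">
  <g:publication_name>Example Weekly</g:publication_name>
</item>"""

DOC_WITH_XXX = f"""<item xmlns:xxx="{GOOGLE_BASE_NS}">
  <xxx:publication_name>Example Weekly</xxx:publication_name>
</item>"""


def expanded_names(xml_text):
    """Return the namespace-expanded tag of every element in the document."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]


names_g = expanded_names(DOC_WITH_G)
names_xxx = expanded_names(DOC_WITH_XXX)

# ElementTree reports tags as "{uri}localname", so the prefix never appears.
print(names_g)
print(names_g == names_xxx)
```

The parser replaces each prefix with the URI it was declared to stand for, so both documents yield the same element names. An application that compares these expanded names, rather than the literal strings "g:" or "xxx:", treats the two documents identically.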

There's another important issue about Google Base for people interested in metadata and taxonomies to consider: I mentioned earlier that you don't have to worry about whether to store your magazine name as PubName or PublicationName, because Google has already picked a name. If Google Base gets legs, this taxonomy will get legs. Those of us who think of Dublin Core names such as dc:creator as pervasive and well-understood may eventually see more g:author elements than dc:creator elements out there.

Sam Ruby has pointed out some sloppiness in the taxonomy and its documentation, and I've found a bit myself. For example, the first sample file I clicked on in section 4 of the Atom 0.3 Specification documentation, news-atom-template.xml, wasn't well-formed—the first tag's second attribute value was missing a closing quote, which is a pretty unprofessional mistake when a major brand name is presenting a model for people to follow.
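This kind of mistake is cheap to catch before publishing a sample file, because any XML parser will refuse a document that isn't well-formed. The sketch below uses Python's standard xml.etree.ElementTree module. The BROKEN string is a made-up stand-in with the same kind of error (a missing closing quote on the first tag's second attribute value), not the contents of Google's actual news-atom-template.xml, and is_well_formed is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Invented stand-in: the second attribute value has no closing quote.
BROKEN = """<feed version="0.3" xmlns="http://purl.org/atom/ns#>
  <title>News</title>
</feed>"""

# The same document with the quote restored.
FIXED = """<feed version="0.3" xmlns="http://purl.org/atom/ns#">
  <title>News</title>
</feed>"""


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(BROKEN))
print(is_well_formed(FIXED))
```

Note that this checks only well-formedness. It says nothing about whether the document uses the right element names or conforms to any schema.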

Google Base also lets you create your own attributes, which is nice, and I'm sure that metadata experts will look closely at this as it develops. Meanwhile, taxonomy specialists should prepare for the possibility that, if Google Base takes off, this "Google Core" namespace may show up in petabytes of data.

Comments

(Note: I usually close comments for an entry a few weeks after posting it to avoid comment spam.)

Does Google Base even work at this point? I just tried "publishing" a few items (specifically, events) and even though their status is "Published", they don't show up in any searches.

When I tried "bulk" uploading two 2K files, they were in a "pending" status before they showed up as regular items, but there was an indication of their pending status.

Bob