SearchMonkey and RDFa

What am I missing?

Yahoo! SearchMonkey is one of those interesting, RDF-related technologies that I'd been meaning to check out for a while. When I saw how much of the reaction to Google's Rich Snippets consisted of people such as Ryan Smith, or Peter Mika in the May Semantic Web Gang podcast, saying that Google was just doing what SearchMonkey had already done, I knew that it was time to look more closely at SearchMonkey.

I wanted to see support for RDFa embedded in HTML, and to be honest, I only see it in SearchMonkey if I squint while I'm looking and tilt my head slightly sideways. Perhaps I'm missing something, and I hope someone points it out to me.

According to the Site Owner Overview, there are two ways to take advantage of SearchMonkey: Standard Enhanced Results or Custom SearchMonkey Applications.

Standard Enhanced Results

The Site Owner Overview page says this is "Currently available for certain content types such as Video, Games, and Documents". Sounds good to me; I'm very interested in adding metadata to documents. According to the Documents page, though, "the Yahoo! Search document reader currently supports Flash documents only". If you want to use RDFa to identify specialized metadata for Yahoo to use when they return your document in a search result list, your document must be stored as a Flash document, and you then embed your metadata in the attributes of an object element that points at that document.

I think it's great that this lets us use RDFa to assign metadata to SlideShare and Scribd documents, but if this has such a strong dependency on a binary format controlled by a single software company, I'm not that interested.

Custom SearchMonkey Applications

OK, so I don't want to see a shared web publishing infrastructure have such dependencies on this proprietary binary format. The SearchMonkey Getting Started page tells us: "Don't have Flash objects? Or want to build an app to display custom enhanced results? Head on over to the SearchMonkey Developer Tool to build an app where you can display a custom image, extract structured data from your site, [or] link to pages within your site". This sounded a bit better.

According to the SearchMonkey Application Dashboard page, "Presentation Applications are small PHP apps that display enhanced search results using data services. You can use an existing data service or create a custom service below". When I went through the steps of building a Custom Data Service based on an existing one, it asked me for a URL pattern to specify pages where it should look for data and URLs that fit that pattern to use for testing. Then, it showed the XSLT that it would use to extract data, displayed in an edit box where I could customize it.

You use this stylesheet to "specify XSLT code for extracting information from the page and representing that information as DataRSS". Despite the admonition to "avoid using namespaces in your XPATH expressions, as SearchMonkey strips these out", this looked like something I could work with once I got to know the DataRSS format. (There's a schema on that page to use for testing your stylesheet output.)

So if I point Yahoo at some documents and write a stylesheet that goes through those documents and returns DataRSS, SearchMonkey can use this. I could put RDFa in those documents and have my stylesheet get DataRSS data out of that... but I could also make up my own BobFooBar format to embed in the HTML and have my stylesheet get DataRSS out of that as well, so I don't really see how this counts as RDFa support.
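To make that point concrete, here is a minimal Python sketch. It is a stand-in for the kind of extraction the stylesheet would do, not XSLT and not anything SearchMonkey runs. The names PropertyCollector and extract_pairs and the data-bobfoobar attribute are all made up for illustration, the (name, value) tuples it returns are not DataRSS, and it ignores most of what real RDFa processing involves (subjects, prefixes, nesting).

```python
from html.parser import HTMLParser


class PropertyCollector(HTMLParser):
    """Collect (name, value) pairs from elements that carry a given attribute.

    Hypothetical helper: a real RDFa processor does far more than this.
    """

    def __init__(self, attr_name="property"):
        super().__init__()
        self.attr_name = attr_name
        self.pairs = []
        self._pending = None  # name still waiting for its text value

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get(self.attr_name)
        if name is None:
            return
        if "content" in attrs:
            # RDFa-style: an explicit content attribute supplies the value
            self.pairs.append((name, attrs["content"]))
        else:
            # otherwise use the first non-blank text inside the element
            self._pending = name

    def handle_data(self, data):
        if self._pending is not None and data.strip():
            self.pairs.append((self._pending, data.strip()))
            self._pending = None

    def handle_endtag(self, tag):
        self._pending = None


def extract_pairs(html, attr_name="property"):
    """Return the (name, value) pairs found in an HTML string."""
    collector = PropertyCollector(attr_name)
    collector.feed(html)
    collector.close()
    return collector.pairs


# A page marked up with RDFa-style attributes
RDFA_PAGE = """
<div xmlns:dc="http://purl.org/dc/elements/1.1/">
  <h1 property="dc:title">SearchMonkey and RDFa</h1>
  <span property="dc:creator" content="Bob"></span>
</div>
"""

# A page marked up with an invented, non-standard attribute
BOBFOOBAR_PAGE = '<p data-bobfoobar="creator">Bob</p>'
```

Calling extract_pairs(RDFA_PAGE) and extract_pairs(BOBFOOBAR_PAGE, "data-bobfoobar") both work, and the second call needs nothing beyond a different attribute name. When the extraction logic is something I supply myself, the standard format gets no special treatment.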

The Semantic Web community is still trying to piece together the nature of Google's support of RDFa in HTML documents, and there are things to complain about, but we know that their crawlers will look for some sort of RDFa in HTML documents. This looks like a real step forward for support of standards-based metadata on the web by a major search engine. Perhaps my review of the SearchMonkey options is missing something, but so far I haven't seen anything to show me that what they offer is something for people interested in open web standards to get excited about.

Again, if I'm wrong about any of this, I'd be happy to be corrected.

9 Comments

SearchMonkey is similar to the tripblox concept where other sites provide the RDFa...search monkey sees it, and therefore can list items in a more meaningful way.

I don't think there are any RDFa tie-ins, but Microsoft Bing has this flavor too. You can type "hotels in" and you're shopping for hotels on a map, but in a vendor/supplier neutral way.

So sites like Expedia, Orbitz, Travelocity write software to list travel search results. We know the content is travel related (hotel/air/car) and have custom views for that...so the search is vertical, a specific domain. Now the horizontal search tools are finding ways to semantically recognize content and list it in horizontal specialized ways. viewzi is another example.

So the very wide, general implication I see is that search tools are getting better, and allowing users to search supplier agnostic, price compare, and then they arrive at the vertical site ready to make a purchase. RDFa makes it possible for the small fries to be seen by "big vertical search" and have their results listed in a very meaningful way. For example, a hotel could be listed just as elegantly on search monkey as it's listed on expedia...and since search monkey gives you expedia/travelocity/orbitz results + the small fry suppliers with RDFa on their site, where do you start searching?


Taylor,

What you're saying in general makes sense to me, but...

>sites provide the RDFa...search monkey sees it

I couldn't find evidence that SearchMonkey sees any RDFa besides that which is embedded as attributes in object elements that point to Flash files. Other RDFa use by SearchMonkey depends on XSLT translation of that RDFa to DataRSS, which is what SearchMonkey is really using... right?


Hello Bob,

Rest assured, SearchMonkey does see the RDFa you add to a page. When the Yahoo! crawler hits your page, we extract any valid RDFa we find. For each URL, we store that data as a chunk of DataRSS XML. DataRSS is our way of normalizing between all the different types of structured data we might have for a page: RDFa, eRDF, various microformats, feeds, Delicious data, anything else.

If the DataRSS on a URL matches a pattern that we're expecting, then we automatically display that URL as an enhanced result -- that's our Flash video/documents/games functionality. Google Rich Snippets is the same thing, but for different use cases (like reviews, etc.) Rest assured, both teams are working to add more. :)

For arbitrary RDFa where we don't have an automatic presentation, you can use SearchMonkey to create a custom presentation. The SearchMonkey developer tool allows you to build a little PHP app that digs into the DataRSS XML using XPATH and tells Yahoo! Search how to display that data.

Note that you do not have to write any XSLT to use RDFa. You're right that if you create a BobFooBar format in your HTML, then we don't understand that format at all. Which means if you want to get at it using SearchMonkey, yes, you would have to build what we call an "XSLT Custom Data Service." But if you use RDFa, a format we do understand -- then we are essentially running that XSLT for you, at index time.

Finally, you can also call our BOSS Search APIs and get all our RDFa + other structured data back as DataRSS XML or RDF/XML (your choice). Basically, Yahoo! crawls the web harvesting structured data, and you can use BOSS to reflect that data back at you.

Best,

Evan Goer
Yahoo! SearchMonkey Team


Thanks Evan, this sounds more promising.

>For arbitrary RDFa where we don't have an automatic presentation

I assume that the RDFa where you do have an automatic presentation is a set of names from specific namespaces, e.g. dc:creator. Is this set documented somewhere? I get the impression from what you write that I can embed RDFa using these names as predicates into an HTML document, and that this metadata may show up as part of a search result.

>you can also call our BOSS Search APIs and get all our RDFa + other
>structured data back as DataRSS XML

If dc:creator is part of the set documented above, would this let me query the documents for which you have DataRSS metadata stored for dc:creator='Tim Berners-Lee' and have the documents returned if they're there? Including HTML documents as described above?


That's right, the automatic SearchMonkey presentations are triggered off of certain namespaces. For example, you can trigger a Video result using media:video and media:thumbnail. You can also change the title, abstract, etc. by including a dc:title or dc:description.

Viewing the metadata in search results: well, beyond fancy presentations, what we've got right now are some very crude filters.

With BOSS, you could create something slightly more powerful. You could say, "give me the top 100 results that have RDFa and have the term 'Tim Berners-Lee'". Then your BOSS app could sift through these results and return the URLs that have a dc:creator='Tim Berners-Lee'. But we don't yet support arbitrary SPARQL queries into the Yahoo! Search index. That's more like the "Web Of Objects" that our execs were talking about last month.


> the automatic SearchMonkey presentations are triggered off of certain
> namespaces....media:video... media:thumbnail... dc:title... dc:description.

Is there a comprehensive list of these namespaces and properties somewhere?

> But we don't yet support arbitrary SPARQL queries into the
>Yahoo! Search index.

That would be cool, but I think it would be much simpler to allow queries that return documents that have the RDFa equivalent of (<>, p:foo, "bar") in them. You tell us what p:foo predicates we can use, we specify "bar", and you return each document that has p:foo="bar" in it.
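Here is a rough Python sketch of the kind of lookup I mean. It assumes an in-memory table mapping each document URL to the (predicate, value) pairs extracted from it. The function names, the example.com URLs, and the sample data are all invented for illustration, and none of this reflects how the Yahoo! Search index or BOSS actually works.

```python
from collections import defaultdict


def build_index(doc_pairs):
    """Map each (predicate, value) pair to the set of URLs that state it.

    doc_pairs: dict of URL -> list of (predicate, value) tuples (hypothetical shape).
    """
    index = defaultdict(set)
    for url, pairs in doc_pairs.items():
        for predicate, value in pairs:
            index[(predicate, value)].add(url)
    return index


def find_documents(index, predicate, value):
    """Return the sorted URLs of documents that have predicate=value."""
    return sorted(index.get((predicate, value), set()))


# Invented sample data standing in for metadata harvested by a crawler
DOCS = {
    "http://example.com/a.html": [("dc:creator", "Tim Berners-Lee"), ("dc:title", "A")],
    "http://example.com/b.html": [("dc:creator", "Someone Else")],
    "http://example.com/c.html": [("dc:creator", "Tim Berners-Lee")],
}
```

With this data, find_documents(build_index(DOCS), "dc:creator", "Tim Berners-Lee") returns the a.html and c.html URLs. It is an exact-match lookup on one predicate and one value, with no query language involved.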


For the automatic SearchMonkey presentations, all the namespaces and properties are scattered across the different documentation pages under http://developer.yahoo.search.com/start.

As for supporting a simpler query syntax: I'll bring it up to our architect!


I'm guessing that you meant http://developer.search.yahoo.com/start and not http://developer.yahoo.search.com/start.

Compiling those namespaces and properties into a single document would be a big boost to usage of SearchMonkey by the semantic web community considering how little work it would be.

Thanks again!


Just added SearchMonkey product objects to our pages. Yahoo tells us that the products are found, but we don't see them in search results. Very strange.