November 13, 2009

Google, Kosmix, and The Deep Web – A Love Triangle


Alon Halevy of Google Labs and Anand Rajaraman of Kosmix went after the Deep Web in their own separate ways last night, at the SDForum Search SIG in Palo Alto.

Alon and Anand are long-time collaborators in solving the Deep Web problem, and their joint presentation last night had all the easy familiarity and good-natured competition you find with friends who go way back. Years ago, Anand’s VC firm, Cambrian Ventures, funded a company that Alon founded called Transformic Inc. Transformic, which built technology to crawl HTML forms, was later acquired by Google. Alon joined Google Labs, and Anand went on to found Kosmix with his business partner, Venky Harinarayan.

The Deep Web is simply the Web behind HTML forms. If you want to buy a car, for example, you might visit Cars.com and search for a used Toyota Prius, priced at less than $15,000 and located near Palo Alto, California. Cars.com will turn your query into an HTML page to present the results to you. A search engine won’t be able to see the page, however, because it was created just for you from a series of databases. The page becomes “lost” in the Deep Web. Tim Berners-Lee also explains in this TED video how leveraging such hidden data will drive the next innovation on the web.

According to one study, the Deep Web is estimated to be 500 times larger than the surface Web. As the number of dynamic websites and applications increases, this number will only go up. Imagine: all that data is not available to search engines!

Google’s Approach to the Deep Web

Google’s approach to the Deep Web is to find HTML forms, send input to these forms, and index the resulting HTML pages. Simple? Not quite. How do you discover these forms? Which forms do you pick? What inputs do you send to these forms? How do you parse the structured data in the result pages?

Google takes the “Less is More” approach. They drop forms used for transactions such as credit-card purchases, interactions that are submitted with the HTTP “POST” method. To send inputs to a form, Google first tries well-defined lists such as zip codes, if present. Otherwise, they compile inputs using iterative probing to discover what to send to a form. In Alon’s experience, only a small percentage of the Deep Web qualifies for indexing. This slice, however, is hugely valuable, as it is helping to answer 1000 queries a second! Google’s approach to the Deep Web is language independent, is fully automated to scale easily, answers body and tail searches, and fits nicely with the crawl infrastructure. For further insights, read Alon’s VLDB paper published in 2008.
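The steps described above can be illustrated with a small sketch. This is not Google’s implementation; it is a toy version under stated assumptions. The in-memory DOCS list stands in for a site’s database, and submit_form, is_crawlable, and iterative_probe are hypothetical names invented for this example. It shows the two ideas from the talk: skipping POST forms, and reusing words found in result pages as the next inputs to try.

```python
import re
from collections import deque

# Hypothetical stand-in for the database behind a site's search form.
DOCS = [
    "used toyota prius hybrid palo alto",
    "honda civic sedan san jose",
    "toyota camry sedan low mileage",
    "ford focus hatchback san jose",
]

def submit_form(keyword):
    """Simulate submitting one keyword to the form and reading the result pages."""
    return [doc for doc in DOCS if keyword in doc.split()]

def is_crawlable(form):
    """Keep only forms submitted with GET; POST forms are dropped."""
    return form["method"].upper() == "GET"

def iterative_probe(seeds, max_queries=20):
    """Probe with seed keywords, then reuse words seen in results as new probes."""
    queue = deque(seeds)
    tried = set()
    found = set()
    while queue and len(tried) < max_queries:
        keyword = queue.popleft()
        if keyword in tried:
            continue
        tried.add(keyword)
        for page in submit_form(keyword):
            found.add(page)
            # Every word on a result page becomes a candidate input.
            for word in re.findall(r"[a-z]+", page):
                if word not in tried:
                    queue.append(word)
    return found, tried

if __name__ == "__main__":
    pages, queries = iterative_probe(["toyota"])
    print(len(pages), "pages surfaced with", len(queries), "queries")
```

Starting from a single seed, the probe reaches pages that do not contain the seed word at all, because shared words such as "sedan" bridge to them. A real crawler would also need the query budget to keep load on the target site low.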

Kosmix’s Approach to the Deep Web

After Alon shared Google’s perspective, Anand explained that Kosmix has taken a very different approach to the Deep Web: the federated way.

Unlike Google, Kosmix does not crawl HTML forms. Instead, for any given search query, Kosmix taps into these forms in real-time through API calls, evaluates the results and organizes them into a topic page. If you wanted to look up “Pumpkin Pie” on Kosmix, for example, the system would bring you fresh content from recipe sites like the Food Network, “How To” baking videos, real-time tweets about pumpkin pie from Twitter, and information about the caloric profile of pumpkin pie from diet sites like FatSecret. A query for “AdMob,” on the other hand, will call services like CrunchBase for a company profile and Fool.com for up-to-date investor information. To provide the most relevant topic page and also avoid overwhelming these different services with too many API calls, the Kosmix system is smart enough to know which type of services to call for which query. Thus, the query for “Pumpkin Pie” would never be routed to CrunchBase. This selective routing is an important enabling factor for the federated approach.
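To make the federated idea concrete, here is a minimal sketch of fanning one query out to several services at query time and assembling the answers into a topic page. It is an illustration only, not Kosmix’s code: the services are local stub functions, and the names recipes, videos, broken, and build_topic_page are assumptions made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stub services; a real system would make API calls here.
def recipes(query):
    return [f"recipe for {query}"]

def videos(query):
    return [f"how-to video: {query}"]

def broken(query):
    raise RuntimeError("service unavailable")

def build_topic_page(query, services, timeout=2.0):
    """Call every selected service in parallel and collect results by name."""
    page = {}
    with ThreadPoolExecutor(max_workers=len(services) or 1) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in services.items()}
        for name, future in futures.items():
            try:
                page[name] = future.result(timeout=timeout)
            except Exception:
                # A slow or failing service yields an empty section
                # instead of breaking the whole page.
                page[name] = []
    return page

if __name__ == "__main__":
    selected = {"recipes": recipes, "videos": videos, "broken": broken}
    print(build_topic_page("pumpkin pie", selected))
```

Because results are fetched when the query arrives, the page is only as fresh and as fast as the services behind it, which is why calling only the relevant ones matters.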

So how does Kosmix decide which Web service to route a query to? The answer lies with Kosmix’s categorization technology. Over the past three years, Kosmix has created a taxonomy of several million nodes, which we organized into a graph, using a combination of humans and algorithms. Editors discover, integrate, and tag Web services to taxonomy nodes in a semi-automated fashion. Algorithms route the user’s query through the set of taxonomy nodes, which enables the engine to decide which Web service to call.
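A toy version of this routing might look like the sketch below. Everything in it is hypothetical: the five-node taxonomy, the service names, and the QUERY_NODE lookup (which stands in for the real categorizer) are assumptions chosen to mirror the examples in this post, and the real taxonomy is a graph rather than the simple tree used here.

```python
# Hypothetical taxonomy: each node points to its parent (None for the root).
PARENT = {
    "root": None,
    "food": "root",
    "desserts": "food",
    "business": "root",
    "startups": "business",
}

# Hypothetical editor-assigned tags: taxonomy node -> services to call.
SERVICES = {
    "food": ["recipe_site", "diet_site"],
    "desserts": ["baking_videos"],
    "business": ["investor_news"],
    "startups": ["company_profiles"],
}

# Hypothetical stand-in for the categorizer: query -> most specific node.
QUERY_NODE = {"pumpkin pie": "desserts", "admob": "startups"}

def route(query):
    """Collect services tagged on the query's node and on all of its ancestors."""
    node = QUERY_NODE.get(query.lower())
    services = []
    while node is not None:
        services.extend(SERVICES.get(node, []))
        node = PARENT[node]
    return services

if __name__ == "__main__":
    print(route("Pumpkin Pie"))
    print(route("AdMob"))
```

Walking up from the most specific node means a dessert query inherits the general food services but never reaches anything tagged under business, which is the behavior described for “Pumpkin Pie” and CrunchBase.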

After outlining the benefits of this approach, Anand dove deeper into the need to select the right sources, and touched on the challenges of discovering and integrating data sources, layout, ranking, and so on; details can be found in this year’s VLDB paper. Anand also explained how the federated approach is keeping pace with emerging Web trends such as real-time content, the explosion of Web APIs, and different content types such as videos and maps.

Digging Even Deeper

Last night’s audience, about 50 specialists in the search space from some of the Valley’s leading companies and startups, was one of the most engaged groups I have ever seen. Questions ranged from business models to how to do multi-way joins between HTML tables. Some people were even contributing ideas. If the Deep Web is important to you, then this was the place to be.

Both Google and Kosmix have compelling yet contrasting approaches to the Deep Web. It will be interesting to see if there is a winner or simply a combination of the two.

abhishek

12 Responses to “Google, Kosmix, and The Deep Web – A Love Triangle”

  1. Shree Pragada Says:

    Digvijay Lamba of Kosmix had a different approach (more of a philosophy) compared to Alon Halevy of Google Labs and Anand Rajaraman of Kosmix for searching the Deep Web.

    The Deep Web is orders of magnitude larger than the surface Web and Lamba says “no one search company can search it all — especially if you are looking for meaningful results”. To search this vast Web, search must become a platform like Amazon Marketplace or eBay offering value to everyone involved: the platform creator, App publishers, and users. I think App Store is a better example.

    Recently, my company ExeCue has launched http://WWW.SEMANTIFI.COM as a Data Search Portal. Just as App Store is a platform for mobile Apps, we envision SEMANTIFI as a platform where anyone can build Apps searching any datasets or entire verticals using ExeCue’s search technology.

    Unlike Google, Bing and other search engines which search Web Pages, SEMANTIFI can search content inside databases on the Web via publishers’ Apps.

    With ExeCue as the platform creator, a diverse community of publishers and enough number of Apps covering most web verticals, the structured content can be wired together one dataset at a time making Deep Web Search a reality.

    ExeCue built initial Apps to search SEC Filings and Analyst Ratings data of publicly traded companies. Check http://www.semantifi.com. ExeCue has already signed up publishers to build FDIC data search App to investigate performance of banks and their peers; to build FRED App to research over 22,000 Federal Reserve Economic metrics. More “data search Apps” will be added in the near future.

    Appreciate your comments.

  2. Prakash S Says:

    Is there a video of the talk? Thanks!

  3. Smart Mobs » Blog Archive » The hidden internet Says:

    [...] which search engines are bringing to the surface. I don’t know, to be honest, what fraction. [also see: Kosmix Blog posting on Rajaraman about the hidden [...]

  4. The hidden internet » iThinkEducation.net! Says:

    [...] which search engines are bringing to the surface. I don’t know, to be honest, what fraction. [also see: this Blog posting on Kosmix’s Approach to the Deep [...]

  5. {Important|Valuable} {gift|info} for {anybody|anyone} who {needs|wants} {one way backlinks|backlinks} for no {charge|cost}. {Anyone|Anybody} {need|want} free {one way backlinks|backlinks} for their {blog|webite}? I {thought|figured} I {might|would} {share Says:

    Important info for anyone who wants one way backlinks for no cost. Anyone want free backlinks for their blog? I figured I would share some great info I found recently. Free backlinks for your website. I have been using this for my websites and it absolutely works great! Click my name to see what I mean. Not selling anything, it’s totally free of charge and it works.

  6. Online Says:

    You aint got nothing on me…(run)…

  7. TopGearStreaming Says:

    Nice, love your articles. Just favd your site :)

  8. Janice Mclane Says:

    Hey thanks… What a This is a good blog. Gnarly opinions as well. Thanks!

    buy links

  9. Randall Iverson Says:

    From America – Thank you India!

    I am a 62 year old American from Chicago who has spent his life working at primarily Fortune 500 companies. I have been fascinated by the ingenuity of Indians since you have freed yourselves from some of the shackles of socialism. I was prompted to write to my one billion friends in India after watching two presentations at TED conferences by Indians that were coming up with incredible free market solutions to pressing problems. Specifically, Education, Healthcare and Government Corruption. If you or anyone you know would be interested in sharing thoughts with an American, I would like to correspond with you. Not quite sure how to do that without exposing myself to problems by publishing my email address.

    Thank you, again, India, for your ever increasing contributions to alleviate poverty and suffering in the world. Some of us in America are extremely grateful.

    Randall Iverson

  10. Sarah O'Neil Says:

    http://www.paddsolutions.com/wordpress-theme-magasin-tres/

  11. Glady Matinez Says:

    Hey, just thought that I'd let you know that I am finding it difficult to read this blog via my BlackBerry, so you might want to check on that. Cheers!

  12. Thane Rutledge Says:

    Good info here. If Kosmix has created a taxonomy of several million nodes, won’t that allow for deeper penetration of forms? What about security?

    Thanks,
    Thane R.
    Cayenne Pepper
