Archive for the ‘Kosmix Technology’ Category

October 28, 2009

Wikipedia and the Semantic Web – Part 2 »

About a month ago I posted (here) my thoughts about how Wikipedia can improve the Semantic Web. My take is that Wikipedia can provide a global and ever-improving vocabulary for bloggers and other content creators to provide richer context around what they write.

Several people contacted me after reading the post to ask about the best way to annotate their content, and to find out what else I think Wikipedia needs to do to make it easier to create Semantic Web pages. The big question seemed to be: What context can bloggers add so that search engines and others understand their posts?

I’ll use a simple scenario to illustrate my answer to this question. Let’s say I am about to write a blog post on the healthcare debate. Obviously, I want to tell search engines that I am talking mainly about the http://en.wikipedia.org/wiki/2009_US_healthcare_debate. And within that context I want to discuss the http://en.wikipedia.org/wiki/United_States_Democratic_Party and the http://en.wikipedia.org/wiki/United_States_Republican_Party. As you can see, Wikipedia provides me with a clear vocabulary to uniquely identify the different “Entities” that I want to talk about in my post. There is a unique URL for every entity. This will not work for entities that are not popular enough to have Wikipedia pages, but it is a good start. It is also only a small step beyond “Tagging”, a common way to annotate today.

Next, as I talk about different entities, I may want to explicitly state the connections I am making. For example, I’ll mention http://en.wikipedia.org/wiki/Michelle_Obama and want to add the fact that I am commenting on her impact as the “Wife” of the President in their personal relationship, and not as the First Lady. Wikipedia does give us a lot of information on how different entities are related to each other. However, the vocabulary is far less organized and many of these relationships do not have unique names. Some of the fact boxes at the bottom of Wikipedia pages called “Templates”, like the one at http://en.wikipedia.org/wiki/Template:United_States_topics, are even less structured and uniform. Wikipedia needs to evolve to a more structured hierarchy and schema for relationships. Without it, it will remain hard for content creators to add richer information and make new relationships evident.

Lastly, I may want to annotate with information on what kind of content it is. Am I talking about some great “Videos” or “Documentaries”? Am I writing with a “Liberal” view? Am I discussing some recent “News”? Is this a “Review” of the administration’s efforts? The last piece of Semantic Web annotation is specifying the kind of content I am creating, rather than the topics of my content. Obviously, Wikipedia does not have the vocabulary that allows me to specify this and I must look elsewhere.
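As a rough illustration of the three layers of annotation discussed above (entities, relationships, and kind of content), here is a minimal Python sketch. The record layout, the field names, the relationship name "spouse", and the Barack_Obama page title are assumptions made for this illustration; neither Wikipedia nor any standard defines this format.

```python
import json

WIKI = "http://en.wikipedia.org/wiki/"

def entity(title):
    """Return the Wikipedia URL that uniquely identifies an entity."""
    return WIKI + title

def annotate(entities, relations, content_kinds):
    """Bundle the three layers of annotation into one record (hypothetical layout)."""
    return {
        "entities": [entity(title) for title in entities],
        # Relationships have no unique Wikipedia URL, so the predicate
        # stays a plain string: exactly the gap described above.
        "relations": [
            {"subject": entity(s), "predicate": p, "object": entity(o)}
            for s, p, o in relations
        ],
        # Kinds of content come from outside Wikipedia's vocabulary.
        "content_kinds": list(content_kinds),
    }

post_annotation = annotate(
    entities=[
        "2009_US_healthcare_debate",
        "United_States_Democratic_Party",
        "United_States_Republican_Party",
        "Michelle_Obama",
    ],
    relations=[("Michelle_Obama", "spouse", "Barack_Obama")],
    content_kinds=["Review", "Liberal"],
)
print(json.dumps(post_annotation, indent=2))
```

Only the entity layer is backed by real, shared identifiers here; the other two layers rely on names that every author would have to agree on separately.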

In the end, we have to take baby steps in our goal for rich semantic annotation of Web content. Automated tools are already attempting to do this for content that has already been created. Will the automated methods improve fast enough that there will never be a need for content creators to annotate? Or will having a vocabulary and an easy method of annotation give enough advantage to the content creators that we will see widespread adoption? My guess is that the answer lies somewhere in between.

More accurate annotation already allows better cross linking and makes it easier for users to find your content, both from search engines and other sources. It also allows innovative startups to use your content in rich ways and drive traffic to you. At the same time automated annotation techniques are improving. In the end a “Semi-Automated” solution that allows you to influence how your content is annotated and, with improving technology, reduces the effort it takes will be the winner.

digvijay
September 14, 2009

Why Wikipedia Can Make a Giant Leap Ahead for the Semantic Web »


Every time I want to look up facts, read about a topic, or am curious I go to Wikipedia. So does everyone else. Wikipedia is a brilliant idea for the greatest compendium of knowledge ever created. Unfortunately, for a machine it is a blob of data with very little meaning… all that information is too hard to understand.

Semantic Web is an idea that has been alive for a while. The Internet has revolutionized our access to information. If computers could access that information and process it, imagine how much more we could do with it. But computers are not as smart as humans and we need to talk to them in a different language. We need Web pages and blogs and news and email to be written in this language that computers can understand, a language far simpler and more structured than English. The Semantic Web’s goal is to create this language and make it possible for people who create content to use it every day.

When did you last learn a new language? Can your neighborhood blogger read French? Or Chinese? New languages are hard to learn and content creators need a BIG incentive to master and use them. Google, Bing, or Kosmix could, for example, provide that incentive by ranking Semantic Web Pages higher, as they can understand them better.

All right, so we know why this language is needed and we have some idea of the incentives that will push people to learn it. But where is this language? Who defines it? The World Wide Web Consortium (W3C) has been defining standards like “RDF” and “OWL” that they hope will lead to this global language. But these standards are really an “Alphabet” that tells us how to write this language. What they don’t give us is a dictionary, a vocabulary called a “Schema” that will help computers understand this language.

What is needed and what the proponents of the Semantic Web have failed to create is this global “Schema” of types of things and their relationships. A vocabulary that works for all the content on the Web. Something that tells a program that an iPhone is a “Mobile Phone” which is a “Phone” which is a “Communication Device” and also a “Personal Electronics Gadget”. A “Schema” is easily extendable and the extensions are easily standardized. There were no “eBook Readers” a few years ago. Kindle is an “eBook Reader.” Both my computer and yours should call it an “eBook Reader.”
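To make the idea of a shared “Schema” concrete, here is a small Python sketch of a type hierarchy built from the iPhone and Kindle examples above. The dictionary layout and function names are hypothetical, and attaching “Personal Electronics Gadget” as a second parent of “Mobile Phone” is an assumption about how the example is meant to be read.

```python
# Hypothetical schema: each type lists its direct parent types.
SCHEMA = {
    "Mobile Phone": ["Phone", "Personal Electronics Gadget"],
    "Phone": ["Communication Device"],
}

# Each known thing is assigned one direct type.
INSTANCES = {"iPhone": "Mobile Phone"}

def all_types(thing):
    """Return every type a thing belongs to, following parent links transitively."""
    found = []
    stack = [INSTANCES[thing]]
    while stack:
        type_name = stack.pop()
        if type_name not in found:
            found.append(type_name)
            stack.extend(SCHEMA.get(type_name, []))
    return found

def is_a(thing, type_name):
    """True if the thing belongs to the type, directly or through a parent."""
    return type_name in all_types(thing)

# Extending the schema: a new type and a new instance, added in one place
# so that every program consulting the schema sees the same name.
SCHEMA["eBook Reader"] = []
INSTANCES["Kindle"] = "eBook Reader"

print(is_a("iPhone", "Communication Device"))
print(is_a("Kindle", "eBook Reader"))
print(is_a("Kindle", "Phone"))
```

The point of the sketch is the extension step: adding “eBook Reader” once is enough for both my computer and yours to call a Kindle by the same type.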

Who can create such a language? They would need the largest compendium of information ever created. They would need an easy way for the world to edit and change this compendium over time. They would need a process by which every piece of information in the compendium can be “defined” by a common schema. You see where I’m going here: They would need to be Wikipedia.

Frankly, this is an opportunity that Wikipedia has missed. Don’t get me wrong, Wikipedia has a lot of structure for humans, and lots of companies and researchers are writing sophisticated programs to understand this structure. But this structure isn’t yet complete or visible enough to be used by other content creators. If Wikipedia can evolve into a compendium of information that can also create and maintain this vocabulary, we can have another revolution with Wikipedia at the center. The amazing thing is, this is only a small step from where Wikipedia is today.

At Kosmix, we write sophisticated programs that understand pages on the Web, including Wikipedia. We want our programs to understand what people are writing so we can connect that information to those looking for it. But it would take decades for computing power and technology to grow enough to truly understand the English language. Another revolution in Wikipedia can skip the world ahead by a few decades.

digvijay
August 31, 2009

Moore’s Law and Web Search »

In 1965 Gordon Earl Moore, then a co-founder of Fairchild Semiconductor, predicted that the number of silicon components on a single integrated circuit chip would double every 12 months. He later went on to start Intel with Robert Noyce and worked to make that prediction a reality.

True to his prediction, the number of transistors on a microprocessor has been doubling roughly every 18 months for the past 45 years, and this trend has come to be known as Moore’s Law. It’s estimated, for example, that the semiconductor industry produced more transistors in 2005 than the number of grains of rice produced that year. For a deep look at the remarkable history of semiconductors, explore this exhibit at the Computer History Museum web site.

Unlike Newtonian laws of motion or laws of electromagnetism, Moore’s Law is not one of nature but of human endeavor. In a symposium honoring the 40th anniversary of that prediction, Moore himself stated that the law is based more on the economics of the industry than on improvements in any particular material or process, and has room to go on for another decade at least. In 2009, Intel CEO Craig Barrett said “We can scale it down another 10 to 15 years. Nothing touches the economics of it.” There has been ample speculation and analysis (like this one by author and blogger Andrew Curry) on when Moore’s Law will end, but another decade of it will surely lead to a lot of creative destruction. Former chief editor of Wired magazine Kevin Kelly goes so far as to ask: was Moore’s Law inevitable?

Fundamentally, Moore’s Law upends the status quo by doing two things. In every cycle (18 to 24 months) the cost of silicon components falls by a constant factor and their performance improves by a constant factor. Such an exponential decrease in costs and increase in performance offers ample opportunities for nimble and new players to outsmart the slow and old ones and win customers.
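To get a feel for what exponential improvement means, the short Python sketch below computes the cumulative growth factor over the 45 years mentioned above for a few doubling periods. The function name is illustrative, and the calculation assumes a perfectly steady doubling rate, which real chips only approximate.

```python
def growth_factor(years, months_per_doubling):
    """Cumulative multiplier after the given number of years of steady doubling."""
    doublings = years * 12 / months_per_doubling
    return 2 ** doublings

# 45 years at one doubling every 18 months is 30 doublings,
# i.e. a factor of 2**30, a little over one billion.
for months in (12, 18, 24):
    print(f"doubling every {months} months: x{growth_factor(45, months):.3g}")
```

Even the slower 24-month cycle compounds to a factor in the millions, which is why small differences in cycle length matter so much over decades.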

What are the implications of Moore’s Law for web search?

Web search as practiced by Google, Yahoo and Bing today essentially boils down to building this stack : crawling, indexing, ranking and presenting web sites as search results to a keyword search.  Each layer of this stack is affected by Moore’s Law differently and offers different opportunities.

  1. Crawling: Crawling is inherently a massively parallel task. You maintain a huge distributed crawl queue and manage several processes pulling the data from the web based on that queue. Cheaper hardware and bandwidth lower the cost of doing such a crawl every cycle. But unfortunately the amount of data on the web has also been growing exponentially. The web is said to have grown from 5 exabytes of data in 2002 to 281 exabytes of data today. One would need enormous cash reserves to fund a crawl that big despite cheapening hardware. Newcomers without access to such cash reserves are at a huge cost disadvantage.
  2. Indexing: Indexing is the task of organizing all the crawled data in ways that influence ranking and presenting. Modern day search engines build what is called a reverse index that maps keywords to the websites in which they appear. The process of indexing is a massive, resource-intensive task, and newcomers are usually at a cost disadvantage, despite cheapening hardware.
  3. Ranking: Ranking is the task of retrieving relevant results and ordering them. The secret sauce of each search engine is said to reside here. It involves proprietary algorithms and is carefully controlled. Google is said to use about 200 signals that influence ranking. In some ways Moore’s Law does not directly affect this layer, even though most of the analysis that one would need for ranking is done during the indexing phase. Fortunately, offerings like Yahoo BOSS give newcomers the chance to improve upon the ranking without building a huge crawl or index, and the cost of such services is coming down with each cycle.
  4. Presenting: Presenting is the task of rendering search results in a manner that’s efficient, useful and easily consumable. Despite the heavy lifting that happens at the backend in the previous layers, what a typical web search consumer sees is one text box and a list of ten blue links on a sparse search results page. Moore’s Law offers the biggest opportunity for newcomers here as consumers upgrade their personal computers each cycle. Studies show that users expect to see results to their keyword searches in half a second. And, with Moore’s Law, what one can do in that half a second improves exponentially every cycle, both on the client side and on the server side. What was not possible to do in the last cycle suddenly becomes possible. With the rekindling of the browser wars, tomorrow’s web browser will be vastly more powerful, taking advantage of the next generation of multi-core processors. Browsers will offer much richer rendering and new forms of interaction. Newcomers to web search will have the best opportunity to disrupt the status quo by riding that wave. One need not limit the interface to keyword search. With the explosion of touch and voice interfaces in the last year or so, new ways of input will emerge that will make today’s keyword search seem archaic.
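The four layers above can be sketched in miniature. The Python below is a toy, not how any production search engine works: the “crawl” is a hard-coded dictionary of hypothetical pages, the index is a simple keyword-to-pages map, and ranking uses a single signal (how many query keywords a page contains) where a real engine would combine many.

```python
from collections import defaultdict

# Crawling: stand-in for fetched pages (hypothetical URLs and text).
PAGES = {
    "http://example.com/a": "moore law and web search",
    "http://example.com/b": "web search ranking signals",
    "http://example.com/c": "grains of rice",
}

def build_index(pages):
    """Indexing: map each keyword to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Ranking: order pages by how many query keywords they contain."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for url in index.get(word, ()):
            scores[url] += 1
    # Highest score first; ties broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))

def present(results, limit=10):
    """Presenting: the familiar numbered list of links."""
    for position, url in enumerate(results[:limit], start=1):
        print(position, url)

index = build_index(PAGES)
present(search(index, "web search law"))
```

In this toy, almost all of the work sits in `build_index`, which mirrors the point that most ranking analysis happens at indexing time, while `present` is the thin layer where richer interfaces could be swapped in.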

posted by Manyam Mallela

manyam
August 11, 2009

Google Gets Amped on Caffeine »

The internet is abuzz today with Google’s announcement of a next-generation search project that the internet has fondly dubbed “Google Caffeine”. I read the news along with my own morning shot of caffeine, and it immediately increased my sense that exciting things are happening again in the search space.

Take the last few months. The launch of Bing, the MicroHoo deal, a search box taking prominent place on the twitter.com homepage, Google launching the “Wonder Wheel”, Facebook turning on Real-Time search, and today’s news on Google Caffeine. Clearly, the search industry itself seems to be on a heavy dose of caffeine recently.

Google Caffeine is probably a major backend release for the Google search engine, and, like any other release, it will be faster and better. After all, search engines need to constantly innovate and update to keep up with the competition, and Google has always done that better than anyone else. You can try it here. I definitely see an incremental improvement in quality.

Things get more interesting when you consider that Google clearly must have several big releases for their search product every year. However, lately Google not only has to improve its product, it must also communicate these improvements and be “seen” to improve its product. After several years of stability, search is heating up, and the market leader must maintain its brand or lose it forever.

This is a typical cycle in any industry reliant on innovation: a short period of rapid innovation, followed by a longer period of consolidation and stability, followed again by a rapid period of innovation. After years of the traditional search model, search seems to be entering its next period of rapid innovation, and all the big players, as well as the smaller ones, must innovate or be left behind.

So to me, in my moments of caffeine-induced clarity, I see “Google Caffeine” as another indication that we are about to see a revolution in search. My hope is that this will deliver on search’s many possibilities!


digvijay
August 10, 2009

Deep Web: The Hidden Treasure »

(Iceberg image ©Ralph A. Clevenger)


Experts estimate that search engines can access less than 1% of the data available on the Web, only the tip of the iceberg. Where is the rest of the Internet’s data? It’s lurking in the Deep Web.

What is the Deep Web?

The Deep Web is defined as dynamically generated content hidden behind HTML forms that is inaccessible to search engines’ crawlers. The Deep Web is also referred to as the hidden or invisible Web.

The Deep Web consists of three key elements: (1) pages and databases accessible only through HTML forms; (2) disconnected pages not accessible to crawlers; and (3) password protected and subscription only sites. Some people also include real time Web data as a part of the Deep Web, since it’s changing so fast that traditional search engines are not able to surface it in their results.

How Vast is the Deep Web?

According to one study by Michael K. Bergman in 2000, the Deep Web accounted for 7,500 terabytes of data. At that time, search engines could index only tens of terabytes of data. By 2004, a subsequent study by Kevin Chang and his colleagues estimated that the Deep Web had grown to more than 30,000 terabytes of data. At this rate, one can only imagine how vast it must be today, particularly given the growing ubiquity of the Internet over the past five years. Such an enormous amount of data holds a huge wealth of information; the key is figuring out how to access it.

Is it Possible to Access the Deep Web?

Absolutely—though it’s not easy. There are two main approaches to accessing Deep Web data:  run-time integration, and off-line indexing.

In the run-time integration approach, one has to build a system that performs the following tasks: first, figure out the appropriate forms that are likely to have results for the given query terms; second, map the query terms suitably to search those forms and integrate the results from various forms; and third, extract relevant parts of results to display. This approach enables a richer experience for users, and sites like Cazoodle.com seem to rely on this method.

But there are some drawbacks to run-time integration. It’s extremely difficult to figure out appropriate forms for the given query terms. In addition, mapping query terms to search those forms and extracting information from the results are highly labor-intensive tasks and not very scalable.
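The three run-time integration steps can be sketched as follows. This Python example is only an illustration under heavy assumptions: the sites, topic sets, field names, and the functions standing in for real form submissions are all hypothetical, and no network requests are made.

```python
def recipes_form(params):
    """Stand-in for submitting a recipe site's search form."""
    return [f"  Recipe: classic {params['dish']}  "]

def nutrition_form(params):
    """Stand-in for submitting a nutrition site's search form."""
    return [f"Nutrition facts for {params['food']}"]

def books_form(params):
    """Stand-in for submitting a bookstore's search form."""
    return [f"Books titled {params['title']}"]

# Hand-written descriptions of each form: which topics it covers
# and which input field the query terms should be mapped to.
FORMS = [
    {"site": "recipes.example", "topics": {"ravioli", "pasta"}, "field": "dish", "submit": recipes_form},
    {"site": "nutrition.example", "topics": {"ravioli", "calories"}, "field": "food", "submit": nutrition_form},
    {"site": "books.example", "topics": {"novel"}, "field": "title", "submit": books_form},
]

def select_forms(query, forms):
    """Step 1: pick the forms likely to have results for the query terms."""
    terms = set(query.lower().split())
    return [form for form in forms if terms & form["topics"]]

def run_query(query, forms):
    """Steps 2 and 3: map the query onto each form, submit, extract, and merge."""
    results = []
    for form in select_forms(query, forms):
        params = {form["field"]: query}
        for raw in form["submit"](params):
            results.append({"site": form["site"], "snippet": raw.strip()})
    return results

for result in run_query("Ravioli", FORMS):
    print(result["site"], "->", result["snippet"])
```

The hand-maintained `topics` and `field` entries are exactly the labor-intensive part: every new form needs its own description and its own extraction logic.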

In the off-line indexing approach to access Deep Web data, one has to construct a set of queries to search through forms, process the queries through forms while off-line, and index the result. Once the query set is constructed, this approach can reuse the search engine infrastructure for crawling, indexing results, and index serving.

Google has taken this approach to surface Deep Web content. However, algorithmically constructing input values for forms is a non-trivial task. Furthermore, this approach cannot be applied to HTML forms that use the HTTP POST method, since the resulting URLs are the same and the form inputs are part of the HTTP request body rather than the URL.
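A minimal Python sketch of the off-line idea: for a form submitted with GET, each combination of input values yields a distinct URL that an ordinary crawler can fetch and index. The form description and candidate values below are hypothetical and hand-picked; choosing good values automatically is the hard part, and this is not a description of Google's actual system.

```python
from itertools import product
from urllib.parse import urlencode

def surface_urls(form):
    """Enumerate one crawlable URL per combination of candidate input values."""
    if form["method"].upper() != "GET":
        # POST inputs travel in the request body, so every submission
        # shares the same URL and there is nothing distinct to index.
        return []
    names = sorted(form["inputs"])
    urls = []
    for values in product(*(form["inputs"][name] for name in names)):
        query_string = urlencode(dict(zip(names, values)))
        urls.append(form["action"] + "?" + query_string)
    return urls

get_form = {
    "method": "get",
    "action": "http://example.com/search",
    "inputs": {"food": ["ravioli", "pizza"], "unit": ["cup"]},
}
post_form = dict(get_form, method="post")

for url in surface_urls(get_form):
    print(url)
print(surface_urls(post_form))
```

Once the URLs exist, the rest of the pipeline is the usual crawl, index, and serve infrastructure, which is the main attraction of this approach.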

The Kosmix Approach to the Deep Web

At Kosmix, we surface Deep Web content by using a combination of run-time integration and off-line indexing approaches. At the core of Kosmix technologies are (1) a sophisticated categorization engine that maps a query to the appropriate category; (2) a highly scalable fetching and run-time integration system that fetches data from various sources, integrates it, and provides a rich experience; and (3) an off-line crawling and indexing system that enables scalability.

For example, for a query like “Ravioli”, we show nutritional values from Fatsecret.com. Our categorization technology enables us to identify Ravioli as a food query, and enables us to surface Deep Web content from Fatsecret.com.

The Next Hurdle

While invaluable treasures are hiding behind the Deep Web, there are significant challenges to solving the problem of reaching this information. The next step for search engines will be to find an easier way to tap into the Deep Web, and to keep up with the Real Time Web.

My prediction? The Deep Web will force a drastic change in how traditional search engine systems are designed and built.

References:

[1]. Michael K. Bergman. White Paper: The Deep Web: Surfacing Hidden Value. http://brightplanet.com/index.php/white-papers/119.html, 2000.

[2]. Kevin Chen-chuan Chang, Bin He, Chengkai Li, Mitesh Patel, and Zhen Zhang. Structured Databases on the Web: Observations and Implications. In SIGMOD 2004.

[3]. Cazoodle. http://www.cazoodle.com/docs/Press_Kit.pdf, 2009.

[4]. Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Halevy. Google’s Deep Web crawl. In VLDB 2008.


mitul
May 1, 2009

Exploring Swine Flu and Other Unfamiliar Topics »

More and more folks are turning to Kosmix as a starting point for exploring an unfamiliar topic. A good example is the recent H1N1 influenza A (a.k.a. swine flu) outbreak. Since Saturday, when the news of the outbreak started heating up, our two sites, Kosmix and RightHealth, have seen more than 200,000 swine flu related searches. Top searches? Swine flu and swine flu symptoms.

The swine flu page brings together a diverse set of information which you may not have easily discovered through a basic web search.  The page includes basic facts about the virus, advisories from the Centers for Disease Control, the latest news and blog posts, videos, podcasts, groups and discussions, and an interactive outbreak map.  Also, you can drill down into any of those information types or related topics to deepen your exploration.

By the way, in case you haven’t heard, you can’t catch the swine flu from eating pork products.  So feel free to dig into that tempting platter of ribs.

To learn more about swine flu and keep on top of the latest news, check out the swine flu topic page and the RightHealth DailyDose blog by Dr Steven Chang.


tchu
March 11, 2009

MeeHive is Today’s Buzz »

Today, Kosmix is introducing a cool new site: MeeHive.


MeeHive is a personalized newspaper that brings you the latest about what’s happening in your world. You list the topics and issues you’re passionate about, and MeeHive will scour thousands of news outlets and millions of blogs to find stories about your interests. You can share articles with others, see what friends read, and even use your iPhone to get MeeHive on-the-go. It’s your news, your way.


So why did we build MeeHive? MeeHive grew out of an observation that it was difficult for people to get news about their interests in a single place. How was I going to get daily news about Project Runway, global warming, and UCLA basketball? I could search on the Internet every day for news about these topics, but that is clearly not scalable past 5 interests. I could set up an RSS reader, but I might have to wade through 100 articles from 10 blogs just to read 2 articles about the latest green device, amongst many other issues. I could set up individual news alerts for each interest, but the stories are not aggregated together. What we discovered was that each person needs a “personal news editor”. This personal editor knows about all the news related to your interests, and selects the most interesting news articles for you to read. Just take a look at my hive to see what stories my editor has picked for my 60+ interests at this point in time.


Everyone’s MeeHive newspaper is unique. My colleague Sesh is vastly different from me. Sesh is our CTO at Kosmix, and he’s very passionate about technology, venture capital, Indian cricket, and carnatic music. He also has over 30 other interests. Take a look at his hive as well.

As we were building MeeHive, we felt that in addition to the personal editor, we also wanted our friends to bring interesting articles to our attention, thus acting as additional editors for me. Sesh is always recommending articles with his (tongue in cheek) comments, and more often than not I have found myself caring more about an article when a friend has recommended it to me. For example, I found Sesh’s recommendation about the book industry interesting even though I don’t have a strong interest in Amazon. These article recommendations add an element of serendipity to my newspaper. Finally, we felt that the ability for people to look at others’ newspapers and draw inspiration for topics of interest they should be adding is very useful. I recently added my interest in Green Tea after seeing it in someone else’s hive.

So what other features will you find on MeeHive?
• Read more than just news. We also have editorials, blog posts, press releases, videos, tweets, and more!
• Receive your personalized top stories in a daily email or as an occasional tweet
• Add RSS feeds, for those publishers you like to follow


MeeHive has been in private beta for the past six months. We want to thank all our friends, family, and supporters for using the site and giving us very valuable feedback. We’re excited to open MeeHive up to everyone today. Check it out and let us know what you think!

tracy
January 16, 2009

Is One Search Engine Enough? Not By a Long Shot. »

I hear it all the time: the mistaken assumption that one search engine is all anyone needs to find information online. Only one way to find everything on the Web? Think again.

Today’s search landscape is far more fragmented than most people realize. Traditional search engines are only part of the story: the definition of a search engine has now expanded to include any website with a search box. Think of the way you shop for a book on Amazon, find an Italian restaurant in San Francisco on Yahoo Local, reconnect with your long lost high school pals on Facebook, or research the drug your doctor prescribed on RightHealth. Search engines are everywhere.

Because these different search engines have few connections to each other, users never see the vast majority of content that would be valuable to them for each query. No search engine—not even the big three—can surface every bit of useful content and present it in an easy-to-digest way. That’s the bad news. The good news is that search is going to have its revolution soon, which will completely change how we find information.

How will that revolution happen? The potential for innovation will cause a community to evolve around the search space, and search will become a platform.

The possibilities for innovation in search are huge. For example, someone will find a better way to let users specify their intent. Requested information will be returned in a more visual, intuitive layout. Results will be personalized better based on the user’s intent and the subject of the search. User interfaces will be tailored to the specific intent (Looking for videos?) or the more specific subject (Just want Health results?). Domain-specific experts will achieve a deeper understanding of both the content and the query, and will form richer connections between the two.

As the potential for innovation in search grows, the space will attract a much larger pool of entrepreneurs than any one company can ever contain. Whenever a technology matures, communities naturally form around it. Take personal computers, for example. While ATI and NVidia fight over graphics capability, Intel and AMD compete over the chip. Dozens of companies provide monitors, keyboards, memory, motherboards and every other aspect of the PC. And hundreds if not thousands of companies provide the different software applications.

Every widely-used technology has hundreds of companies involved in competing for the different parts that make up the whole, and the search space will be no different. Expect a flurry of increased funding for specific solutions, increased competition, and increased specialization. And the company that harnesses this momentum and gets companies engaged together in a search platform will find itself in the most enviable position.

What do we mean by a search platform? It will be an internet website that allows hundreds of search companies to provide specialized solutions to various search problems in a connected and integrated manner. For example, if you search for “Chocolate” the platform may connect you to a recipes search engine, a health search engine, and a shopping search engine allowing each to present specialized results with a richer interface than what today’s web search allows. It will allow each engine to innovate in its area of expertise and connect the three together in a meaningful way so the end user sees much richer results.

The platform creator will have the responsibility to match different search solutions to the user’s differing needs. What’s the best approach to achieve this? One option is to let the users choose from the various search providers and “install” the applications that best suit them. Another approach is to build a platform that understands the user and the different applications deeply, and can connect the user automatically to just the right application, at just the right time. My company, Kosmix, has built an early version of such a platform, which delivers the best of the Web by bringing together hundreds of content providers, aggregators, and niche search engines. We’re off to a good start, and this is just the beginning.

For search to be viable as a platform in the long-term, it must offer value to everyone involved: the platform creator, the application providers, and the users. The platform creator and application providers need to make money, obviously, while at the same time offering consumers high-quality, easily accessible content for free. This can be achieved in one of three ways: The Amazon Model, the eBay Model, or the Facebook model.

In the Amazon model, the platform creator owns the underlying economics and is responsible for sharing the benefits with solution providers. Think of the way Amazon.com receives payments for anything a customer buys on the site and then, in turn, pays the sellers on the Amazon marketplace.

In the eBay model, solution providers own the revenue and share a part of it with the platform creator. eBay does this by charging its sellers a percentage fee for all items sold on the marketplace.

In the Facebook model, the platform creator and the solution providers are independently responsible for their own benefits. Facebook gets an indirect benefit by making the experience on their site richer for their users, while allowing applications to display ads or generate traffic. In this case, the platform must offer opportunities for the owner to monetize his application in various ways. For example, the two may share traffic or share the advertising space between them.

All three of these models have merit, and it remains to be seen which direction a search platform will take. One thing remains clear: as search evolves it will become harder and harder for one company to do it all. A platform that connects hundreds of search engines together can become a powerful source of innovation allowing it to build a deeper and richer experience for its users. The innovation in search has only just started. We need millions of connected search engines.

digvijay
December 11, 2008

The future of search »

November 30th, 1900

Having read of the recent death of the Irish writer Oscar Wilde, a student at Harvard University is looking for information on his life and his achievements. How does he access this information? He looks at the library catalogue, skims through newspapers and magazines, and writes letters to his colleagues. After a few weeks of research he publishes a tribute to the life and works of Oscar Wilde.

August 5th, 2000

A century later, having read of the recent death of English actor and writer Sir Alec Guinness, a student at Harvard University is looking for information on his life and his achievements. The Wikipedia movement has not yet started and the researcher turns to the recently formed google.com search engine. After a few hours of searching for information, posting on bulletin boards and groups, and browsing websites dedicated to Sir Alec Guinness he is ready with a tribute to the life and works of Sir Alec Guinness.

The year 2100

What will search for information look like in the year 2100? If progress continues at the same rate as the last century, a student in 2100 should be able to go from the thought of writing the essay to the complete essay in a mindboggling 16 seconds. And the rate of progress is increasing.
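The post does not spell out the figures behind the 16-second estimate. One way to arrive at a number of that size is to assume that the speed-up achieved between 1900 and 2000 repeats once more by 2100. In the Python sketch below, the values of 3 weeks for “a few weeks” and 1.5 hours for “a few hours” are assumptions chosen for illustration, not figures from the text.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_WEEK = 7 * 24 * SECONDS_PER_HOUR

def extrapolate(time_then, time_now):
    """Apply the speed-up seen over the last century once more."""
    speedup = time_then / time_now
    return time_now / speedup

research_1900 = 3 * SECONDS_PER_WEEK      # assumed: "a few weeks"
research_2000 = 1.5 * SECONDS_PER_HOUR    # assumed: "a few hours"

research_2100 = extrapolate(research_1900, research_2000)
print(f"speed-up 1900 to 2000: {research_1900 / research_2000:.0f}x")
print(f"projected time in 2100: {research_2100:.0f} seconds")
```

Different readings of “a few weeks” and “a few hours” shift the result, but it stays in the range of seconds to a minute or so, which is the real point of the thought experiment.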

Even with the best of information, making predictions about the future is a messy business. We can, however, look at the trends and learn from them. What does search look like in 2100 to allow one to go from thought to essay in 16 seconds? I assume that a person will be able to specify his information requirement in some form and a machine will instantly create an answering essay tailored to his or her needs. Instead of presenting a set of links, this answer will be like an automated Wikipedia-like page which will contain not just objective encyclopedic information but also subjective views, statistics, and several other kinds of information. Further, it will be possible for you to specify the extent of information you need, the different aspects of the topic you need covered, the tone of content, the target audience, and several other features that a student would use to make his essay better.

So you can, for example, ask “Give me the list of symptoms of diabetes”, “What is the phone number of my local Wal-Mart?”, “Write a two-paragraph summary of the Harry Potter series”, “Write a two-page essay on the scientific basis of speech in apes as mentioned in the book Congo by Michael Crichton”, or “I recently heard about White Holes and want to learn more about the subject and related interesting things”. Current technology can come close to answering the first few questions, but it gets harder as the questions get more complex. An ideal information extraction system would not only answer all these questions but also tailor the answers to your needs.

This may sound like a far-off dream, but we are clearly moving in a direction where a machine will automatically create the perfect article that precisely and completely covers the searched topic.

A search engine of the future

While search engines like Google, Yahoo, and Microsoft Live handle the first few questions above, human-created content sites like Wikipedia are trying to do a better job with the later, more complex questions by writing answers to the most frequently asked ones. It is clear, however, that the system of the future will have to automate what Wikipedia is doing, go beyond it, and do so in several different ways in order to satisfy every user’s need.

Let us try to understand the basic structure of this hypothetical system. On one end we have the user’s query with some extended specification. On the other end we have an extremely large amount of available content.

The first step is for the system to understand the query better. We take the user’s question and determine which subjects it is about, what kinds of information the user wants, what tone the answering essay should have, and how extensive and deep the returned content should be. For the Congo question above, we know the user wants information on the book Congo and on the scientific basis of speech in apes. We also know he wants a two-page essay and is interested in more authoritative scientific sources.

The second step is to take all the available content and understand what subjects it discusses, what kinds of information it contains (encyclopedic, user discussions, scientific papers, etc.), and what kind of audience and tone it suits. So we may find sites that talk about the book Congo, about the speech capabilities of apes, about Michael Crichton, about apes in general, and so on. We will also be able to say whether a piece of content is scientific and has good references, is a discussion with opinions from several people, is a source of images or other media, or is a source of papers or journals.

The third step is to connect the subjects in the query with the subjects in the content. When doing this we need to work within the specified level of detail. The system will have to decide whether the article should be limited to a short note on the speech capabilities of apes and the truth behind the book Congo, or whether it should go into more detail on Michael Crichton and his style of writing, the story behind the book, information about apes in general, and so on. It will also have to decide whether to stick to scientific articles or to delve into opinions, videos, documentaries, and the like.

The final step is to use this information to write an article that is coherent, well organized, and easy to read. The system has to organize the content and the references and create relevant sections as a human would. The final article needs to have the right tone and style and be rich in content. It also needs to be the right length, anywhere from a one-word answer to a one-sentence answer to a multiple-page essay.
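To make the four steps a little more concrete, here is a toy sketch in Python. It is only an illustration under heavy assumptions: every name in it (QuerySpec, Document, understand_query, match, outline) is hypothetical, subjects are detected by plain keyword lookup, the content is assumed to be already labeled with subjects and a kind (step two), and the "essay" is just a bulleted outline. A real system would need far more than this.

```python
from dataclasses import dataclass


@dataclass
class QuerySpec:
    """Step 1 output: what the user is asking for (hypothetical structure)."""
    subjects: set[str]
    kinds: set[str]   # preferred kinds of content, e.g. {"scientific"}
    max_items: int    # crude stand-in for the requested extent


@dataclass
class Document:
    """Step 2 output: a piece of content already labeled with subjects and kind."""
    title: str
    subjects: set[str]
    kind: str


def understand_query(text: str, known_subjects: set[str], pages: int) -> QuerySpec:
    """Step 1: naive keyword lookup instead of real language understanding."""
    lowered = text.lower()
    subjects = {s for s in known_subjects if s in lowered}
    kinds = {"scientific"} if "scientific" in lowered else set()
    return QuerySpec(subjects, kinds, max_items=pages * 2)


def match(spec: QuerySpec, docs: list[Document]) -> list[Document]:
    """Step 3: keep documents that share subjects with the query, best first."""
    scored = []
    for doc in docs:
        overlap = len(spec.subjects & doc.subjects)
        if overlap == 0:
            continue  # nothing in common with the query
        if spec.kinds and doc.kind not in spec.kinds:
            continue  # wrong kind of source for this request
        scored.append((overlap, doc))
    # Highest overlap first; ties broken by title for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1].title))
    return [doc for _, doc in scored[: spec.max_items]]


def outline(docs: list[Document]) -> str:
    """Step 4 (greatly simplified): list the chosen sources as an outline."""
    return "\n".join(f"- {doc.title} ({doc.kind})" for doc in docs)


if __name__ == "__main__":
    docs = [
        Document("Ape vocalization study", {"apes", "speech"}, "scientific"),
        Document("Congo fan forum", {"congo"}, "discussion"),
        Document("Primate language review", {"apes", "speech", "congo"}, "scientific"),
    ]
    spec = understand_query(
        "Write a two page essay on the scientific basis of speech in apes "
        "as mentioned in the book Congo",
        known_subjects={"congo", "apes", "speech"},
        pages=2,
    )
    print(outline(match(spec, docs)))
```

Even in this toy form, the request for scientific sources filters out the discussion forum, and the document that covers the most query subjects comes first.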

The search engine of today

Where do we stand today? Google, Yahoo, and other search engines are getting better at the one-word and one-sentence answers. Wikipedia, About.com, and other editorial content sites are trying to pre-create answers to as many questions as they can. The Semantic Web, online taxonomies, and other efforts are working on making content richer so that it is easier for machines to understand. All of these need to come together in completely new ways over the next century to achieve the vision outlined above.

My company, Kosmix.com, recently launched a product that I believe is another small step in this journey. Our algorithms are trying to answer the questions that require you to write an essay or a term paper, or to explore a topic in detail. They first figure out what subjects the query is interested in. They determine the various intents that the user can have. They look at all the content available on the web to understand the subject of the content, the type of information it represents, etc. The algorithms then make the connections between the various intents of the query and the available content to figure out what the best content for you is. They then organize the chosen content into sections that are meaningful and easy to understand, order the sections with the most relevant content at the top, and summarize the information correctly.
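The last part of that description, grouping the chosen content into sections and putting the most relevant section first, can be pictured with a small Python sketch. This is not Kosmix’s actual code: the function name organize, the section labels, and the relevance scores are all made up for illustration, and it assumes each item already arrives with a section label and a numeric relevance score.

```python
from collections import defaultdict


def organize(items):
    """Group (section, title, relevance) items into sections.

    Items inside a section are sorted by relevance, and sections are ordered
    by the relevance of their best item. All inputs here are hypothetical.
    """
    sections = defaultdict(list)
    for section, title, relevance in items:
        sections[section].append((relevance, title))

    ranked = []
    for section, entries in sections.items():
        entries.sort(key=lambda entry: -entry[0])  # most relevant item first
        best_score = entries[0][0]
        ranked.append((best_score, section, [title for _, title in entries]))

    ranked.sort(key=lambda row: -row[0])  # section with the best item on top
    return [(section, titles) for _, section, titles in ranked]


if __name__ == "__main__":
    page = organize([
        ("Videos", "Doc A", 0.4),
        ("Reference", "Doc B", 0.9),
        ("Reference", "Doc C", 0.7),
        ("Videos", "Doc D", 0.6),
    ])
    for section, titles in page:
        print(section, titles)
```

Ranking a section by its single best item is only one possible choice; an average or a sum over the items would order the page differently.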

The hope is that Kosmix can present several different perspectives on any topic, can reach those hard-to-find rare gems, and can find interesting and surprising relationships for you to explore. In the end we hope that Kosmix can help you answer the really complex questions that require you to explore a topic in detail.

It is a very early attempt at the technology of the future. We are trying to help you write that essay, explore that topic, or simply browse the web by following interesting and surprising connections. We are clearly not competing with other search engines for the one-word and one-sentence answers. Instead, we are trying to help you explore and discover.

How well do we do? I am proud and surprised at how far we have come. Of course, we have a long way to go and each incremental piece of content, improved categorization, and better organization takes us closer to our goal.

Some day we hope to write that essay for you!

digvijay
June 30, 2008

Why MeeHive Should Be On Your Radar »

By: Nicky.

In my last blog post, I detailed in a rather cryptic fashion the concept of a ‘Personalized News Dial Tone.’ I explained that in much the same way that your phone instantly connects you to the people in your life, Kosmix is working on a product that instantly connects you to all of your news interests that change around you every moment of the day.

Since it’s Monday morning and I have 2.5 cups of coffee (read: personality) running blissfully through my veins, now seems the perfect time to tell you a bit more about what we’ve been doing with this product – which we’ve named MeeHive.

Why MeeHive? Well, think of a Bee Hive, a place that is so full of frenzied activity that it literally buzzes, and then imagine that we offered you your very own hive where you could collect stories that interest you. That would make you buzz, wouldn’t it?

We debuted MeeHive at last month’s Under the Radar conference held at the Microsoft Campus in Mountain View, CA. Under the Radar is dedicated to showcasing the industry’s up-and-coming players – the startups who are developing some of the freshest and most creative products out there.

Sesh, our fearless CTO, presented MeeHive as part of the ‘Graduate Circle,’ a forum for established companies like Kosmix to discuss how they got to be where they are and how they are continuing to innovate.

During his well-received presentation, Sesh described how in a world of ‘pull’ models, where you search the web high and low to get the information you want delivered to you, MeeHive is a news ‘push’ model – delivering fresh information to you all the time so that you don’t have to go looking for it.

He noted that with MeeHive’s ability to leverage the Kosmix user base and deliver uber-relevant results, it is well-positioned for success. Of course, we know that getting MeeHive to where we want it is a marathon, not a sprint, so we’ll be spending the summer building the most robust product we can in preparation for a launch not too far down the road. In the meantime, sign up for our beta and we’ll keep you posted on developments.

admin