Archive for December 2008

December 11, 2008

The future of search »

November 30th, 1900

Having read of the recent death of the Irish writer Oscar Wilde, a student at Harvard University is looking for information on his life and his achievements. How does he access this information? He looks at the library catalogue, skims through newspapers and magazines, and writes letters to his colleagues. After a few weeks of research he publishes a tribute to the life and works of Oscar Wilde.

August 5th, 2000

A century later, having read of the recent death of the English actor and writer Sir Alec Guinness, a student at Harvard University is looking for information on his life and his achievements. The Wikipedia movement has not yet started, so the researcher turns to the recently formed google.com search engine. After a few hours of searching for information, posting on bulletin boards and groups, and browsing websites dedicated to Sir Alec Guinness, he is ready with a tribute to the life and works of Sir Alec Guinness.

The year 2100

What will searching for information look like in the year 2100? If progress continues at the same rate as over the last century, a student in 2100 should be able to go from the thought of writing the essay to the complete essay in a mind-boggling 16 seconds. And the rate of progress is increasing.
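The exact figure depends on what "a few weeks" and "a few hours" mean, which is not spelled out above. As a rough sketch, assume the student of 1900 needed about five weeks and the student of 2000 about two hours (both numbers are illustrative assumptions, not measurements). Applying the same speed-up one more time lands in the same range as the 16 seconds quoted:

```python
# Illustrative assumptions, not figures taken from the post.
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

time_1900 = 5 * 7 * SECONDS_PER_DAY  # "a few weeks", assumed to be 5 weeks
time_2000 = 2 * SECONDS_PER_HOUR     # "a few hours", assumed to be 2 hours


def extrapolate(earlier: float, later: float) -> float:
    """Apply the speed-up observed between two eras one more time."""
    speedup = earlier / later
    return later / speedup


time_2100 = extrapolate(time_1900, time_2000)
print(f"speed-up per century: {time_1900 / time_2000:.0f}x")
print(f"projected time in 2100: {time_2100:.1f} seconds")
```

With these inputs the speed-up is 420x per century and the projection comes out at roughly 17 seconds. Different readings of "a few" move the answer by a factor of several, so the number is best treated as an order of magnitude.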

Even with the best of information, making predictions about the future is a messy business. We can, however, look at the trends and learn from them. What does search look like in 2100 to allow one to go from thought to essay in 16 seconds? I assume that a person will be able to specify his information requirement in some form and a machine will instantly create an answering essay tailored to his or her needs. Instead of presenting a set of links, this answer will be like an automated Wikipedia-like page which will contain not just objective encyclopedic information but also subjective views, statistics, and several other kinds of information. Further, it will be possible for you to specify the extent of information you need, the different aspects of the topic you need covered, the tone of content, the target audience, and several other features that a student would use to make his essay better.

So you can, for example, ask “Give me the list of symptoms of diabetes”, “What is the phone number of my local Wal-Mart?”, “Write a two-paragraph summary of the Harry Potter series”, “Write a two-page essay on the scientific basis of speech in apes as mentioned in the book Congo by Michael Crichton”, or “I recently heard about White Holes and want to learn more about the subject and related interesting things”. Current technology can come close to answering the first few questions, but it gets harder as the questions get more complex. An ideal information extraction system would not only be able to answer all these questions but would also be able to tailor the answers to your needs.

This may sound like a far off dream but we are clearly moving in a direction where a machine will automatically create the perfect article that precisely and completely covers the searched topic.

A search engine of the future

While search engines like Google, Yahoo, and Microsoft Live solve the first few questions above, human-created content sites like Wikipedia are trying to do a better job with the latter, more complex questions by writing the most asked-for answers. It is, however, clear that the system of the future will have to automate what Wikipedia is doing, and more, and do it in several different ways in order to satisfy every user’s need.

Let us try to understand the basic structure of this hypothetical system. On one end we have the user’s query with some extended specification. On the other end we have an extremely large amount of available content.

The first step this system needs to accomplish is to understand the query better. So we take the user’s question and determine which subjects the query is interested in, what kinds of information the user wants, what tone the answering essay should have, and what the extent and depth of the returned content should be. For the Congo question above, we know the user wants information on the book Congo and on the scientific basis of speech in apes. We also know he wants a two-page essay and is interested in more authoritative scientific sources.

The second step is to take all the available content and understand what its subjects of discussion are, what kinds of information it contains (encyclopedic, user discussions, scientific papers etc.), what kind of audience and tone it is relevant for, etc. So we may find sites that talk about the book Congo, about the speech capabilities of Apes, about Michael Crichton, about Apes in general, etc. We will also be able to say if the information is scientific and has good references, is a discussion with opinions from several people, is a source of images or other media, or is a source of papers or journals etc.

The third step is to connect the subjects in the query with the subjects in the content. When doing this we need to work within the specified level of detail. So the system will decide whether the article should be limited to a short note on the speech capabilities of apes and the truth behind the book Congo, or whether it should go into more detail on Michael Crichton and his style of writing, the story behind the book Congo, information about apes in general, etc. It will also have to decide whether it should stick to scientific articles or delve into opinions, look at videos and documentaries, etc.

The final step is to use this information to write an article that is coherent, well organized, and easy to read. The system has to organize the content and the references and create relevant sections as a human would. The final article needs to have the right tone, have the right style, be rich in content, and be well organized. It also needs to be the right length, anywhere from a one-word answer to a one-sentence answer to a multiple-page essay.
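To make the four steps concrete, here is a toy sketch in Python. Everything in it is a hypothetical stand-in: the class and function names are invented for illustration, the query is understood by plain substring matching, the content arrives already labeled (so step two is assumed to be done), and the "writing" step only joins passages together. A real system would need far more than this at every stage.

```python
from dataclasses import dataclass


@dataclass
class QuerySpec:
    subjects: set[str]   # what the query is about
    kinds: set[str]      # acceptable kinds of source, e.g. {"scientific"}
    max_passages: int    # crude stand-in for "extent and depth"


@dataclass
class Document:
    text: str
    subjects: set[str]   # step 2 output: subjects the content discusses
    kind: str            # step 2 output: "scientific", "discussion", etc.


def understand_query(question: str, known_subjects: set[str],
                     kinds: set[str], max_passages: int) -> QuerySpec:
    """Step 1: find which known subjects the question mentions."""
    lowered = question.lower()
    subjects = {s for s in known_subjects if s in lowered}
    return QuerySpec(subjects, kinds, max_passages)


def select_content(spec: QuerySpec, corpus: list[Document]) -> list[Document]:
    """Step 3: keep documents of an acceptable kind that share a subject
    with the query, best overlap first, within the requested extent."""
    matches = [d for d in corpus
               if d.kind in spec.kinds and d.subjects & spec.subjects]
    matches.sort(key=lambda d: len(d.subjects & spec.subjects), reverse=True)
    return matches[:spec.max_passages]


def write_article(docs: list[Document]) -> str:
    """Step 4: placeholder for real text generation; it only joins passages."""
    return "\n\n".join(d.text for d in docs)


# Made-up sample content, already labeled as step 2 would label it.
corpus = [
    Document("Studies of sign language in great apes.", {"apes", "speech"}, "scientific"),
    Document("Fans debate the plot of Congo.", {"congo"}, "discussion"),
    Document("Background on the novel Congo and its apes.", {"congo", "apes"}, "scientific"),
]
spec = understand_query(
    "Write a two page essay on the scientific basis of speech in apes "
    "as mentioned in the book Congo",
    known_subjects={"apes", "speech", "congo", "diabetes"},
    kinds={"scientific"},
    max_passages=2,
)
article = write_article(select_content(spec, corpus))
print(article)
```

The fan discussion is filtered out because the query asked for scientific sources, which mirrors the choice described in the third step. The hard parts (labeling content automatically and writing coherent prose) are exactly the parts this sketch leaves as placeholders.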

The search engine of today

Where do we stand as of today? Google, Yahoo, and other search engines are getting better at the one word and the one sentence answers. Wikipedia, About.com, and other editorial content sites are trying to pre-create answers to as many questions as they can. Semantic web, online taxonomies, and other efforts are working on making the content richer so it is easier for machines to understand it. All of these need to come together in completely new ways over the next century to achieve the vision outlined above.

My company, Kosmix.com, recently launched a product that I believe is another small step in this journey. Our algorithms are trying to answer the questions which require you to write an essay, a term paper, or explore a topic in detail. They first figure out what subjects the query is interested in. They determine the various intents that the user can have. They look at all the content available on the web to understand the subject of the content, the type of information it represents, etc. The algorithms then make the connections between the various intents of the query and the available content to figure out what the best content for you is. They then organize the chosen content into sections that are meaningful and easy to understand, order the sections with the most relevant content at the top, and summarize the information correctly.

The hope is that Kosmix can present several different perspectives to any topic, can reach those hard to find rare gems for any topic, and can find interesting and surprising relationships for you to explore. In the end we hope that Kosmix can help you answer the really complex questions which require you to explore a topic in detail.

It is a very early and nascent attempt at the technology of the future. We are trying to help you write that essay, explore that topic, or simply browse the web by following interesting and surprising connections. We are clearly not competing with other search engines for the one word and the one sentence answer. Instead, we are trying to help you explore and discover.

How well do we do? I am proud and surprised at how far we have come. Of course, we have a long way to go and each incremental piece of content, improved categorization, and better organization takes us closer to our goal.

Some day we hope to write that essay for you!

digvijay
December 8, 2008

Kosmix goes Beta (-ish) »

We’re excited to announce the launch of a beta, or beta-ish, version of our product today! At the same time, we’re also pleased to announce some additional funding. You should check out Anand’s post, which focuses on the funding and the vision, here. I’ll try to go into some more detail around the improvements in the product.

“What’s new?” you may ask. Here are some of the most interesting features:

•   New homepage: When you search for a topic, we organize the best of the web around the topic. So what should we do on the homepage, when you haven’t yet entered a topic? Well, we thought we’d organize the hottest items on the web across every topic. With the best of news, videos, images, and shopping all in one place, what better starting page could you ask for? For good measure, we also threw in the hottest topics we’re mining from our news and blog corpus. We hope that you will find this new homepage to be a great way to start your day.

•   Improved Relevance: To provide a better experience around searching for topics, we now sort the different kinds of information we show you based on their relevance to the topic. For instance, you’ll see shopping right at the top for a query like http://www.kosmix.com/topic/ipod_nano, whereas a query like http://www.kosmix.com/topic/Sarah_Palin will have news and videos at the top. Similarly, queries around music will bring audio to the top, food and recipe queries will bring recipes to the forefront, stock tickers will bring up stock charts, travel destinations will show you maps near the top, and so on.

•   Disambiguation: We have taken our first steps towards disambiguating between different intents for a query. For example, you can choose between Cobalt the element, the color and the car by clicking on the choices in the menu at the top: http://www.kosmix.com/topic/Cobalt. This feature is still in its infancy, and we hope to roll out significant improvements in the coming months. Similarly, searching for Tahoe lets you disambiguate between Lake Tahoe and Chevy Tahoe.

•   “At a Glance” & “Topic Highlight”: To give you quick ways to digest a lot of information, we now summarize the topic in the all-new “at a glance” section. Also, whenever we can, we try to show you a “topic highlight”, which is our best guess at the most relevant content for that topic. For example, check out the topic highlight on http://www.kosmix.com/topic/saturday_night_live

•   More pathways to explore and browse: We’ve provided you easier ways to navigate to related topics. You can see this by looking at the “Related Material” section on the page http://www.kosmix.com/topic/Stonehenge – you’ll see useful content “from topic: Bath, Somerset”. We have also added a “preview” of the related topics that appear in the “Related in the Kosmos” section to make it easier for you to find related topics that are just right for you  – just click on the magnifying glass logos next to any of the topics in the Kosmos and you’ll get a quick preview of what the related topic is about.

•   Lots more content: We’re excited about having added hundreds of new content sources, and are thankful to many new partners who’ve given us access to their APIs, feeds or widgets. With this, you’ll start seeing a lot more niche sites on Kosmix searches for topics.
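For the technically curious, the relevance ordering and disambiguation features above can be pictured with a small sketch. This is not our production code: the weight table, the sense list, the category names, and the function names are all made up for illustration, and a real system would have to learn or curate them.

```python
# Hypothetical relevance weights: how useful each kind of section is for
# each topic category. The values are invented for illustration.
SECTION_WEIGHTS = {
    "product": {"shopping": 0.9, "videos": 0.5, "news": 0.4},
    "politician": {"news": 0.9, "videos": 0.8, "shopping": 0.1},
    "musician": {"audio": 0.9, "videos": 0.7, "news": 0.5},
}

# Hypothetical sense inventory for ambiguous topics.
SENSES = {
    "cobalt": ["Cobalt (element)", "Cobalt (color)", "Cobalt (car)"],
    "tahoe": ["Lake Tahoe", "Chevy Tahoe"],
}


def order_sections(category: str, available: list[str]) -> list[str]:
    """Sort the available sections by their weight for this category,
    highest first; sections with no weight sink to the bottom."""
    weights = SECTION_WEIGHTS.get(category, {})
    return sorted(available, key=lambda s: weights.get(s, 0.0), reverse=True)


def disambiguate(topic: str) -> list[str]:
    """Return the senses to offer in a menu, or the topic itself if it
    is not known to be ambiguous."""
    return SENSES.get(topic.lower(), [topic])


print(order_sections("product", ["news", "videos", "shopping"]))
print(order_sections("politician", ["shopping", "news", "videos"]))
print(disambiguate("Tahoe"))
```

With this table, a product topic puts shopping first while a politician topic puts news and videos first, and "Tahoe" offers both the lake and the car.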

In addition to all these, we also make it easier for you to bookmark and share any topic you like. Simply click the bookmark/share link on the top/right of the page, and email the page to your friends or bookmark it on any of your favorite sites.

We’re hard people to please, and we realize this product still has a long way to go (hence the beta-ish label).  Keep watching out for many more improvements to come soon and keep sending us your valuable feedback.

vijay
December 8, 2008

Kosmix Adds Rocketfuel to Power Voyage of Exploration »

By: Anand Rajaraman

Scootin'

Today I’m delighted to share some fantastic news. Kosmix has raised $20 million in new financing to power our growth. Even more than the amount of financing, I’m especially proud that the lead investor in this round is Time Warner, the world’s largest media company. Our existing investors Lightspeed, Accel, and DAG participated in the round as well. The Kosmix team also is greatly strengthened by the addition of Ed Zander as investor and strategic advisor. In an amazing career that spans Sun Microsystems and Motorola, Ed has repeatedly demonstrated leadership that grew good ideas into great products and businesses. His counsel will be invaluable as we take Kosmix to the next level as a business.

In these perilous economic times, the funding is a big vote of confidence in Kosmix’s product and business. Kosmix web sites attract 11 million visits every month, and we have a proven revenue model with significant revenues and robust growth. RightHealth, the proof-of-concept we launched in 2007, grew with astonishing rapidity to become the #2 health web site in the US. These factors played a big role in helping us close this round of funding with a healthy uptick in valuation from our prior round. Together with the money already in the bank from our prior rounds, we now have more than enough runway to take the company to profitability and beyond.

A few months ago, we put out an alpha version of Kosmix.com. Many people used it and gave us valuable feedback; thank you! We listened, and made changes. Lots of changes. The result is the beta version of Kosmix.com, which we launched today. What’s changed? More information sources (many thousands), huge improvements in our relevance algorithms, a much-improved user interface, and a completely new homepage. Give it a whirl and let us know what you think.

For those of you new to Kosmix, the easiest way to explain what Kosmix does is by analogy. Google and Yahoo are search engines; Kosmix is an explore engine. Search engines work really well if your goal is to find a specific piece of information, such as a train schedule or a company website. In other words, they are great at finding needles in the haystack. When you’re looking for a single fact, a single definitive web page, or the answer to a specific question, then the needle-in-haystack search engine model works really well. Where it breaks down is when the objective is to learn about, explore, or understand a broad topic. For example:

– Looking to bake a chocolate cake? We have recipes, nutrition information, a dessert burn rate calculator, blog posts from chow.com, even a how-to video from Martha Stewart.

– Loved one diagnosed with diabetes? Doctor-reviewed guide, blood sugar and insulin pump slide shows, calculators and risk checkers, quizzes, alternative medications, community.

– Traveling to San Francisco? Maps, hotels, events, sports teams, attractions, travel blogs, trip plans, guidebooks, videos!

– Writing an article on Hillary Clinton? Bio, news, CNN videos, personal financial assets and lawmaker stats, Wonkette posts, even satire from The Onion.

– Into Radiohead? Bio, team members, albums, tracks, music player, concert schedule, videos, similar artists, news and gossip from TMZ.

– Follow the San Francisco 49ers? Players, news from Yahoo Sports and other sources, official NFL videos and team profiles, tickets, and the official NFL standings widget.

In the examples above, I’m especially pleased about the way Kosmix picks great niche sources for topics. For example, I hadn’t heard about chow.com or known that Martha Stewart has how-to videos on her website. Other “gems” of this kind include Jambase, TMZ, The Onion, DailyPlate, MamaHerb, and Wonkette. Part of the goal of Kosmix is to bring you such gems: information sources or sites you have either not heard of, or just not thought about in the current context.

In other words: Google = Search + Find. Kosmix = Explore + Browse. Browsing sometimes uncovers surprising connections that you might not even have thought about. The power of the model was brought home to me last week as I was traveling around in England. I’d heard a lot about Stonehenge and wanted to visit; so of course I went to the Kosmix topic page on Stonehenge. In addition to the usual comprehensive overview of Stonehenge, the topic page showed me places to stay in Bath, Somerset (which happens to be the best place to stay when you’re visiting Stonehenge). It also showed me other ancient monuments in the same area I could visit while I was there. Score one for serendipity.

Some of us remember the early days of the World Wide Web: the thrill of just browsing around, following links, and discovering new sites that surprise, entertain, and sometimes even inform. We have lost some of that joy now with our workmanlike use of search engines for precision-guided information finding. We built the new Kosmix homepage to capture some of the pleasure of aimless browsing: exploring for pure pleasure. The homepage shows you the hot news, topics, videos, slide shows, and gossip of the moment. If you find something interesting you can dive right in and start browsing around that topic. We compile this page in the same manner as our topic pages: by aggregating information from many other sources and then applying a healthy dose of algorithms. Dig in; who knows what surprises await?

How does Kosmix work its magic? As I wrote when we put out the alpha, the problem we’re solving is fundamentally different from search, and we’ve taken a fundamentally different approach. The web has evolved from a collection of documents that neatly fit in a search engine index to a collection of rich interactive applications such as Facebook, MySpace, YouTube, and Yelp. Instead of serving results from an index, Kosmix builds topic pages by querying these applications and assembling the results on the fly into a 2-dimensional grid. We have partnered with many of the services that appear in the results pages, and use publicly available APIs in other cases. The secret sauce is our algorithmic categorization technology. Given a topic, categorization tells us where the topic fits in a really big taxonomy, what the related topics are, and so on. In turn, other algorithms use this information to figure out the right set of information sources for a topic from among the thousands we know about. And then other algorithms figure out how to lay the information out on the page in a 2-dimensional grid.
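The flow described above (categorize the topic, pick sources, lay out a grid) can be pictured with a deliberately tiny sketch. It is an illustration only, not Kosmix's actual implementation: the taxonomy, the source registry, and the function names are all hypothetical, and the step where the chosen sources are queried live through their APIs is left out entirely.

```python
# Hypothetical taxonomy: topic -> path from broad to narrow category.
TAXONOMY = {
    "stonehenge": ["travel", "landmarks"],
    "radiohead": ["entertainment", "music"],
}

# Hypothetical registry: which categories each information source suits.
SOURCES = {
    "maps": {"travel"},
    "hotels": {"travel"},
    "music player": {"music"},
    "concert schedule": {"music"},
    "videos": {"travel", "music", "entertainment"},
    "news": {"entertainment"},
}


def categorize(topic: str) -> list[str]:
    """Place a topic in the taxonomy; unknown topics get no categories."""
    return TAXONOMY.get(topic.lower(), [])


def pick_sources(categories: list[str]) -> list[str]:
    """Choose every source registered for at least one of the categories."""
    wanted = set(categories)
    return [name for name, cats in SOURCES.items() if cats & wanted]


def lay_out(modules: list[str], columns: int = 2) -> list[list[str]]:
    """Arrange modules row by row into a grid with the given column count."""
    return [modules[i:i + columns] for i in range(0, len(modules), columns)]


grid = lay_out(pick_sources(categorize("Radiohead")))
for row in grid:
    print(row)
```

For "Radiohead" this yields a two-by-two grid with the music player and concert schedule on the first row and videos and news on the second. The real difficulty sits in the categorization step, which here is reduced to a dictionary lookup.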

While we are proud of what we have built, we know there is still a long way to go. And we cannot do it without your feedback. So join the USS Kosmix on our maiden voyage. Our mission: to explore strange new topics; to discover surprising new connections; to boldly go where no search engine has gone before!

saumil