Archive for September 2009

September 30, 2009

Organizing the Web around Concepts »

During the initial days of the web, directories like Yahoo manually organized the web so that people could find relevant information. As the web grew in size and search engine technology evolved, search engines like Google became the main way to query the web. Today, the next wave is making web navigation easier by reorganizing the Internet by topic or concept. An increasingly meaningful web (which may lead to the Semantic Web) is being built around concepts by efforts such as Freebase, Google Squared, DBLife, and Kosmix topic pages. At Kosmix, we’re often asked about the technical philosophy driving this change. Here is a brief overview for the geeks among us.

To start with, what do we mean by concepts? A concept is loosely defined as a set of keywords of interest, for example, the name of a restaurant, cuisine, event, name of a movie, etc. There are various websites tailored to a particular kind of concept such as Yelp for restaurants (e.g., Amarin Thai), IMDB for movies (e.g., The Shawshank Redemption), LinkedIn for professional people, Last.fm for music (e.g., U2), etc.

Why should one care about organizing the web around concepts? There are three main kinds of web pages: search pages, topic/concept pages, and articles. Organizing the web around concepts can benefit each one of them.

Search pages. A search results page for a given query consists of various relevant links with snippets, for example, Google search results pages on “Erykah Badu”. Web data around concepts can improve search results in two ways. First, a search page can show a bunch of concepts related to the query, and their relationships to the query. This will help in further refining the query, and enable exploration of concepts related to the query.  Second, a search page can promote the concept page result for a concept closely matching the query.

Concept/Topic pages. A topic page or concept page organizes information around a concept, for example consider this music artist page on “Erykah Badu”. Such pages can utilize attributes of concepts, and show content related to the concept and its attributes, such as, albums, music videos, songs listing, album reviews, concerts, etc.

Articles. Articles can put semantic links to the concepts present in the article, and promote exploration of concepts present in the article, for example, this page on oil prices.

Given so many benefits of arranging the web around concepts, how can we achieve that? Some of the ways to arrange the web around concepts are as follows.

1. Editorial: An editor can pick a set of concepts of interest, create attributes for those concepts, and organize the data around them. Many sites like IMDB (for movies) have taken this approach. This approach yields high-quality content, but it does not scale in terms of the number of concepts.

2. Community: Many sites such as Wikipedia and Yelp have taken this approach, in which a community of users picks concepts, creates the attributes of the concepts, and organizes the data around the concepts. This process scales as the user community grows, but such a community is hard to build, the approach is susceptible to spam, and its scale is still limited. For example, Wikipedia’s large user base has grown it to millions of concepts, but its size is still far from the scale of the web.

3. Algorithmic approach: One way to organize the web around concepts is to mine the web for concepts and their attributes, and link data with concepts. This approach is the most promising in terms of scaling to the size of the web. The steps in this approach are (a) Concept Extraction, (b) Relationship mining, and (c) Linking data with concepts.

(a) Concept Extraction. There are two main methods for concept extraction from web pages, site-specific and category-specific.

In the site-specific method, the structure or semantics of a site is used to extract concepts. Many websites generate HTML pages from databases through a program, and such pages have a similar structure. One can write site-specific rules, or wrappers, to extract interesting data from such pages, but writing wrappers is a labor-intensive task. Kushmerick et al. [1] proposed the wrapper induction technique to learn wrapper procedures automatically from samples of such web pages. More recent work by Dalvi et al. [2] extends the wrapper induction technique to dynamic web pages. Another site-specific method is to use natural language processing to understand the semantics of web pages and mine concepts from them.
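To make the wrapper idea concrete, here is a minimal sketch in the spirit of a left/right delimiter wrapper. The function name `lr_extract`, the sample page, and the delimiter pairs are all illustrative assumptions, not taken from the cited papers. A wrapper induction system would learn the delimiters from labeled sample pages rather than have them written by hand as they are here.

```python
from typing import List, Tuple


def lr_extract(page: str, delimiters: List[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Extract one tuple per record by scanning for (left, right) delimiter pairs.

    Each pair marks the text immediately before and after one attribute.
    Scanning stops as soon as any delimiter can no longer be found.
    """
    rows: List[Tuple[str, ...]] = []
    if not delimiters:
        return rows
    pos = 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows
            row.append(page[start:end].strip())
            pos = end + len(right)
        rows.append(tuple(row))


# Hypothetical template-generated page: name in <b>, cuisine in <i>.
SAMPLE_PAGE = (
    "<ul>"
    "<li><b>Amarin Thai</b> <i>Thai</i></li>"
    "<li><b>Example Bistro</b> <i>French</i></li>"
    "</ul>"
)

# Hand-written wrapper for this made-up site layout.
RESTAURANT_WRAPPER = [("<b>", "</b>"), ("<i>", "</i>")]

results = lr_extract(SAMPLE_PAGE, RESTAURANT_WRAPPER)
print(results)
```

The wrapper only works while the site keeps the same template, which is one reason hand-written wrappers are costly to maintain.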

In the category-specific method, web pages are classified into categories such as restaurants, shopping, and movies, and category-specific extraction rules are applied. For example, extract menu, reviews, cuisine, and location for restaurants; price, reviews, and ratings for the shopping category; and actors, director, and ratings for movies. This method is more scalable in terms of the number of web pages than the site-specific method, but it is slightly more error-prone, since classification and extraction errors accumulate.
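The two stages of the category-specific method can be sketched as follows. This is a toy illustration only: the keyword sets, the regular expressions, and the function names `classify` and `extract` are assumptions made for the example, and a real system would use a trained classifier and far richer rules.

```python
import re
from typing import Dict, Optional

# Hypothetical keyword sets used to guess a page's category.
CATEGORY_KEYWORDS = {
    "restaurants": {"menu", "cuisine", "reservation"},
    "movies": {"director", "cast", "runtime"},
}

# Hypothetical per-category extraction rules (attribute -> pattern).
EXTRACTION_RULES = {
    "restaurants": {
        "cuisine": re.compile(r"Cuisine:\s*([^\n]+)"),
        "location": re.compile(r"Location:\s*([^\n]+)"),
    },
    "movies": {
        "director": re.compile(r"Director:\s*([^\n]+)"),
        "cast": re.compile(r"Cast:\s*([^\n]+)"),
    },
}


def classify(text: str) -> Optional[str]:
    """Pick the category whose keywords overlap most with the page text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def extract(text: str) -> Dict[str, str]:
    """Classify the page, then apply only that category's rules."""
    category = classify(text)
    if category is None:
        return {}
    record = {"category": category}
    for attr, pattern in EXTRACTION_RULES[category].items():
        match = pattern.search(text)
        if match:
            record[attr] = match.group(1).strip()
    return record


movie_page = (
    "The Shawshank Redemption\n"
    "Director: Frank Darabont\n"
    "Cast: Tim Robbins, Morgan Freeman\n"
)
print(extract(movie_page))
```

Because extraction runs only after classification, a misclassified page gets the wrong rules applied, which is how the two kinds of error compound.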

(b) Relationship mining. After extracting interesting concepts, one needs to match them with concepts in the database to create attributes, to grow concepts, and to find relationships between concepts. Some web databases like Freebase provide a substantial number of relationships between Wikipedia concepts.
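The matching part of this step can be sketched as below, assuming a deliberately naive rule: two names refer to the same concept if they are equal after lowercasing and stripping punctuation. The helper names `normalize` and `merge_concept` and the in-memory dictionary standing in for the concept database are hypothetical; real matching has to cope with aliases, ambiguity, and conflicting values.

```python
import re
from typing import Dict


def normalize(name: str) -> str:
    """Lowercase a name and collapse punctuation and spacing to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def merge_concept(db: Dict[str, dict], name: str, attributes: Dict[str, str]) -> str:
    """Match an extracted concept against the database and grow its attributes.

    A new entry is created when no existing concept has the same normalized
    name. Existing attribute values are kept; only missing ones are added.
    """
    key = normalize(name)
    entry = db.setdefault(key, {"name": name, "attributes": {}})
    for attr, value in attributes.items():
        entry["attributes"].setdefault(attr, value)
    return key


# Toy in-memory stand-in for a concept database.
db: Dict[str, dict] = {}
merge_concept(db, "The Shawshank Redemption", {"director": "Frank Darabont"})
merge_concept(db, "the shawshank redemption!", {"genre": "drama"})
print(db)
```

Both extractions land on the same entry, so the concept ends up with the director from one page and the genre from another.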

(c) Linking data with concepts. As mentioned earlier, organizing the web around concepts can improve the experience of search pages, topic pages, and article pages by linking them with concepts.
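For article pages, one simple way to picture the linking step is a dictionary lookup over the text that prefers the longest matching concept name. The sketch below assumes a hypothetical mapping from concept names to identifiers and an ad hoc `[text|id]` link notation; neither reflects any particular system.

```python
import re
from typing import Dict


def link_concepts(text: str, concepts: Dict[str, str]) -> str:
    """Wrap each mention of a known concept as [mention|concept-id].

    Longer names are tried first, so "oil prices" wins over "oil".
    """
    if not concepts:
        return text
    names = sorted(concepts, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    lookup = {name.lower(): cid for name, cid in concepts.items()}
    return pattern.sub(
        lambda m: f"[{m.group(0)}|{lookup[m.group(0).lower()]}]", text
    )


# Hypothetical concept dictionary and article sentence.
concepts = {"oil": "concept/oil", "oil prices": "concept/oil-prices"}
article = "Oil prices rose as oil supply tightened."
linked = link_concepts(article, concepts)
print(linked)
```

Plain string matching like this cannot tell apart different senses of the same word, so a practical linker would also need some form of disambiguation.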

The algorithmic approach to organizing the web around concepts is somewhat error-prone, though it improves as the algorithms for each step improve. However, it is the most promising approach in terms of scaling to the enormous web that exists.

In short, organizing the web around concepts is a promising area and a stepping stone toward bringing meaning to web data.

References

[1] Nicholas Kushmerick, Daniel S. Weld, Robert B. Doorenbos: Wrapper Induction for Information Extraction. IJCAI (1) 1997.

[2] Nilesh N. Dalvi, Philip Bohannon, Fei Sha: Robust web extraction: an approach based on a probabilistic tree-edit model. SIGMOD Conference 2009.

mitul
September 30, 2009

The Web: Organized »

During the early days of the web, when there were only a few thousand websites, there was a directory of websites. It was organized by category and used lots of category analysts to place websites in the appropriate place in the category tree. You could then browse your way along the category tree to get a list of websites in that subcategory. That directory was Yahoo.

We all know what happened next. The web exploded with thousands if not millions of people creating websites, making a manually created directory unsustainable. Enter Search to the rescue. Google, with its smart ranking algorithms, enabled us to find the pages we were looking for among what has now become billions of websites.

However, the web world has now grown so large that searches for common topics like “Obama Health Care” or “Ted Kennedy” return results in the millions. So how do we make sense of all these millions of results and find useful information that we can understand? To use a common metaphor, the web has become like a teenager’s stuffed closet. I think it’s now time for some house cleaning and organization.

How would we go about organizing the web?

  1. The first step in any house cleaning is throwing all the junk away. This is an interesting problem in itself, as one person’s junk may be another’s treasure. Here is where a combination of editors, user preferences and statistics may guide us in figuring out what to keep.
  2. The second step is to bring some sort of overall organization, given the space available. While disk space may be ubiquitous, what I mean here is the space on a user’s screen or on their mental shelf. So some sort of logical organization similar to what we see in libraries may be warranted. Additionally, we should definitely consider personalizing the layout of this library to a particular user’s interests and needs.
  3. The third step is to make things easily findable. Here we can reintroduce search, but rather than searching the entire web, we only search the organized part of the web that we created in step 2.

Are there companies that are attempting to do this? Sure! We have efforts from Google (iGoogle) and Yahoo (myYahoo) that, to name a few features, let you customize a homepage with a list of widgets/applications and tune search results based on your search history (and the feedback of users). Kosmix attempts to give users that essential overview about any topic by presenting information from the web in an easy-to-grasp fashion.

But in my opinion, no one has taken a step back, looked at this problem more holistically, and tried to change the paradigm. I am hoping someone does, or we will all soon drown in this tidal wave of information. If you know of companies or sites that are working on this, or if you disagree with my argument for a house cleaning, feel free to chime in!

sailesh
September 28, 2009

Changing face of Web Search »

Last week Yahoo unveiled a new search interface after bucket testing it for a while. On the surface the changes might seem minimal, and for the vast majority of search queries they will be. But for a significant volume of queries, especially the ones we call “topical” in search parlance, the interface offers something wholly new and refreshing.

Danny Sullivan at SearchEngineLand does a yeoman job of listing all the new features. While he likes the new interface he doesn’t think it will translate into higher market share for Yahoo.

Here at Kosmix, we see real value in offering users a vast breadth and depth of information for topical searches. Take a popular query like how to make sushi, which was Danny’s example. We offer videos, images, guides, how-tos, cookbooks, a link to the history of sushi, Martha Stewart’s adventures with sushi, news and blogs on sushi, celebrity takes on sushi, topics related to sushi and much more … all on one page and in under half a second. We do this by intelligently searching the web in real time for the best content on a topic and offering it to you in an easily digestible magazine format.

With Yahoo’s makeover and their bold rebranding effort, more users will be exposed to the new interface. Time will tell if they like what they see. Regardless of who wins and loses in the search marketplace, this continuing trend of richer search interfaces is a big win for consumers. What do you think?

manyam
September 25, 2009

Online Ad Networks and Measuring Brand “Lift” »

Jeremy Liew, the managing director of Lightspeed Venture Partners, one of Kosmix’s key investors, has an interesting post on SeekingAlpha today: In Search of the Next Ad Network Breakthrough.

In the piece, Jeremy argues that the next wave in ad network performance will be driven by two things:  advances in data aggregation and improvements in ad inventory.  He notes that as the big brands move their marketing dollars online,  ad networks will need a better way to demonstrate the campaign’s impact on the brand itself.

But what’s the best way to do that? Brand “lift” is a notoriously difficult thing to quantify. Jeremy points to Facebook’s partnership with Nielsen as an example of how online sales teams might go about tracking brand metrics.

Any other ideas out there?  What other startups do you see innovating in this area?

jodi
September 17, 2009

Quick thoughts on Bing Visual Search »

Here at Kosmix, we try to watch new developments in search technology hawkishly. It was with great excitement, then, that I tried Bing Visual Search as soon as I heard about it.

Have you heard about it yet? Bing Visual Search was recently announced by Microsoft as a way to navigate search queries and topics that are best represented visually. For example, if you’d like to see several bird species aggregated and presented to you in a gallery format, Bing’s your guy.

After trying it, I’m happy to assure you that Visual Search is worth the root canal that is the Silverlight install (yes, Visual Search is built using Microsoft’s Silverlight, and yes, it does take five agonizing minutes to download and install on my Mac).

More importantly, once you get over the initial hump, you’ll notice that Bing does a great job of organizing categories of images into galleries (politicians, birds, entertainers, you name it). One can visualize these galleries being incorporated into the main line search results for general topic queries like “Ford Mustang models”. The gallery, in turn, would show you Mustang models by year since launch – cool, no?!

The larger point, however, remains that the entire industry is slowly but surely dipping its toes into the presentation of visually rich information in search results. Here at Kosmix, we’ve always treated rich media – images, videos, even applications like BodyMaps – as first-class citizens. It’s nice to see our overall approach validated with the launch of apps like Visual Search.

We’ll continue to play with Bing Visual Search and post updates. Let us know if you found something interesting on there that you’d like to share!

saumil
September 16, 2009

Kosmix is hiring! »

Yes, I think that statement is absolutely exclamation point worthy. After the year everyone’s had, it’s nice to feel like the skies are parting and there are jobs to be found.

We have a number of positions available in our engineering department, from entry level for those just out of school to more senior positions requiring 10+ years of experience. We’re looking for superstars in the world of Categorization, Release, Information Retrieval, Systems Engineering and Relevance Architecture. If you love an intellectual challenge and have what it takes to thrive in an energetic, fast-paced environment, contact us at http://www.kosmix.com/corp/jobs.

Current Openings:

Product Analyst
Assist the Kosmix ContextLinks team in collecting and analyzing data to improve the relevance and user experience of the product. You will work closely with developers and product managers to identify product problems and areas for product improvement. Great position for recent college graduates.

Member of Technical Staff- Systems
Design, implement, and deploy high-performance, scalable systems and algorithms for massive data storage and distributed processing.

Member of Technical Staff- Categorization
Be a key part of building the world’s best Semantic Categorization platform. Design and implement data pipeline and tools to extract structured information from semi-structured and unstructured sources. The job requires a unique combination of Systems, Data Semantics, and Web Tools.

Sr Support/ Release Engineer
We are looking for an awesome engineer to manage and support Kosmix’s production sites. You will be responsible for the availability and performance of our high traffic sites. A key element of the role is diagnosing and resolving production software issues, requiring you to develop an in-depth understanding of Kosmix’s application architecture and work closely with our developers.

Member of Technical Staff- Information Retrieval/ Categorization
Apply a strong combination of interest and experience in consumer applications, algorithms, and systems to analyze, design and build the core of Kosmix’s Categorization and Topic Engines. The position offers a breadth of challenges involving consumer product and scalable systems. This is not just a classic algorithms position; it requires a passion for consumer experience, a willingness to go the last mile, and an attitude of doing what it takes!

Relevance Architect
This is a senior position with similar requirements to the Information Retrieval/ Categorization position above. Prior experience in Search/Relevance/Machine Learning and in designing large-scale architecture is a must.

Life at Kosmix
We love what we do here at Kosmix so we work hard, but also find time for fun. Ping pong tournaments, scooter races, trivia contests and laser tag are the norm. We eat lunch together every Friday and at least once a month we have cocktails together.

Benefits include medical, dental and vision with no premium for employees, spouses/ domestic partners, and dependents. We also provide life insurance for employees and the option to participate in a 401(k) plan managed by Fidelity. Kosmix offers subsidized commuter passes for those who take the train to work, or we’ll pay you to ride your bike. Employees who have been with the company for three years or less get 15 vacation days. We also have 11 observed holidays and one floating holiday (sick days taken when needed). We are headquartered in Mountain View and have a small office in San Francisco.

barbara
September 14, 2009

Why Wikipedia Can Make a Giant Leap Ahead for the Semantic Web »

A Giant Leap

Every time I want to look up facts, read about a topic, or am simply curious, I go to Wikipedia. So does everyone else. Wikipedia is a brilliant idea for the greatest compendium of knowledge ever created. Unfortunately, for a machine it is a blob of data with very little meaning… all that information is too hard to understand.

Semantic Web is an idea that has been alive for a while. The Internet has revolutionized our access to information. If computers could access that information and process it, imagine how much more we could do with it. But computers are not as smart as humans and we need to talk to them in a different language. We need Web pages and blogs and news and email to be written in this language that computers can understand, a language far simpler and more structured than English. The Semantic Web’s goal is to create this language and make it possible for people who create content to use it everyday.

When did you last learn a new language? Can your neighborhood blogger read French? Or Chinese? New languages are hard to learn and content creators need a BIG incentive to master and use them. Google, Bing, or Kosmix could, for example, provide that incentive by ranking Semantic Web Pages higher, as they can understand them better.

All right, so we know why this language is needed and we have some idea of the incentives that will push people to learn it. But where is this language? Who defines it? The World Wide Web Consortium (W3C) has been defining standards like “RDF” and “OWL” that they hope will lead to this global language. But these standards are really an “Alphabet” that tells us how to write this language. What they don’t give us is a dictionary, a vocabulary called a “Schema” that will help computers understand this language.

What is needed and what the proponents of the Semantic Web have failed to create is this global “Schema” of types of things and their relationships. A vocabulary that works for all the content on the Web. Something that tells a program that an iPhone is a “Mobile Phone” which is a “Phone” which is a “Communication Device” and also a “Personal Electronics Gadget”. A “Schema” is easily extendable and the extensions are easily standardized. There were no “eBook Readers” a few years ago. Kindle is an “eBook Reader.” Both my computer and yours should call it an “eBook Reader.”
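As a rough illustration of what such a “Schema” buys a program, here is a toy type hierarchy built from the examples above. The dictionary layout and the function names `ancestors` and `is_a` are assumptions made for this sketch (as is attaching “Personal Electronics Gadget” directly to the iPhone); this is not RDF or OWL syntax, just the underlying idea of types linked by “is a” relationships.

```python
from typing import Dict, List, Set

# Each thing or type maps to its direct parent types ("is a" links).
SCHEMA: Dict[str, List[str]] = {
    "iPhone": ["Mobile Phone", "Personal Electronics Gadget"],
    "Mobile Phone": ["Phone"],
    "Phone": ["Communication Device"],
    "Kindle": ["eBook Reader"],  # an extension added to the schema later
}


def ancestors(thing: str, schema: Dict[str, List[str]]) -> Set[str]:
    """Return every type reachable from `thing` by following "is a" links."""
    seen: Set[str] = set()
    stack = list(schema.get(thing, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(schema.get(current, []))
    return seen


def is_a(thing: str, type_name: str, schema: Dict[str, List[str]] = SCHEMA) -> bool:
    """True if `thing` belongs to `type_name`, directly or through parents."""
    return type_name in ancestors(thing, schema)


print(is_a("iPhone", "Communication Device"))
print(is_a("Kindle", "Phone"))
```

If two programs share the same schema, both can conclude that an iPhone is a communication device without either one having been told so directly.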

Who can create such a language? They would need the largest compendium of information ever created. They would need an easy way for the world to edit and change this compendium over time. They would need a process by which every piece of information in the compendium can be “defined” by a common schema. You see where I’m going here: They would need to be Wikipedia.

Frankly, this is an opportunity that Wikipedia has missed. Don’t get me wrong, Wikipedia has a lot of structure for humans, and lots of companies and researchers are writing sophisticated programs to understand this structure. But this structure isn’t yet complete or visible enough to be used by other content creators. If Wikipedia can evolve into a compendium of information that can also create and maintain this vocabulary, we can have another revolution with Wikipedia at the center. The amazing thing is, this is only a small step from where Wikipedia is today.

At Kosmix, we write sophisticated programs that understand pages on the Web, including Wikipedia. We want our programs to understand what people are writing so we can connect that information to those looking for it. But it would take decades for computing power and technology to grow enough to truly understand the English language. Another revolution in Wikipedia can skip the world ahead by a few decades.

digvijay
September 11, 2009

Friday Fun at Kosmix: Vermiculture (Yep, that means worms!) »

Adding worms to my first worm bin!

Every Friday at Kosmix all 60 of us sit down to have lunch together, and someone from the company gives a presentation. The talk can be about anything: people share their projects, hobbies, work and adventures. We’ve had discussions about everything from beekeeping to the physics of absolute zero to volunteer projects in African orphanages. The folks at Kosmix are an eclectic bunch, and the sessions are always entertaining. So today, I decided to haul in my Rubbermaid bin and talk about a project I’ve been experimenting with for the last few months – vermicomposting!

In case you didn’t know, compost is a fertilizer material that you can make from everyday food scraps. Adding it to potting soil gives a lot of nutrients to your plants, which is great especially if you have a garden! Vermicomposting is where you create compost by using worms to break down the material, and has a lot of benefits over the traditional compost pile method. The biggest one for me was time – cold compost piles take a long time before you can harvest the compost (anywhere from 13 to 18 months) whereas vermicomposting takes around 3 – 6 months. Another is location convenience, since I can put my worm bin in the garage where it’s easy for me to access. Plus in addition to reducing my food waste, I like to think of it as an extremely cool science experiment!

So what can you compost? You can compost a variety of “green” materials such as fruit/vegetable scraps, egg shells, tea leaves and coffee grounds. You also should mix in “brown” materials such as newspapers, junk mail (not glossy!), dead leaves, and cardboard. You can’t compost meat/fat, oily foods, dairy, or grains – the worms can’t/won’t break these down, although in the case of grains I hear it varies by type.

Set-up is actually really easy. All you need is two bins (one with holes drilled all around it) that fit inside each other, bedding material (shredded brown materials as listed above) and of course worms! Moisten the bedding until it’s like a wrung out sponge, and place about 4 inches worth inside the bin with holes in it. Add worms and add another couple inches of bedding on top. Voila! Now whenever you want to “feed” your worms, you pull back that upper layer of bedding, place your scraps, and cover it up again. You can harvest your compost when most of the food/bedding is gone, and has been replaced with a brown coffee-ground like material. An easy way to harvest is to set up another bin with holes in it, place that on top of the old bin and put all future scraps in that one. Then the worms will slowly migrate over through the holes in the new bin.

I hope this post intrigues a few minds and starts some new worm bins! In case you are interested in learning more, here are a few sites that have great articles and more details on how to get started.

Please feel free to ask questions and share your experiences in the comments!

christine
September 10, 2009

Girls in Tech Comes to Silicon Valley »

Finally, those of us in the South Bay now have a Girls in Tech chapter to call our very own.

This evening several of my Kosmix colleagues and I attended the first meeting of the new Silicon Valley chapter of Girls in Tech. A packed crowd—with nearly as many men as women!—came to the event to learn how to take their iPhone apps from concept to launch. Here are some of the highlights:

***

One of the new chapter’s organizers, Dhana Pawar, a leading mobile application development expert, opened the session by outlining the basic steps for successfully launching an iPhone app:

1) Get a developer account from Apple
2) Choose a great design and development partner
3) Use the ad-hoc deployment process. Keep it simple. Start small, gauge the reaction of a focus group, and iterate from there.
4) Submit to Apple for approval. It takes anywhere from two to three weeks, so build the extra time into your launch plan.
5) Invest in marketing and PR. Social media can be very powerful here.

***

Next, Suzanne Ginsburg of Ginsburg Design shared her advice for user-centered iPhone app design. She’s working on a book about this topic, which should be coming out in June 2010. Here are the five most common pitfalls that she sees iPhone developers make:

1) iPhone apps that are too complicated and unfriendly to set up

2) Tasks that require too many steps or too much typing

3) The inability to synch the iPhone app with a desktop or Web version

4) The app doesn’t remember where the user left off

5) No content for a given location, even though the app bills itself as “national”

Suzanne then offered three simple tips for creating a successful iPhone app:

1) Conduct upfront user research to understand usability and discover new opportunities. Methods include shadowing, field interviews and diary studies.

2) Brainstorm and sketch like mad. Apple’s Human Interface Guidelines are a good place to start, but try to see beyond the basic frameworks.

3) Refine and Test Promising Directions. Usability testing your concepts will help uncover issues related to setup, flows, terminology and more.


***

The final presentation was by AdMob’s Mike Fyall, an expert in promoting and monetizing mobile applications through advertising. His company recently conducted research about iPhone user behavior, and came up with some interesting findings:

1) The average user downloads 10 apps per month, and the average iPod Touch user downloads nearly three times that amount.

2) More than twice as many iPhone and iPod Touch users have adopted paid apps as Android users.

3) Fifty percent of iPhone users buy paid apps, with an average of 1-2 paid apps each.

4) Users most often discover apps by browsing the AppStore and searching directly.

5) Over 90% of people download apps on their phone, rather than from their computers.

At Kosmix, we have two iPhone apps: MeeHive and Samachar News. Check them out at the App Store and let us know how well we followed the experts’ advice!

jodi