How do you efficiently implement a document similarity search system?

views:

303

answers:

+1 Q:

How do you efficiently implement a document similarity search system?

How do you implement a "similar items" system for items described by a set of tags?

In my database, I have three tables, Article, ArticleTag and Tag. Each Article is related to a number of Tags via a many-to-many relationship. For each Article i want to find the five most similar articles to implement a "if you like this article you will like these too" system.

I am familiar with Cosine similarity and using that algorithm works very well. But it is way to slow. For each article, I need to iterate over all articles, calculate the cosine similarity for the article pair and then select the five articles with the highest similarity rating.

With 200k articles and 30k tags, it takes me half a minute to calculate the similar articles for a single article. So I need another algorithm that produces roughly as good results as cosine similarity but that can be run in realtime and which does not require me to iterate over the whole document corpus each time.

Maybe someone can suggest an off-the-shelf solution for this? Most of the search engines I looked at does not enable document similarity searching.

+1 A:

Some questions,

How is ArticleTag different from Tag? Or is that the M2M mapping table?
Can you sketch out how you've implemented the cosine matching algorithm?
Why don't you store your document tags in an in memory data structure of some sort, using it only to retrieve document IDs? This way, you only hit the database during retrieval time.
Depending on the freq of document addition, this structure can be designed for fast/slow updates.

Initial intuition towards an answer - I'd say, an online clustering algorithm (perhaps do a Principal Components Analysis on the co-occurrence matrix, which will approximate a K-means cluster?). Better refined once you answer some of those questions above.

Cheers.

viksit 2010-02-04 21:45:45

1. Yes ArticleTag is the M2M table.2. It's implemented like the Wikipedia article prescribes. The tagsfor an article is seen as vectors where the occurence of a tag has thevalue 1 and 0 of it is omitted.3. I do store document ids and its tag ids in memory, which of coursespeeds up the process a lot but it is still to slow.I think k-means clustering may be what I need, but I cant quite seehow to use it for this scenario

Björn Lindqvist 2010-04-08 11:51:33

You can do it with the Lemur toolkit. With its KeyfileIncIndex, you have to re-retrieve the document from its source; the IndriIndex supports retrieving the document from the index.

But anyway, you index your documents, and then you build a query from the document you want to find similar documents to. You can then do a search with that query, and it will score the other documents for similarity. It's pretty fast in my experience. It treats both source documents and basic queries as documents, so finding similarities is really what it does (unless you're using the Indri parser stuff - that's a bit different, and I'm not sure how it works).

Michael E 2010-03-19 01:49:00

That is not different from how normal text search engines work. The essence of the problem is that to find the, say ten best matches, you have to find the similarity score between the search query and each document in the database. Then you have to order them on the similarity score and take the ten best. Because you have to go through the full database for each article you want to find similarities for it becomes way to slow.

Björn Lindqvist 2010-04-08 11:44:30

ansaurus

tags:

views:

answers:

How do you efficiently implement a document similarity search system?

related questions