Is it possible to perform document similarity search efficiently using Sphinx? My index consists of 500k documents, each of which is tagged with 5-30 short, lowercase, stemmed words; these tags are the data to search through. For simplicity, all tags in the database have equal weight, and I'm not using phrase searching. My first attempt was to OR all the terms in the target document and use that as the query to find similar documents:

/app/sphinx/bin/search -l 10 -a "term1 term2 .. termN"

This approach works very well but is too slow: the query completes in ~400 ms, and because I need to run it in real time, I want it to take less than 200 ms. So next I tried combining AND-ed and OR-ed terms to reduce the number of documents Sphinx has to rank. The most important terms are AND-ed and the less important ones are OR-ed:

/app/sphinx/bin/search -l 10 -b 'termA & termB & termC & (term1|term2|...|termN)'
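
For illustration, the two query strings above could be assembled from a document's tags roughly like this. It is only a minimal Python sketch: the helper names `build_any_query` and `build_boolean_query` are hypothetical (they are not part of Sphinx), and how the "important" terms get picked is left open.

```python
def build_any_query(terms):
    """Space-separated terms, as passed to `search -a` (match any term)."""
    return " ".join(terms)


def build_boolean_query(required, optional):
    """AND the required terms together, then AND an OR-group of the optional ones.

    Both arguments are lists of already stemmed, lowercase tags (assumption).
    """
    parts = list(required)
    if optional:
        parts.append("(" + "|".join(optional) + ")")
    return " & ".join(parts)


if __name__ == "__main__":
    # Placeholder tags, mirroring the commands above.
    print(build_any_query(["term1", "term2", "term3"]))
    print(build_boolean_query(["termA", "termB", "termC"], ["term1", "term2"]))
```

The resulting strings would be passed as the last argument to the `search` commands shown above.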

The boolean query performs perfectly, but unfortunately Sphinx doesn't rank the results of boolean queries. That makes it useless for me, because I want to find the closest matches, not just any document that happens to match.

So can anyone think of a better way to do it? I want to avoid precalculating the similarities because the index changes often and I also want to personalize the recommendations.