Hello,

I am trying to determine document similarity between a single document and each of a large number of documents (n ~= 1 million) as quickly as possible. More specifically, the documents I'm comparing are e-mails; they are grouped (i.e., there are folders or tags) and I'd like to determine which group is most appropriate for a new e-mail. Fast performance is critical.

My a priori assumption is that the cosine similarity between term vectors is appropriate for this application; please comment on whether this is a good measure to use or not!
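
For concreteness, here is the computation I have in mind, on a sparse dict-of-term-weights representation (just a sketch; the representation is my assumption, not a requirement):

    import math

    def cosine_similarity(a, b):
        # a and b are sparse term vectors: {term: weight}
        # Iterate over the smaller vector when computing the dot product.
        if len(a) > len(b):
            a, b = b, a
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)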

I have already considered the following possibilities for speeding things up:

  1. Pre-normalize all the term vectors, so that each comparison reduces to a single dot product (see the sketch after this list)

  2. Calculate a term vector for each group (n ~= 10,000) rather than each e-mail (n ~= 1,000,000); this would probably be acceptable for my application, but if you can think of a reason not to do it, let me know!
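
To illustrate item 1: if every vector is scaled to unit length once, at index time, then each later comparison is just a sparse dot product (same dict representation as above; a sketch, not production code):

    import math

    def normalize(vec):
        # Scale a sparse term vector to unit length, once, up front.
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm == 0:
            return dict(vec)
        return {t: w / norm for t, w in vec.items()}

    def dot(a, b):
        # Cosine similarity of two pre-normalized vectors is just this.
        if len(a) > len(b):
            a, b = b, a
        return sum(w * b.get(t, 0.0) for t, w in a.items())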

I have a few questions:

  1. If a new e-mail has a new term never before seen in any of the previous e-mails, does that mean I need to re-compute all of my term vectors? This seems expensive.

  2. Is there some clever way to only consider vectors which are likely to be close to the query document? (One idea I've been toying with is sketched after this list.)

  3. Is there some way to be more frugal about the amount of memory I'm using for all these vectors?
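
For questions 2 and 3, the direction I've been toying with is an inverted index from terms to groups: it keeps storage sparse (only nonzero entries), and only groups sharing at least one term with the query can have nonzero cosine similarity. A rough sketch (all names here are hypothetical):

    from collections import defaultdict

    # term -> set of group ids whose vectors contain that term
    term_index = defaultdict(set)

    def index_group(group_id, vec, index=term_index):
        for term in vec:
            index[term].add(group_id)

    def candidate_groups(query_vec, index=term_index):
        # Only these groups can have nonzero similarity with the query,
        # so the full dot product is computed for them alone.
        candidates = set()
        for term in query_vec:
            candidates |= index.get(term, set())
        return candidates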

Thanks!

+3  A: 

Use Bayesian filtering. The link provided refers to spam filtering, but you can adapt the algorithm pretty easily to multiple categories/tags.
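
As a rough illustration of the multi-category adaptation (naive Bayes with log-probabilities and add-one smoothing; the count tables are assumed to be precomputed from your labeled e-mails):

    import math

    def classify(terms, term_counts, total_terms, doc_counts, vocab_size):
        # terms:       tokens from the new e-mail
        # term_counts: {category: {term: count}} from the training e-mails
        # total_terms: {category: total term occurrences in that category}
        # doc_counts:  {category: number of training e-mails}
        n_docs = sum(doc_counts.values())
        best_cat, best_score = None, float('-inf')
        for cat in term_counts:
            # log prior + sum of log likelihoods (add-one smoothing)
            score = math.log(doc_counts[cat] / n_docs)
            for t in terms:
                score += math.log((term_counts[cat].get(t, 0) + 1.0)
                                  / (total_terms[cat] + vocab_size))
            if score > best_score:
                best_cat, best_score = cat, score
        return best_cat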

There are lots of good SO questions about Bayesian filtering, too.

JSBangs
Thanks for the recommendation. Bayesian filtering is an interesting idea. I have a few questions: 1. Why do you think Bayesian filtering is better than cosine similarity in this case? 2. I might be missing something, but won't my categorization time still be O(n), where n is the number of categories, just as it would be for cosine similarity? I think what I need is an O(log n) or O(1) lookup to rank probable categorization candidates.
Peyton