I have a large collection of documents whose TF-IDF scores have already been computed. I'm getting ready to add more documents to the collection. Is there a way to compute TF-IDF scores for the new documents without re-processing the entire database?

+2  A: 

Basically there are two options:

  1. Compute your tf-idf scores only when you need them. Adding a new document then becomes trivial: update the total number of documents, update the number of documents in which each token occurs, and store the token occurrence vector for the new document.

  2. Periodically recompute your tf-idf vectors, maybe after adding 100K documents or something like that. In between, just work with the old values (total number of documents, number of documents each token occurs in).
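Option 1 could be sketched roughly as follows. This is only an illustration, not a reference implementation: the class name `LazyTfidfIndex`, its methods, and the particular weighting (relative term frequency times `log(N / df)`) are assumptions, and document ids are assumed to be unique.

```python
import math
from collections import Counter


class LazyTfidfIndex:
    """Hypothetical index that stores raw counts only; tf-idf is derived on demand."""

    def __init__(self):
        self.num_docs = 0          # number of all documents
        self.doc_freq = Counter()  # token -> number of documents containing it
        self.term_counts = {}      # doc_id -> Counter of token occurrences

    def add(self, doc_id, tokens):
        """Adding a document only touches the counters, never older documents."""
        counts = Counter(tokens)
        self.term_counts[doc_id] = counts
        self.num_docs += 1
        # Each distinct token is counted once per document.
        self.doc_freq.update(counts.keys())

    def tfidf(self, doc_id):
        """Compute the tf-idf vector for one document from the current counts."""
        counts = self.term_counts[doc_id]
        total = sum(counts.values())
        return {
            tok: (n / total) * math.log(self.num_docs / self.doc_freq[tok])
            for tok, n in counts.items()
        }
```

Because nothing is cached, a call to `tfidf` always reflects every document added so far. The price is that the weights are recomputed on each lookup.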

If your collection is really large, you'll probably want to take the second approach, because a batch of new documents won't change the global distribution of words much anyway. That said, it's better to test both methods and settle on the one that fits your problem best.
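A minimal sketch of option 2, under some assumptions: the class `SnapshotTfidf` and the parameter `refresh_every` are made-up names, and the smoothed idf `log((1 + N) / (1 + df))` is one possible choice (not something prescribed above) so that tokens missing from the old statistics still get a finite weight. Recomputing the stored vectors of older documents after a refresh is left out.

```python
import math
from collections import Counter


class SnapshotTfidf:
    """Hypothetical scorer that weights new documents with a stale statistics snapshot."""

    def __init__(self, refresh_every=100_000):
        self.refresh_every = refresh_every
        self.num_docs = 0          # live number of all documents
        self.doc_freq = Counter()  # live token -> document frequency
        self.pending = 0           # documents added since the last refresh
        self.snap_num_docs = 0     # "old values" used for scoring
        self.snap_doc_freq = {}

    def refresh(self):
        """Copy the live statistics into the snapshot used for scoring."""
        self.snap_num_docs = self.num_docs
        self.snap_doc_freq = dict(self.doc_freq)
        self.pending = 0

    def add(self, tokens):
        """Update live counts, refresh if due, and score the document."""
        counts = Counter(tokens)
        self.num_docs += 1
        self.doc_freq.update(counts.keys())
        self.pending += 1
        if self.pending >= self.refresh_every:
            self.refresh()
        return self.weights(counts)

    def weights(self, counts):
        """tf-idf from the snapshot; unseen tokens fall back to df = 0."""
        total = sum(counts.values())
        result = {}
        for tok, n in counts.items():
            df = self.snap_doc_freq.get(tok, 0)
            idf = math.log((1 + self.snap_num_docs) / (1 + df))  # smoothing is an assumption
            result[tok] = (n / total) * idf
        return result
```

With this kind of smoothing, a token that first appears between refreshes is treated as maximally rare rather than being dropped. Whether that is the right behavior depends on the application.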

ephes
If you take option 2, wouldn't you leave out new, previously unseen tokens? Couldn't that be bad for recall?
johanbev