views:

373

answers:

2

TF-IDF (term frequency - inverse document frequency) is a staple of information retrieval. It's not a proper model though, and it seems to break down when new terms are introduced into the corpus. How do people handle it when queries or new documents have new terms, especially if they are high frequency. Under traditional cosine matching, those would have no impact on the total match.

+1  A: 

Er, nope, doesn't break down.

Say I have two documents, A "weasel goat" and B "cheese gopher". If we actually represented these as vectors, they might look something like:

A [1,1,0,0]
B [0,0,1,1]

and if we've allocated these vectors in an index file, yeah, we've got a problem when it comes time to add a new term. But the trick of it is, that vector never exists. The key is the inverted index.

As far as new terms not affecting a cosine match, that might be true depending on what you mean. If I search my corpus of (A,B) with the query "marmoset kungfu", neither marmoset nor kungfu exist in the corpus. So the vector representing my query will be orthogonal to all the documents in the collection, and get a bad cosine similarity score. But considering none of the terms match, that seems pretty reasonable.

Jay Kominek
A: 

When you talk about "break down" I think you mean that the new terms have no impact on the similarity measure, because they do not have any representation in the vector space defined by the original vocabulary.

One approach to handle this smoothing problem would be to consider fixing the vocabulary to a smaller vocabulary and treat all words rarer than a certain threshold as belonging to the special _UNKNOWN_ word.

However, I don't think your definition of "break down" is very clear; could you clarify what you mean there? If you could clear that up, perhaps we could discuss ways to work around those problems.

Trochee