I'm trying to use TF-IDF to sort documents into categories. I've calculated the tf_idf values for some documents, but when I try to calculate the cosine similarity between two of those documents I get a traceback saying:

# len(u) == 201, len(v) == 246

cosine_distance(u, v)
ValueError: objects are not aligned

# this works though:
cosine_distance(u[:200], v[:200])
>> 0.52230249969265641

Is slicing the vector so that len(u)==len(v) the right approach? I would think that cosine similarity would work with vectors of different lengths.

I'm using this function:

import math
import numpy

def cosine_distance(u, v):
    """
    Returns the cosine of the angle between vectors u and v. This is equal to
    u.v / (|u||v|). (Despite the name, this is cosine similarity, not a distance.)
    """
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))
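
For what it's worth, here is a minimal stand-in that reproduces the error (only the lengths matter, the values are dummies):

import numpy

u = numpy.ones(201)  # stand-ins for the real tf_idf vectors
v = numpy.ones(246)

try:
    numpy.dot(u, v)
except ValueError as e:
    # numpy.dot rejects 1-D arrays of different lengths; older numpy
    # versions report this as "objects are not aligned"
    print(e)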

Also -- is the order of the tf_idf values in the vectors important? Should they be sorted -- or is it of no importance for this calculation?

+1  A: 

Are you computing the cosine similarity of term vectors? Term vectors should be the same length. If a word isn't present in a document, the vector should have a value of 0 for that term.

I'm not exactly sure which vectors you're applying cosine similarity to, but for cosine similarity the vectors should always be the same length, and order very much matters.

Example:

Term | Doc1 | Doc2
Foo  |  .3  |  .7
Bar  |   0  |   8
Baz  |   1  |   1

Here you have two vectors, (.3, 0, 1) and (.7, 8, 1), and can compute the cosine similarity between them. If you compared (.3, 1) and (.7, 8), you'd be comparing the Doc1 score of Baz against the Doc2 score of Bar, which wouldn't make sense.
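
As a rough sketch of how to build those equal-length vectors in code (the documents and tf_idf scores below are invented for illustration, and cosine_similarity is a plain-Python stand-in, not your function):

import math

# Hypothetical tf_idf scores keyed by term (values invented for illustration)
doc1_scores = {"foo": 0.3, "baz": 1.0}
doc2_scores = {"foo": 0.7, "bar": 8.0, "baz": 1.0}

# One global order over every term seen in any document; a term missing
# from a document contributes 0.0 in that position
vocabulary = sorted(set(doc1_scores) | set(doc2_scores))
u = [doc1_scores.get(term, 0.0) for term in vocabulary]
v = [doc2_scores.get(term, 0.0) for term in vocabulary]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(len(u) == len(v))        # True: both vectors follow the same global order
print(cosine_similarity(u, v))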

Pace
The vectors that I'm passing to the cosine_distance function are Python lists of the tf_idf values. v[:5] == [0.0060830126968545294, 0.00048241996565891193, 0.0020712248617478965, 0.0110036199241575, 0.0110036199241575]. You say order matters -- what is the correct way to sort the content of the vector (smallest->largest, order of words in document)?
erikcw
You have to assign some global order to the words. If bar comes before foo in that order, then the tf_idf value of bar should be the first item in every document's vector, including doc2, regardless of where the words appear in the document text.
Pace
+1  A: 

You need to multiply the entries for corresponding words in the two vectors, so there should be a global order for the words. This means that, in theory, your vectors should be the same length.

In practice, if one document was seen before the other, words from the second document may have been added to the global order only after the first document was processed. So even though the vectors follow the same order, the first document's vector may be shorter, since it has no entries for words that entered the vocabulary later.

Document 1: The quick brown fox jumped over the lazy dog.

Global order:     The quick brown fox jumped over the lazy dog
Vector for Doc 1:  1    1     1    1     1     1    1   1   1

Document 2: The runner was quick.

Global order:     The quick brown fox jumped over the lazy dog runner was
Vector for Doc 1:  1    1     1    1     1     1    1   1   1
Vector for Doc 2:  1    1     0    0     0     0    0   0   0    1     1

In this case, in theory you need to pad the Document 1 vector with zeroes on the end. In practice, when computing the dot product, you only need to multiply elements up to the end of Vector 1: omitting the extra elements of Vector 2 gives exactly the same result as multiplying them by zero, and visiting them is slower.

Then you can compute the magnitude of each vector separately, and for that the vectors don't need to be of the same length.
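
Here is a sketch of that scheme (one way to implement it, not necessarily the code you're running): the u.v term stops at the end of the shorter vector, while each magnitude runs over its own vector's full length.

import math

def dot_product(u, v):
    # zip() stops at the end of the shorter vector, which is safe for u.v:
    # the shorter vector's missing trailing entries are implicitly zero
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    # Each magnitude is computed over that vector's own full length,
    # so neither padding nor truncation is ever needed
    return dot_product(u, v) / (
        math.sqrt(dot_product(u, u)) * math.sqrt(dot_product(v, v))
    )

# The Doc 1 / Doc 2 vectors from the example above:
doc1 = [1, 1, 1, 1, 1, 1, 1, 1, 1]
doc2 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
print(cosine_similarity(doc1, doc2))  # 0.3333..., same as with explicit zero-padding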

Ken Bloom
@Ken: When computing cosine similarity, the padded zeros do make a difference. The function I'm using is: cosine-similarity = dot-product(u, v) / (sqrt(dot-product(u, u)) * sqrt(dot-product(v, v))).

cosine-similarity([1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0], [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1]) == 0.3333333333333333
cosine-similarity([1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 0, 0, 0, 0, 0, 0, 0]) == 0.4714045207910316

It looks like the problem is that the divisor of cosine-similarity performs a dot product of each vector with itself, so the padding has an effect. Do you know of any way around this?
erikcw
@erikcw: When you compute the dot product `Doc1*Doc1`, you need to go to the full length of `Doc1` but no further; for `Doc2*Doc2`, go to the full length of `Doc2` but no further. For those self products, don't truncate the longer vector, but there's no need to pad the shorter one. However, when computing `Doc1*Doc2`, you can safely truncate the longer vector. So your code should be designed to have the `dot-product` function make the padding/truncation decisions, rather than having the `cosine-similarity` function make them.
Ken Bloom