I'm trying to use TF-IDF to sort documents into categories. I've calculated the tf_idf for some documents, but now when I try to calculate the Cosine Similarity between two of these documents I get a traceback saying:
#len(u)==201, len(v)==246
cosine_distance(u, v)
ValueError: objects are not aligned
#this works though:
cosine_distance(u[:200], v[:200])
>> 0.52230249969265641
Is slicing the vector so that len(u)==len(v) the right approach? I would think that cosine similarity would work with vectors of different lengths.
I'm using this function:
def cosine_distance(u, v):
"""
Returns the cosine of the angle between vectors v and u. This is equal to
u.v / |u||v|.
"""
return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))
Also -- is the order of the tf_idf values in the vectors important? Should they be sorted -- or is it of no importance for this calculation?