Hi all, thank you to all the great people here for helping people like me :) I just need a small hint.

I calculated the tf/idf values of two documents. These are the values:

1.txt: 0.0 0.5
2.txt: 0.0 0.5

The documents are:

1.txt => dog cat
2.txt => cat elephant

Now that I have the tf/idf values, can anybody tell me how to use them to calculate cosine similarity?

I have already read Wikipedia and other tutorials, which say I should calculate the dot product, then find the distance, then divide the dot product by the distance. I am not good at math, which is why I couldn't understand what they are doing with X and Y :)

If you can just show me how to calculate it using my values, I will understand and implement it.

One more question: is it important that both documents have the same number of words?

Thanks !

+4  A: 
            a * b
sim(a,b) =--------
           |a|*|b|

a*b is the dot product, and |a| is the length (Euclidean norm) of a.
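
Written out per component, the same formula looks like this. It is the standard definition; the index i and the vector length n are notation introduced here, not taken from the question.

```latex
\[
\mathrm{sim}(a,b)
  = \frac{a \cdot b}{\lvert a \rvert \, \lvert b \rvert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^2}\;\sqrt{\sum_{i=1}^{n} b_i^2}}
\]
```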

some details:

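import math

try:
  xrange          # exists in Python 2
except NameError:
  xrange = range  # Python 3 has no xrange; range does the same job here

def dot(a, b):
  # sum of element-wise products; a and b must have the same length
  n = len(a)
  total = 0.0
  for i in xrange(n):
    total += a[i] * b[i]
  return total

def norm(a):
  # Euclidean length of the vector: square root of the sum of squares
  n = len(a)
  total = 0.0
  for i in xrange(n):
    total += a[i] * a[i]
  return math.sqrt(total)

def cossim(a, b):
  # undefined (division by zero) if either vector is all zeros
  return dot(a, b) / (norm(a) * norm(b))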
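
To make this concrete with the numbers from the question, here is a minimal sketch. It assumes that the two values listed for each document are in the same term order, which the question does not actually state. The function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are illustrative choices, not part of the code above.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: an all-zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)

# tf/idf values from the question, ASSUMED to be in the same term order
doc1 = [0.0, 0.5]  # 1.txt
doc2 = [0.0, 0.5]  # 2.txt

print(cosine_similarity(doc1, doc2))  # 1.0
```

Step by step: the dot product is 0.0*0.0 + 0.5*0.5 = 0.25, each norm is sqrt(0.25) = 0.5, so the result is 0.25 / (0.5 * 0.5) = 1.0. The two vectors are identical, so they point in the same direction.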

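Yes, to some extent: a and b must have the same length, with one entry per term in the combined vocabulary (a term that does not occur in a document gets 0), so the documents themselves do not need the same number of words. In practice a and b usually have a sparse representation, so you only need to store the non-zero entries, and norm and dot can be calculated faster.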
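
One way to sketch the sparse idea is to store each document as a dict from term to weight, so the two documents can hold different numbers of terms. The name `sparse_cosine` and the weights below are made up for illustration; they are not the question's tf/idf values.

```python
import math

def sparse_cosine(a, b):
    """Cosine similarity of two {term: weight} dicts; a missing term counts as 0."""
    # Only terms present in both dicts can contribute to the dot product.
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: empty vectors match nothing
    return dot / (norm_a * norm_b)

# Hypothetical weights; the documents have different numbers of terms.
doc1 = {"dog": 1.0, "cat": 1.0}
doc2 = {"cat": 1.0, "elephant": 1.0, "zebra": 1.0}

print(sparse_cosine(doc1, doc2))
```

Here only "cat" is shared, so the dot product is 1.0 and the result is 1 / (sqrt(2) * sqrt(3)).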

Yin Zhu
Thanks, but I am confused about one more thing. I have seen people discuss this around the net and couldn't follow it: should I calculate cosine similarity on the tf/idf values, on only the idf values, or on only the tf values? I know PHP and have started learning Java, but I'm sorry, I don't know which language your code is in. Can you please let me know, so I can look up that language's basic syntax? Or, if you could use my tf/idf values to calculate the cosine similarity, it would show me how to write a function for that. Thanks again for the reply!
@agazerboy The sample is given in Python, which should be quite readable. for i in xrange(n) means for (i=0; i<n; i++). You should calculate on the tf-idf values; sometimes you can also use tf alone.
Yin Zhu
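
To see how the choice of weights changes the result, here is a rough Python 3 sketch that compares plain tf with tf-idf on the two documents from the question. The weighting is an assumption: tf is the raw count and idf is log(N / df) with no smoothing. Many other variants exist, and this one is not claimed to reproduce the values in the question. All function names are illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: tf * idf} dict per document (assumed variant: idf = log(N / df))."""
    n = len(docs)
    df = Counter()
    for words in docs:
        df.update(set(words))  # count each term once per document
    vectors = []
    for words in docs:
        tf = Counter(words)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity of two {term: weight} dicts."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = [["dog", "cat"], ["cat", "elephant"]]  # 1.txt and 2.txt
tf_vecs = [dict(Counter(d)) for d in docs]
tfidf_vecs = tfidf_vectors(docs)

print(cosine(tf_vecs[0], tf_vecs[1]))        # about 0.5
print(cosine(tfidf_vecs[0], tfidf_vecs[1]))  # 0.0
```

With plain tf, the shared word "cat" gives a similarity of about 0.5. Under this idf variant, "cat" appears in both documents, so its idf is log(2/2) = 0 and the tf-idf similarity drops to 0.0. With only two documents the effect is extreme; on a larger collection, shared but rare terms would keep a non-zero weight.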
Please read my explanation below!