ansaurus

Question

Answer 1

+4 A:

Let's see if I get your question: You want to calculate the TF/IDF similarity between the two documents:

Doc A: cat dog

and

Doc B: dog sparrow

I take it that this is your whole corpus, so |D| = 2. The TFs are indeed 0.5 for all words. To calculate the IDF of 'dog', take log(|D| / |{d : dog in d}|) = log(2/2) = 0. Similarly, the IDFs of 'cat' and 'sparrow' are log(2/1) = log(2) = 1 (I use 2 as the log base to make this easier).

Therefore, the TF/IDF value for 'dog' is 0.5 * 0 = 0, and the TF/IDF value for 'cat' and 'sparrow' is 0.5 * 1 = 0.5.
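The arithmetic above can be reproduced with a short sketch. This is only an illustration under stated assumptions: TF is taken as the term count divided by document length, IDF uses base 2 as in the answer, and the function names `tf`, `idf`, and `tf_idf` are made up for this example.

```python
import math

# The two-document corpus from the answer.
corpus = {
    "A": ["cat", "dog"],
    "B": ["dog", "sparrow"],
}

def tf(term, doc):
    """Term frequency: occurrences of the term divided by document length."""
    return doc.count(term) / len(doc)

def idf(term, docs, base=2):
    """Inverse document frequency: log(|D| / number of docs containing the term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing, base)

def tf_idf(term, doc, docs, base=2):
    return tf(term, doc) * idf(term, docs, base)

docs = list(corpus.values())
for term in ("cat", "dog", "sparrow"):
    # One TF/IDF value per document, in the order A, B.
    print(term, [tf_idf(term, doc, docs) for doc in docs])
# cat [0.5, 0.0]
# dog [0.0, 0.0]
# sparrow [0.0, 0.5]
```

'dog' scores 0 in both documents because it appears in every document of the corpus, so its IDF is log(2/2) = 0.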

To measure the similarity between the two documents, you should calculate the cosine between their vectors in the (cat, sparrow, dog) space: (0.5, 0, 0) for Doc A and (0, 0.5, 0) for Doc B. The result is 0.
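As a rough sketch of that cosine calculation, using the two TF/IDF vectors from this answer (the helper name `cosine` and the choice to return 0.0 for an all-zero vector are assumptions made for this example):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)

# TF/IDF vectors in the (cat, sparrow, dog) space.
doc_a = (0.5, 0.0, 0.0)
doc_b = (0.0, 0.5, 0.0)

print(cosine(doc_a, doc_b))  # 0.0
```

The dot product is 0 because no coordinate is non-zero in both vectors, so the cosine is 0 regardless of the vector lengths.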

To sum it up:

You have an error in the IDF calculations.
This error creates wrong TF/IDF values.
The Wikipedia article does not explain the use of TF/IDF for similarity well enough. I like Manning, Raghavan & Schütze's explanation much better.

Yuval F 2009-12-31 20:46:32

Thanks Yuval! You made my life easy :) There are two problems. First, I was using the natural log; I couldn't find a log2 function in Java, but I will figure it out. The second problem is more important: I don't understand how you are measuring similarity with cosine. If TF/IDF says 50% similarity, why does cosine say 0%?

2009-12-31 21:39:23

You're welcome. I believe using natural log is better; it was just easier to explain using base 2. Let's clarify the cosine similarity: TF/IDF is purely a representation. You convert a vector of word counts to a vector of TF/IDF values. The cosine similarity is the scalar (dot) product of two normalized vectors; the vectors can be the original counts or the counts transformed by TF/IDF. In the case as you stated it, the dot product will be zero because we either have words appearing in only one vector, or a common word with a zero score ('dog'). HTH.

Yuval F 2010-01-01 10:53:40
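To illustrate the comment above, here is a small sketch that computes the cosine similarity as the dot product of normalized vectors, once on raw word counts and once on the TF/IDF values from the answer. The helper names `normalize` and `dot` are made up for this example, and the vectors use the (cat, sparrow, dog) ordering.

```python
import math

def normalize(v):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Raw word counts: Doc A = "cat dog", Doc B = "dog sparrow".
raw_a, raw_b = [1, 0, 1], [0, 1, 1]
# TF/IDF values from the answer ('dog' is weighted down to 0).
tfidf_a, tfidf_b = [0.5, 0.0, 0.0], [0.0, 0.5, 0.0]

raw_sim = dot(normalize(raw_a), normalize(raw_b))
tfidf_sim = dot(normalize(tfidf_a), normalize(tfidf_b))
print(round(raw_sim, 3), round(tfidf_sim, 3))  # 0.5 0.0
```

On raw counts the shared word 'dog' contributes to the dot product, so the similarity is 0.5. After the TF/IDF transformation 'dog' has weight 0, and the similarity drops to 0.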

Thanks Yuval. If I use natural log, my TF/IDF values are different than yours. If I use log2, I think I get correct results. Can you please tell me what the difference is between LSI and vector space? Sorry if it sounds like a dumb question. If you could send me a good tutorial on how to implement LSI, it would be a great help.

2010-01-01 23:31:17

This is by no means a dumb question. Informally, LSI is a way to weight term frequency vectors that uses more information from the term-document matrix than TF/IDF does, via a singular value decomposition (SVD). I suggest you read this: http://sujitpal.blogspot.com/2008/09/ir-math-with-java-tf-idf-and-lsi.html for a theoretical explanation and an implementation guide.

Yuval F 2010-01-03 10:02:56

Tomáš Kafka 2010-10-14 12:20:29

Answer 2

A:

I think you have to take ln instead of log.

Toqir 2010-01-03 16:10:39
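On the choice of log base discussed in this thread: switching between ln and log2 only rescales every IDF by the same constant, since log2(x) = ln(x) / ln(2). The sketch below shows this for the IDF of 'cat' in the two-document corpus; the function name `idf` and its parameters are assumptions made for this example. The same change of base is one way to get a base-2 logarithm in Java, where `Math.log` is the natural log: `Math.log(x) / Math.log(2)`.

```python
import math

def idf(num_docs, docs_with_term, base=math.e):
    """IDF as log(num_docs / docs_with_term) in the given base."""
    return math.log(num_docs / docs_with_term, base)

idf_ln = idf(2, 1)              # natural log, about 0.693
idf_log2 = idf(2, 1, base=2)    # base 2, equal to 1

# Change of base: log2(x) = ln(x) / ln(2).
converted = idf_ln / math.log(2)

# TF/IDF of 'cat' in Doc A under each base (TF = 0.5).
print(0.5 * idf_ln, 0.5 * idf_log2)
```

Because every component of every TF/IDF vector is multiplied by the same factor, the cosine similarity between documents does not depend on the log base, even though the individual TF/IDF values do.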


tf idf similarity problem
