Thanks to everyone on this site for the earlier help with TF/IDF; it helped me a lot in writing a tf-idf function in Java. I have TF working, but I have one question. Wikipedia says that IDF is calculated from the number of documents that contain the term, but I am confused about what exactly to count.

For example, take the string "JosAH is great. JosAH rocks": the TF of "JosAH" would be 2/5. For IDF, suppose there are 2 documents and each document contains the term "JosAH". Do we only check whether the term occurs in the other documents, or do we count how many times it occurs in them?

+1  A: 

I'm not entirely sure what you are asking here. Anyway, the purpose of IDF (inverse document frequency) is to dampen the score of very frequent terms and boost the score of infrequent terms.

In your collection of two documents, the IDF of "JosAH" will be 0, since it occurs in all documents.
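
As a worked check of that value, assuming the common definition idf(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing t (the symbols N and df are notation introduced here for illustration):

```latex
\mathrm{idf}(\text{JosAH}) = \log\frac{N}{\mathrm{df}(\text{JosAH})} = \log\frac{2}{2} = \log 1 = 0
```

The result is 0 whatever the base of the logarithm, so tf * idf for "JosAH" is also 0 in this two-document collection.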

Alex Brasetvik
Thanks Alex, let me explain my question. Within one document I can calculate the term frequency by counting how many times a word occurs. But for IDF, should I only check whether the term occurs in the other documents or not, or should I also count how many times it occurs in them? If you still have any questions, please ask. Thanks
The mathematical definition of IDF should be given in your textbook. Quoting Wikipedia: The inverse document frequency is a measure of the general importance of the term (obtained by dividing the number of all documents by the number of documents containing the term, and then taking the logarithm of that quotient). So you need to know the *number of documents* the term occurs in, and the *total* number of documents. You do not need the number of occurrences per document, though.
Alex Brasetvik
Let's say we have somehow calculated TF/IDF for the term "JosAH" and its tf/idf = 0.232, but we want to measure the similarity of the full document with a second document. Do I have to calculate TF/IDF for each term and then sum the values to get the actual tf/idf? If I am wrong, please correct me.
A: 

The document frequency is 'the number of documents in the collection that contain a term' (from Introduction to Information Retrieval), so in your words the former option, 'just see if that term occurs'.
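
Since you are writing this in Java, here is a minimal sketch of that definition. It assumes each document is already tokenized into a list of lower-cased terms. The class and method names (`TfIdfSketch`, `documentFrequency`, `idf`, `tf`) are made up for illustration, and returning 0 for a term that occurs in no document is my own choice to avoid dividing by zero, not part of the definition.

```java
import java.util.Collections;
import java.util.List;

// Illustrative sketch only; all names here are hypothetical.
class TfIdfSketch {

    // Document frequency: the number of documents that contain the term at least once.
    // A document counts once, no matter how often the term occurs in it.
    static int documentFrequency(List<List<String>> docs, String term) {
        int df = 0;
        for (List<String> doc : docs) {
            if (doc.contains(term)) {
                df++;
            }
        }
        return df;
    }

    // idf = log(total documents / documents containing the term), natural log.
    // Assumption: return 0 when no document contains the term.
    static double idf(List<List<String>> docs, String term) {
        int df = documentFrequency(docs, term);
        if (df == 0) {
            return 0.0;
        }
        return Math.log((double) docs.size() / df);
    }

    // Term frequency as used in the question: occurrences divided by document length.
    static double tf(List<String> doc, String term) {
        if (doc.isEmpty()) {
            return 0.0;
        }
        return (double) Collections.frequency(doc, term) / doc.size();
    }
}
```

With the documents `["josah", "is", "great", "josah", "rocks"]` and `["josah", "codes"]` (the second one is invented for the example), `tf` of "josah" in the first document is 2/5, `documentFrequency` is 2, and `idf` is log(2/2) = 0. Only `tf` looks at the number of occurrences; `documentFrequency` just checks whether the term is present.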

Fabian Steeg