questions about tfidf

tf-idf and previously unseen terms

TF-IDF (term frequency - inverse document frequency) is a staple of information retrieval. It's not a proper model, though, and it seems to break down when new terms are introduced into the corpus. How do people handle queries or new documents that contain new terms, especially high-frequency ones? Under traditional cosine match...

algorithm

statistics

nlp

natural-language

tfidf

How to calculate IDF?

Thank you to everyone on this website who helped me with TF/IDF. It helped me a lot in writing a tf-idf function in Java. I have TF working, but I have one question. According to the wiki, IDF is calculated from how many documents contain the term, but I am confused. For example, here is the string "JosAH is great. JoshAH rocks", so the TF would be 2/5, and for IDF...

homework

tfidf

tf idf similarity problem

Hi all, I am using TF/IDF to calculate similarity. For example, suppose I have the following two docs. Doc A => cat dog. Doc B => dog sparrow. Normally their similarity would be 50%, but when I calculate TF/IDF I get the following. Tf values for Doc A: dog tf = 0.5, cat tf = 0.5. Tf values for Doc B: dog tf = 0.5, sparrow tf = 0.5. IDF values for D...
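The excerpt is cut off before the IDF values, so the following is only a minimal sketch of the two-document example, assuming the plain unsmoothed formula idf = log(N / df), which the excerpt does not confirm. The function names (`tf`, `idf`, `tfidf`, `cosine`) are hypothetical helpers written for illustration. Under that assumption "dog" appears in both documents, its IDF is log(2/2) = 0, and the only shared term contributes nothing to the cosine similarity.

```python
import math
from collections import Counter


def tf(doc):
    """Relative term frequency: count of term / number of tokens in doc."""
    counts = Counter(doc)
    return {term: count / len(doc) for term, count in counts.items()}


def idf(docs):
    """Unsmoothed IDF, log(N / df). This exact formula is an assumption."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf(doc, idf_weights):
    """Weight each term's TF by its corpus-level IDF."""
    return {term: weight * idf_weights[term] for term, weight in tf(doc).items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


doc_a = ["cat", "dog"]
doc_b = ["dog", "sparrow"]

idf_weights = idf([doc_a, doc_b])

# Similarity on raw TF vectors versus TF-IDF vectors.
tf_similarity = cosine(tf(doc_a), tf(doc_b))
tfidf_similarity = cosine(tfidf(doc_a, idf_weights), tfidf(doc_b, idf_weights))

print(f"TF only: {tf_similarity:.3f}")
print(f"TF-IDF:  {tfidf_similarity:.3f}")
```

With only two documents, any term they share necessarily occurs in every document, so this formula zeroes it out. A smoothed IDF variant or a larger corpus would give the shared term a nonzero weight.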

Lucene numDocs and docFreq on custom similarity class

Hi all, I'm building an application with Lucene (I'm a noob with it) and I'm facing some problems. My application uses the Lucene 2.4.0 library with a custom similarity implementation (the jar is imported). In my app I'm calculating docFreq and numDocs manually (I'm adding the values of all indexes and then I calculate a global value in order to u...

lucene

similarity

tfidf

tf-idf: Does using it help to weigh documents that share the terms higher than a document that doesn't?

Hi. I'm working on a customized search feature for a website, and I was curious whether using only tf-idf to rank documents in my corpus would also weight documents that contain multiple search terms higher than documents with only one search term. Example: Search = "poland spring water". Theoretically, would the above query weigh (u...

search

tfidf

How do I normalise a Solr/Lucene score?

Hi, I am trying to work out how to improve the scoring of Solr search results. My application needs to take the score from the Solr results and display a number of “stars” depending on how well each result matches the query: 5 stars for an exact or near-exact match, down to 0 stars for a poor match, e.g. only one element hits. ...