ansaurus

Question

Answer 1

A:

You can try opening the index in Luke; it shows you the top-ranked terms.

Mikos 2010-07-23 06:00:41

@Mikos, I need those terms for my algorithm to do some analysis, so seeing them in Luke doesn't help; I need to implement that myself. Nevertheless, I am not sure you understand what I am asking. Even if I did not need those terms at runtime, I think the top terms in Luke are not what I need. Do you know what similarity function Luke uses to retrieve those top terms? If it is just frequency in the index (I think it is), that does not help in my case at all. :/

Julia 2010-07-23 07:13:17

@Julia, you should elucidate your requirements further. Sorry that I didn't get it, but I suspect others would have the same problem, so please explain your business case further and I'll try to help. :-)

Mikos 2010-07-23 20:04:09

@Mikos: I edited the question with a better explanation!

Julia 2010-07-26 03:39:45

Answer 2

A:

EDIT: I still do not get what you are trying to achieve. A high TF/IDF value means that this term is useful for differentiating this document from the rest of the collection, that is: this term is relatively more frequent in the specific document than in the collection in general. Therefore it "represents" the document against the collection background. Is this what you want?

One possible way to rephrase your question is that you want to compress the collection, using a few high-frequency terms. This means words that appear a lot in the collection, and it can be done by taking the words with low IDF, that is, words that occur in many of its documents.
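As a rough illustration of picking low-IDF words, here is a minimal sketch in plain Java over an in-memory list of tokenized documents. It does not use the Lucene API; the class name `LowIdfTerms`, its methods, and the formula idf(t) = ln(N / df(t)) are assumptions made for the example, and Lucene's own similarity uses a slightly different variant.

```java
import java.util.*;

// Hypothetical helper for illustration only; not part of Lucene.
class LowIdfTerms {

    // idf(t) = ln(N / df(t)), where df(t) is the number of documents containing t.
    static Map<String, Double> idf(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // A set ensures each term is counted once per document.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log((double) docs.size() / e.getValue()));
        }
        return idf;
    }

    // The k terms with the lowest idf; ties are broken alphabetically.
    static List<String> lowestIdf(List<List<String>> docs, int k) {
        Map<String, Double> idf = idf(docs);
        List<String> terms = new ArrayList<>(idf.keySet());
        terms.sort(Comparator.comparingDouble((String t) -> idf.get(t))
                .thenComparing(Comparator.<String>naturalOrder()));
        return new ArrayList<>(terms.subList(0, Math.min(k, terms.size())));
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("lucene", "index", "term"),
                Arrays.asList("lucene", "index", "search"),
                Arrays.asList("lucene", "score"));
        System.out.println(lowestIdf(docs, 2));
    }
}
```

With a real index, the document frequencies would come from the index itself rather than from a pass over token lists.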

Another alternative is that you want some concise way to represent the collection against a more general background, say a larger collection or the whole WWW. In that case, you want to compare word frequency between collections, consider the mutual information between the word type and the collection, or other feature selection methods.
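One simple way to compare word frequency between collections is a smoothed log ratio of relative frequencies. The sketch below is plain Java, not Lucene code; the class `CollectionContrast`, the add-one smoothing, and the flat token lists are assumptions chosen for brevity. Mutual information or another feature selection method would replace the scoring line.

```java
import java.util.*;

// Hypothetical helper for illustration only; not part of Lucene.
class CollectionContrast {

    // Raw term counts over a flat list of tokens.
    static Map<String, Integer> counts(List<String> tokens) {
        Map<String, Integer> c = new HashMap<>();
        for (String t : tokens) {
            c.merge(t, 1, Integer::sum);
        }
        return c;
    }

    // score(t) = ln(p_collection(t) / p_background(t)), with add-one smoothing
    // so that terms missing from one side do not cause division by zero.
    static Map<String, Double> logRatio(List<String> collection, List<String> background) {
        Map<String, Integer> fg = counts(collection);
        Map<String, Integer> bg = counts(background);
        Set<String> vocab = new HashSet<>(fg.keySet());
        vocab.addAll(bg.keySet());
        double fgTotal = collection.size() + vocab.size();
        double bgTotal = background.size() + vocab.size();
        Map<String, Double> scores = new HashMap<>();
        for (String t : vocab) {
            double pFg = (fg.getOrDefault(t, 0) + 1) / fgTotal;
            double pBg = (bg.getOrDefault(t, 0) + 1) / bgTotal;
            scores.put(t, Math.log(pFg / pBg));
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> collection = Arrays.asList("lucene", "lucene", "index", "the");
        List<String> background = Arrays.asList("the", "the", "the", "cat", "index");
        System.out.println(logRatio(collection, background));
    }
}
```

Terms with a positive score are relatively more frequent in the collection than in the background; terms with a negative score are more typical of the background.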

If I still miss your point, please say so.

Yuval F 2010-07-25 09:06:00

@Yuval F: I edited the question with a better explanation!

Julia 2010-07-26 03:40:14

@Julia: I edited my answer. Hope it is clearer and to the point.

Yuval F 2010-07-26 10:28:55

Answer 3

A:

"So, if I calculate tf-idf, it gives me the importance of a single term with respect to a single document."

Not true. IDF is measured globally across the entire corpus. The whole point of IDF is to provide a simple measure of exactly what you're looking for -- how "important" a term is.

So an easy way of doing what you ask is to find the most frequently occurring terms in the corpus, and weight them by document frequency.
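The answer does not say exactly how to weight by document frequency. One possible reading, sketched below in plain Java without the Lucene API, multiplies each term's total count in the corpus by ln(N / df). The class `CorpusTermWeights` and this particular formula are assumptions for illustration, not the answerer's stated method.

```java
import java.util.*;

// Hypothetical helper for illustration only; not part of Lucene.
class CorpusTermWeights {

    // weight(t) = cf(t) * ln(N / df(t)), where cf is the total number of
    // occurrences in the corpus and df the number of documents containing t.
    // A term that occurs in every document therefore gets weight 0.
    static Map<String, Double> weights(List<List<String>> docs) {
        Map<String, Integer> cf = new HashMap<>();
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                cf.merge(term, 1, Integer::sum);
            }
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> weights = new HashMap<>();
        for (String term : cf.keySet()) {
            double idf = Math.log((double) docs.size() / df.get(term));
            weights.put(term, cf.get(term) * idf);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("lucene", "index", "index", "term"),
                Arrays.asList("lucene", "index", "search"),
                Arrays.asList("lucene", "score"));
        System.out.println(weights(docs));
    }
}
```

Sorting the resulting map by value in descending order gives a ranked list of corpus-level terms under this assumed weighting.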

bajafresh4life 2010-07-25 21:41:11

Answer 4

A:

The contrib/ folder has a class to generate a list of the most frequent terms: http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/contrib/misc/src/java/org/apache/lucene/misc/HighFreqTerms.java

If you're instead looking for semantic feature extraction, you can check out http://project.carrot2.org/

Xodarap 2010-07-27 14:31:53

Word importance in lucene index
