Hi all!

I need to measure how important a word is across the entire document collection indexed in a Lucene index. I want to extract some "representative words", that is, concepts that are common in the collection and can represent it as a whole; collection "keywords", so to speak. I did full-text indexing, and the only field I am using is the text content, because the document titles are mostly not representative (numbers, codes, etc.).

EDIT: I am reading the index, which contains maybe 60 documents:

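    // fReader is an open IndexReader (pre-4.0 Lucene API)
    int numDocs = fReader.numDocs();
    // Assumed: the term enumeration comes from the same reader
    TermEnum termEnum = fReader.terms();
    while (termEnum.next())
    {
        Term term = termEnum.term();
        double df = fReader.docFreq(term);

        TermDocs termDocs = fReader.termDocs(term);

        // HERE is what I mean when I say tf-idf is per document:
        while (termDocs.next())
        {
            double tf = termDocs.freq();
            // Calculate tf-idf for (term, termDocs.doc()) ...
        }

        termDocs.close();
    }
    termEnum.close();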

So I will get the tf-idf of this term, but separately for every document that we loop through. I do not need these results:

tfidf(term1, doc1);

tfidf(term1, doc2);

tfidf(term1, doc3); ... and so on.

I need some measure of the importance of this term in the collection. Intuitively, it would be something like "if term1 has a good tf-idf in 5 documents, then it is important".

But of course, something smarter :)

Thank you!!!

A: 

You can try opening the index using Luke and it gives you the top-ranked terms.

Mikos
@Mikos, I need those terms for my algorithm to do some analysis, so seeing them in Luke doesn't help. I need to implement that myself. Nevertheless, I am not sure you understand what I am asking. Even if I did not need those terms at runtime, I think the top terms in Luke are not what I need. Do you know what similarity function Luke uses to retrieve those top terms? If it is just frequency in the index (I think it is), that does not help in my case at all. :/
Julia
@Julia, you should elucidate your requirements further. Sorry that I didn't get it, but I suspect others would have the same issue, so please explain your business case further and I'll try to help. :-)
Mikos
@Mikos: I did edit with better explanation!
Julia
A: 

EDIT: I still do not get what you are trying to achieve. A high TF/IDF value means that this term is useful for differentiating this document from the rest of the collection, that is: this term is relatively more frequent in the specific document than in the collection in general. Therefore it "represents" the document against the collection background. Is this what you want?

One possible way to rephrase your question is that you want to compress the collection, using a few high-frequency terms. This means words that appear a lot in the collection, and can be done by taking the words that have a low idf.
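
As a rough sketch of the low-idf idea, assuming each term's document frequency has already been collected into a map (for example from `docFreq` while enumerating the terms). The class and method names are made up for illustration, and idf is taken as the plain log(N/df), which is not necessarily the exact formula Lucene's own scoring uses.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class LowIdfTerms {
    // Plain idf: log(N / df). A term present in every document gets 0.
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    // Returns up to k terms with the lowest idf (ties broken alphabetically).
    static List<String> lowestIdf(Map<String, Integer> docFreqs, int numDocs, int k) {
        List<String> terms = new ArrayList<>(docFreqs.keySet());
        terms.sort(Comparator
                .comparingDouble((String t) -> idf(numDocs, docFreqs.get(t)))
                .thenComparing(Comparator.naturalOrder()));
        return new ArrayList<>(terms.subList(0, Math.min(k, terms.size())));
    }
}
```

Without stop-word removal at indexing time, the top of such a list will mostly be function words, so some filtering is probably needed before the result is useful as "keywords".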

Another alternative is that you want some concise way to represent the collection against a more general background, say a larger collection or the whole WWW. In that case, you want to compare word frequencies between the collections, consider the mutual information between the word type and the collection, or use other feature selection methods.

If I am still missing your point, please say so.
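
A minimal sketch of the simplest of these options, comparing relative word frequencies between the collection and a background corpus. It assumes raw term counts for both are available as maps; the names are hypothetical, and the add-one smoothing is just one simple choice to avoid dividing by zero for terms missing from the background.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
class CollectionContrast {
    // For each term in the collection: log( P(term | collection) / P(term | background) ),
    // with add-one smoothing over the joint vocabulary.
    static Map<String, Double> logRatios(Map<String, Long> collection,
                                         Map<String, Long> background) {
        long collTotal = 0;
        for (long c : collection.values()) collTotal += c;
        long bgTotal = 0;
        for (long c : background.values()) bgTotal += c;

        Set<String> vocab = new HashSet<>(collection.keySet());
        vocab.addAll(background.keySet());
        int v = vocab.size();

        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Long> e : collection.entrySet()) {
            double pColl = (e.getValue() + 1.0) / (collTotal + v);
            double pBg = (background.getOrDefault(e.getKey(), 0L) + 1.0) / (bgTotal + v);
            scores.put(e.getKey(), Math.log(pColl / pBg));
        }
        return scores;
    }
}
```

Terms with a large positive score are over-represented in the collection relative to the background; terms that are equally common in both score close to zero.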

Yuval F
@Yuval F: I did edit with better explanation!
Julia
@Julia: I edited my answer. Hope it is clearer and to the point.
Yuval F
A: 

"So, if I calculate tf-idf, it gives me the importance of a single term with respect to a single document."

Not true. IDF is measured globally across the entire corpus. The whole point of IDF is to provide a simple measure of exactly what you're looking for -- how "important" a term is.

So an easy way of doing what you ask is to find the most frequently occurring terms in the corpus, and weight them by document frequency.
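
One possible reading of this, sketched in plain Java: score each term by its total number of occurrences in the corpus multiplied by log(N/df). This is an assumption about what "weight them by document frequency" means, not a formula taken from Lucene, and the class and method names are made up. The inputs are assumed to have been gathered while walking the index (summing `freq()` per term and reading `docFreq`).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class CorpusTermWeight {
    // Collection-level score: total occurrences damped by how widespread the term is.
    static double score(long totalFreq, int docFreq, int numDocs) {
        return totalFreq * Math.log((double) numDocs / docFreq);
    }

    // Returns up to k terms with the highest score.
    static List<String> topTerms(Map<String, Long> totalFreqs,
                                 Map<String, Integer> docFreqs,
                                 int numDocs, int k) {
        List<String> terms = new ArrayList<>(totalFreqs.keySet());
        terms.sort(Comparator.comparingDouble(
                (String t) -> score(totalFreqs.get(t), docFreqs.get(t), numDocs)).reversed());
        return new ArrayList<>(terms.subList(0, Math.min(k, terms.size())));
    }
}
```

With this weighting a term that occurs in every document scores zero no matter how frequent it is, which removes stop words but also removes genuinely ubiquitous topic words; whether that is desirable depends on the collection.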

bajafresh4life
A: 

The contrib/ folder has a class to generate a list of the most frequent terms: http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/contrib/misc/src/java/org/apache/lucene/misc/HighFreqTerms.java

If you're instead looking for semantic feature extraction, you can check out http://project.carrot2.org/

Xodarap