I want to know the number of terms for each document in a Lucene index. I've searched the API and the internet with no result. Can you help me?

+1  A: 

Lucene is built to answer the opposite question, that is, which documents contain a given term. So in order to get the number of terms for a document, you have to hack a bit.

A first method is to store the term vector for each field whose number of terms you need to retrieve. The term vector is the list of terms of the field, each with its frequency in that field. At search time, you can retrieve it using the getTermFreqVector method of IndexReader (only if it was stored at index time). Once you have it, the length of the vector gives you the number of distinct terms for that field; if you want to count repeated terms too, sum the frequencies.
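To make the counting step a bit more concrete, here is a minimal sketch in plain Java, with no Lucene on the classpath. It assumes the two arrays stand in for what a term vector's getTerms() and getTermFrequencies() methods would give you; the class and method names are made up for illustration.

```java
// Hypothetical helper: the arrays are stand-ins for the contents of a
// stored term vector (one entry per distinct term, plus its frequency).
final class TermVectorCounts {

    // Number of distinct terms: simply the length of the vector.
    static int distinctTerms(String[] terms) {
        return terms.length;
    }

    // Number of term occurrences: the sum of the per-term frequencies.
    static int totalTerms(int[] frequencies) {
        int sum = 0;
        for (int f : frequencies) {
            sum += f;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Example vector for the text "the brown fox the".
        String[] terms = {"brown", "fox", "the"};
        int[] freqs = {1, 1, 2};
        System.out.println("distinct = " + distinctTerms(terms));
        System.out.println("total    = " + totalTerms(freqs));
    }
}
```

Which of the two numbers you want depends on what "number of terms" means for your use case.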

Another method, if you have stored the fields of your documents, is to get back the text of those fields and count the number of terms by analyzing it (splitting the text into words).
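As a rough sketch of that second method, the following plain Java counts words in a stored field's text. It assumes a very naive analysis (lowercase, then split on anything that is not a letter or digit), so the result can differ from what your real analyzer produced at index time, for example if it removes stop words or stems. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper that approximates an analyzer with a simple split.
final class NaiveTermCounter {

    // Lowercase the text and split on runs of non-letter, non-digit characters.
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    // Counts every occurrence of every token.
    static int totalTerms(String text) {
        int count = 0;
        for (String token : tokenize(text)) {
            if (!token.isEmpty()) { // split can yield an empty leading token
                count++;
            }
        }
        return count;
    }

    // Counts each distinct token once.
    static int distinctTerms(String text) {
        Set<String> seen = new HashSet<>();
        for (String token : tokenize(text)) {
            if (!token.isEmpty()) {
                seen.add(token);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        String stored = "The quick brown fox jumps over the lazy dog";
        System.out.println("total    = " + totalTerms(stored));
        System.out.println("distinct = " + distinctTerms(stored));
    }
}
```

For an exact count you would run the stored text through the same analyzer you used at index time instead of this split.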

Last, if an approximation of the number of terms of a field is enough for you and you stored the norms at index time, there is the option of computing the inverse of the function used to compute the norms of a field. If you look closely at the lengthNorm method of the Similarity class, you will notice that it uses the number of terms of a field. The result of this method is stored in the index using the encodeNorm method. You can then, at search time, retrieve the norms using the norms method of IndexReader. With the norm in hand, use the mathematical inverse of the function used in lengthNorm to get back the number of terms. Like I said, it is only an approximation, because some precision is lost when the norm is stored, and you might not get back exactly the number of terms that went in.
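Here is a small sketch of that inversion in plain Java. It assumes the default similarity, where lengthNorm is 1 / sqrt(numTerms), and it assumes no field or document boosts were folded into the norm; if you use a custom Similarity or boosts, the inverse will be different. It also expects a norm that has already been decoded to a float. The class and method names are invented for the example.

```java
// Hypothetical helper, assuming lengthNorm(numTerms) = 1 / sqrt(numTerms)
// and that no boosts were multiplied into the stored norm.
final class NormInverter {

    // Same formula as the assumed default lengthNorm, for reference.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Inverse of the formula above: numTerms = 1 / norm^2, rounded.
    static int estimateNumTerms(float decodedNorm) {
        if (decodedNorm <= 0f) {
            throw new IllegalArgumentException("norm must be positive");
        }
        double norm = decodedNorm;
        return (int) Math.round(1.0 / (norm * norm));
    }

    public static void main(String[] args) {
        // A norm of 0.5 corresponds to 4 terms under the assumed formula.
        System.out.println(estimateNumTerms(0.5f));
        // Round trip without the lossy byte encoding.
        System.out.println(estimateNumTerms(lengthNorm(10)));
    }
}
```

Keep in mind that a norm read back from the index has gone through the lossy encoding, so the estimate gets coarser as fields get longer.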

Pascal Dimassimo
+1  A: 

This is actually kind of difficult to do in Lucene if you did not store term vectors at index time. Lucene's underlying data structure is an inverted index, which stores terms as keys and document ID lists as values. That's why there isn't a "getNumTerms()" method in the API: the fundamental data structures that Lucene employs don't support it.

That being said, you can store term vectors in the index, which you can look up by document ID at search time. These vectors are essentially a complete list of all the terms in that document, which you can then count to get your number of terms.

See

http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/document/Field.TermVector.html

Alternatively, you can count all the terms beforehand and store the count as a field in your index.

bajafresh4life
+1 storing the number of terms at index time is a nice idea
Pascal Dimassimo