Hi all, I am using Luke to view a Lucene index. There is a column named 'Rank'. What does it actually mean? My guess is that Rank is the number of occurrences, and that a larger Rank means the term is more significant. What I don't understand is how that matters for full-text search: if I search for 'apple', every document containing 'apple' is returned, regardless of the Rank 'apple' has. Is my understanding wrong? If not, what is the Rank column actually used for?

When I inspect the index, there seems to be quite a lot of 'noise' in it; for example, the single character 'o' has a very high Rank. Does this mean the index is bad? How should I fix it? Thanks in advance.

+1  A: 

'Rank' is the frequency of a term within a field. A high Rank does not mean the term is more significant. In fact, the least frequent terms are often the most significant ones in an index. Still, knowing the most frequent terms in your index is sometimes important for analysis or debugging (see this question for example).
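To make the point about frequency versus significance concrete, here is a minimal sketch in plain Python. It does not use Lucene or Luke; the helper names `document_frequencies` and `idf`, the sample documents, and the simple whitespace tokenization are all assumptions made for illustration. It counts how many documents contain each term and then computes a basic inverse document frequency, which is one common way to express that rarer terms tend to be more discriminating.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """Count, for each term, how many documents contain it (hypothetical helper)."""
    df = Counter()
    for doc in docs:
        # set() so that a term is counted once per document
        df.update(set(doc.lower().split()))
    return df

def idf(term, df, n_docs):
    """Basic inverse document frequency: higher for rarer terms."""
    return math.log(n_docs / df[term])

# Made-up sample corpus, for illustration only
docs = ["the apple is red", "the sky is blue", "the apple pie", "the end"]
df = document_frequencies(docs)

for term in ("the", "apple", "pie"):
    print(term, df[term], round(idf(term, df, len(docs)), 3))
```

In this toy corpus 'the' appears in every document, so it has the highest frequency and the lowest weight, while 'pie' appears once and gets the highest weight. A frequency listing is still useful for spotting such very common terms.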

The fact that you have a lot of terms like 'o' does not mean your index is bad. Check the tokenizer and analyzer used for indexing. Some tokenizers split words at punctuation marks. Some analyzers stem words, which can yield single-letter terms. There are many reasons that can explain the presence of single-letter terms.
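As an illustration of how splitting at punctuation can produce terms like 'o', here is a small sketch in plain Python. The function `split_on_non_letters` is a hypothetical stand-in, not one of Lucene's tokenizers, and the sample sentence is made up; it only shows the general effect.

```python
import re

def split_on_non_letters(text):
    """Lowercase the text and split it at every run of non-letter characters."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

# Apostrophes and hyphens act as split points here
tokens = split_on_non_letters("It's five o'clock; e-mail Tim O'Brien.")
single_letters = [t for t in tokens if len(t) == 1]

print(tokens)
print(single_letters)
```

With this kind of splitting, "o'clock" and "O'Brien" each contribute an 'o', and "It's" and "e-mail" contribute 's' and 'e', so single-letter terms can accumulate quickly in English text.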

If you see a lot of undesirable terms in your index, consider using a stop-word filter at index time. Lucene provides this functionality.
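The idea behind a stop-word filter can be sketched in a few lines of plain Python. This is not the Lucene API (in Lucene the filtering is done by the analysis chain, for example with its `StopFilter` class); the `STOP_WORDS` set and the `remove_stop_words` function below are assumptions chosen for illustration.

```python
# Hypothetical stop list; a real one would be tuned to the content being indexed
STOP_WORDS = {"a", "an", "the", "is", "of", "o", "s"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop list before it is indexed."""
    return [t for t in tokens if t not in stop_words]

filtered = remove_stop_words(["the", "o", "clock", "is", "a", "gift"])
print(filtered)
```

Only 'clock' and 'gift' survive the filter. Keep in mind that terms removed at index time can no longer be searched, so the same filtering normally has to be applied to queries as well.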

Pascal Dimassimo