ansaurus

Question

Any good way to handle repeats when using Lucene indexing?

Answer 1

+3 A:

Repeated terms may affect search performance by forcing the scorer to consider a large set of documents. If some terms do not discriminate well between documents, I suggest preprocessing the documents to remove them. However, you may want to start by indexing everything (say, for a sample of 10,000-20,000 documents) and see how you fare with regard to relevance and performance.
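To illustrate the preprocessing idea, here is a minimal sketch in plain Java (no Lucene calls) that counts in how many documents each term appears and drops the terms that show up in most of them. The class name `CommonTermFilter`, its method names, the sample data, and the 0.8 cutoff are all assumptions made for this example, not part of Lucene or of the original question; a suitable threshold would have to be found by experiment.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: finds terms that occur in too many documents to be useful.
public class CommonTermFilter {

    /** Counts, for each term, the number of documents that contain it at least once. */
    public static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // A set ensures a term repeated within one document is counted once.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Returns the terms that occur in more than maxDocRatio of all documents. */
    public static Set<String> commonTerms(List<List<String>> docs, double maxDocRatio) {
        Set<String> common = new HashSet<>();
        for (Map.Entry<String, Integer> e : documentFrequencies(docs).entrySet()) {
            if (e.getValue() > maxDocRatio * docs.size()) {
                common.add(e.getKey());
            }
        }
        return common;
    }

    /** Removes the given terms from a document, keeping the order of the rest. */
    public static List<String> strip(List<String> doc, Set<String> toRemove) {
        List<String> kept = new ArrayList<>();
        for (String term : doc) {
            if (!toRemove.contains(term)) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up keyword lists standing in for the real documents.
        List<List<String>> docs = List.of(
            List.of("music", "rock", "guitar", "music"),
            List.of("music", "jazz", "piano"),
            List.of("music", "rock", "drums"));

        // Assumed cutoff: drop terms present in more than 80% of documents.
        Set<String> common = commonTerms(docs, 0.8);
        for (List<String> doc : docs) {
            System.out.println(strip(doc, common));
        }
    }
}
```

The resulting set of common terms could also be handed to an analyzer as a stop-word list instead of stripping the text by hand, which would keep the filtering inside the indexing pipeline.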

From the way you describe this, you will need to index the category, track and keywords fields, maybe using a KeywordAnalyzer for the category and track fields. You only need to store the id field. You may want a custom analyzer for the keywords field, or alternatively to preprocess it before the actual indexing.

Yuval F 2010-07-12 13:20:15

+1 for trying to index everything first and optimizing later. 2GB is not that much data, and Lucene is pretty fast.

bajafresh4life 2010-07-12 13:37:14

+1 and I second bajafresh4life's comment

Pascal Dimassimo 2010-07-12 15:25:18
