I know the title might suggest it is a duplicate but I haven't been able to find the answer to this specific issue:

I have to filter search results based on a date range. Each document's date is stored (but not indexed) in the document itself. When using a Filter, I noticed that the filter is called with all the documents in the index, not just the ones that match the query.

This means the filter will get slower as the index grows (currently only ~300,000 documents in it) as it has to iterate through every single document.

I can't use RangeQuery since the date is not indexed.

How can I apply the filter only to the documents that match the query, AFTER the query has run, to make it more efficient?

I would prefer to do this before the results are handed to me, so that I don't mess up the scores and collectors I have.

+1  A: 

First, to filter on a field, it has to be indexed.

Second, using a Filter is considered the best way to restrict the set of documents to search on. One reason for this is that you can cache the filter results and reuse them for other queries. The filter data structure is also pretty efficient: it is a bit set of the documents matching the filter.
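To make the bit-set idea concrete, here is a minimal sketch in plain Java. It is not the Lucene or Lucene.net API: the class `BitSetFilterSketch` and its methods `bitsFor` and `restrict` are hypothetical names, and documents are represented only by their integer ids. It shows why the first evaluation of a filter visits every document, and why a later query that uses the same filter only pays for a cheap bit-set intersection.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntPredicate;

// Conceptual sketch only: this is NOT the Lucene or Lucene.net API.
final class BitSetFilterSketch {
    // Cached filter results, keyed by a caller-chosen name (hypothetical).
    private final Map<String, BitSet> cache = new HashMap<>();
    private final int maxDoc;

    BitSetFilterSketch(int maxDoc) {
        this.maxDoc = maxDoc;
    }

    // The first call for a key visits every document id once;
    // later calls with the same key return the cached bit set.
    BitSet bitsFor(String key, IntPredicate matches) {
        return cache.computeIfAbsent(key, k -> {
            BitSet bits = new BitSet(maxDoc);
            for (int doc = 0; doc < maxDoc; doc++) {
                if (matches.test(doc)) {
                    bits.set(doc);
                }
            }
            return bits;
        });
    }

    // Restricts a query's hits to the filtered subset without touching scores.
    static BitSet restrict(BitSet queryHits, BitSet filterBits) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(filterBits);
        return result;
    }

    public static void main(String[] args) {
        BitSetFilterSketch sketch = new BitSetFilterSketch(10);
        BitSet even = sketch.bitsFor("even", doc -> doc % 2 == 0);

        BitSet hits = new BitSet();
        hits.set(1);
        hits.set(2);
        hits.set(4);
        hits.set(7);

        System.out.println(restrict(hits, even)); // prints {2, 4}
    }
}
```

The up-front cost is one pass over the index per distinct filter. After that, the same bit set can serve any number of queries.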

But if you insist on not using filters, I think the only way is to use a boolean query to do the filtering.

Pascal Dimassimo
Is it a bit set for documents matching the filter (which means you have to yield all documents) or a bit set for the Terms matching the filter? I guess caching would be possible if it was on Terms.
Khash
It is a bit set of the documents matching the filter. It lets you search the same subset of documents again for another query when the same filter is used.
Pascal Dimassimo
+2  A: 

Not quite sure if this will help, but I had a similar problem to yours and came up with the following (+ notes):

  1. I think you're really going to have to index the date field. Nothing else makes any sense in terms of querying/filtering etc.
  2. In Lucene.net v2.9, range querying where there are lots of terms seems to have got terribly slow compared to v2.4.
  3. I fixed my speed issues when using date fields by switching to using a numeric field and numeric field queries. This actually gave me quite a speed boost over my Lucene.net v2.4 baseline.
  4. Wrapping your query in a caching wrapper filter means you can hang onto the document bit set for the filter. This will also dramatically speed up subsequent queries using the same filter.
  5. A filter doesn't play a part in the scoring for a set of query results.
  6. Joining your cached filter to the rest of your query (where I guess you've got your custom scores and collectors) means it should meet the final part of your criteria

So, to summarise: index your date fields as numeric fields; build your queries as numeric range queries; transform these into cached filter wrappers and hang onto them.
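As a rough illustration of that summary, here is a plain-Java sketch rather than actual Lucene.net code. The class `DateRangeFilterSketch`, its methods, and the use of epoch days as the numeric encoding are all assumptions made for the example. A `TreeMap` stands in for the indexed numeric field, so a range lookup only touches the values inside the range, and the resulting bit set is cached per range.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch only: NOT the Lucene.net numeric field or filter API.
final class DateRangeFilterSketch {
    // Stand-in for an indexed numeric field: encoded date -> document ids.
    private final TreeMap<Long, List<Integer>> index = new TreeMap<>();
    // Cached bit sets, keyed by the requested range.
    private final Map<String, BitSet> cache = new HashMap<>();

    // Encodes the date as a number (days since 1970-01-01) and indexes it.
    void add(int docId, LocalDate date) {
        index.computeIfAbsent(date.toEpochDay(), k -> new ArrayList<>()).add(docId);
        cache.clear(); // cached bit sets are stale once the index changes
    }

    // Returns the documents whose date lies in [from, to], both inclusive.
    BitSet range(LocalDate from, LocalDate to) {
        String key = from + ".." + to;
        return cache.computeIfAbsent(key, k -> {
            BitSet bits = new BitSet();
            // subMap walks only the entries inside the range, not the whole index.
            for (List<Integer> docs : index
                    .subMap(from.toEpochDay(), true, to.toEpochDay(), true)
                    .values()) {
                for (int doc : docs) {
                    bits.set(doc);
                }
            }
            return bits;
        });
    }

    public static void main(String[] args) {
        DateRangeFilterSketch sketch = new DateRangeFilterSketch();
        sketch.add(0, LocalDate.of(2010, 1, 15));
        sketch.add(1, LocalDate.of(2010, 3, 2));
        sketch.add(2, LocalDate.of(2010, 6, 30));

        BitSet firstQuarter =
                sketch.range(LocalDate.of(2010, 1, 1), LocalDate.of(2010, 3, 31));
        System.out.println(firstQuarter); // prints {0, 1}
    }
}
```

Note the `cache.clear()` in `add`: a cached filter is only valid for the state of the index it was built against, so whatever caching you use has to be refreshed when the index changes.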

I think you'll see some spectacular speedups over your current index usage.

Good luck!

p.s. I would never second guess what'll be fast or slow when using Lucene. I've always been surprised in both directions!

Moleski