What is the idiomatic way to delete old documents from a Lucene Index?

I have a date field (YYYYMMddhhmmss) on all of the documents, and I'd like to remove anything more than a day old (for example).

Should I perform a filtered search or enumerate through the IndexReader's documents?

I'm sure the question is the same regardless of which platform Lucene is running on.

Thanks!

+3  A: 

Searching for YYYYMMdd* should work, since the dates are currently stored as text strings. Once you have the results, you can use IndexReader.delete to remove the documents you no longer want. That seems to me the best way to achieve this.
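Because the dates are fixed-width text, string order matches chronological order, so both the YYYYMMdd prefix and an "older than a day" test reduce to plain string operations. Here is a minimal sketch using only the Java standard library; the class and method names are hypothetical, and it assumes the field is zero-padded with a 24-hour clock:

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class DateCutoff {
    // Assumption: the indexed field is zero-padded and uses a 24-hour clock.
    static final DateTimeFormatter STAMP = DateTimeFormatter.ofPattern("yyyyMMddHHmmss");
    static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Text to put in front of the '*' in a YYYYMMdd* prefix search for one day.
    public static String dayPrefix(LocalDate day) {
        return day.format(DAY);
    }

    // Field value that marks the boundary: anything smaller is too old.
    public static String cutoff(LocalDateTime now, long maxAgeDays) {
        return now.minusDays(maxAgeDays).format(STAMP);
    }

    // Lexicographic comparison is enough because every value has the same width.
    public static boolean isExpired(String fieldValue, String cutoff) {
        return fieldValue.compareTo(cutoff) < 0;
    }

    public static void main(String[] args) {
        LocalDateTime now = LocalDateTime.of(2010, 3, 15, 12, 0, 0);
        String c = cutoff(now, 1);
        System.out.println(c);                                      // 20100314120000
        System.out.println(dayPrefix(now.toLocalDate().minusDays(1))); // 20100314
        System.out.println(isExpired("20100313235959", c));         // true
        System.out.println(isExpired("20100315080000", c));         // false
    }
}
```

The cutoff string is the value you would compare the date field against, whichever Lucene query or filter you end up using for the search itself.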

synhershko
One problem I see with that approach is that I'll get a "TooManyClauses" exception when there are more than 1024 old documents.
Eric Nicholson
This really depends on your implementation. I would need to know the specifics, but as a general rule you could either avoid this exception for those searches, since they are maintenance-only anyway (by setting a higher max clause count), or make more specific searches (YYYYMMddhh* etc.). Again, it all depends on your environment and implementation.
synhershko
I ended up doing a slight variation of this, by using a MatchAllDocsQuery and a RangeFilter. Seems to be working OK so far...
Eric Nicholson
+2  A: 

You could try using low-level APIs of Lucene.

Get a term enumerator from the index, starting at the term "YYYY". Iterate over the term enumerator to get the terms. If a term's text doesn't start with the current date (or the previous date), call IndexReader.deleteDocuments(term) with that term.

Since you are not using a Query object, you will not get any search-related exceptions.
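The selection step above can be sketched without Lucene at all. In this rough illustration the date terms are a plain list of strings standing in for whatever the term enumerator yields; the class and method names are hypothetical, and each returned string is one you would then pass (wrapped in a term) to IndexReader.deleteDocuments:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExpiredTerms {
    static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Returns every term text that starts with neither today's nor yesterday's date.
    public static List<String> termsToDelete(Iterable<String> dateTerms, LocalDate today) {
        String keepToday = today.format(DAY);
        String keepYesterday = today.minusDays(1).format(DAY);
        List<String> doomed = new ArrayList<>();
        for (String text : dateTerms) {
            if (!text.startsWith(keepToday) && !text.startsWith(keepYesterday)) {
                doomed.add(text);
            }
        }
        return doomed;
    }

    public static void main(String[] args) {
        // Stand-in for the terms of the date field, in the order an enumerator would give them.
        List<String> terms = Arrays.asList("20100312101500", "20100314090000", "20100315113000");
        System.out.println(termsToDelete(terms, LocalDate.of(2010, 3, 15)));
        // prints [20100312101500]
    }
}
```

Note that this keeps everything from yesterday, so it is coarser than an exact 24-hour cutoff, and it would also select any term dated in the future.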

Shashikant Kore