Hi all, we have set up a Solr index containing 36 million documents (~1K-2K each), and we query for a maximum of 100 documents matching a single simple keyword. This works pretty fast, as we had hoped. However, if we add "&sort=createDate+desc" to the query (thus asking for the top 100 'newest' documents matching the query), it runs for a very, very long time and finally results in an OutOfMemoryException. From what I've understood from the manual, this is caused by the fact that Lucene needs to load all the distinct values for this field (createDate) into memory (the FieldCache, afaik) before it can execute the query. As the createDate field contains both date and time, the number of distinct values is pretty large. Also important to mention: we frequently update the index.
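For reference, a minimal sketch of the kind of query described, using the SolrJ client (the URL and field names are placeholders, not taken from the poster's setup):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class SortedQueryExample {
        public static void main(String[] args) throws Exception {
            // Placeholder URL for the actual Solr instance
            CommonsHttpSolrServer server =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery query = new SolrQuery("keyword");
            query.setRows(100);  // only the top 100 matches are wanted
            // This sort is what forces Lucene to populate the FieldCache
            // with every distinct createDate value in the index.
            query.addSortField("createDate", SolrQuery.ORDER.desc);

            QueryResponse response = server.query(query);
            System.out.println("Found: " + response.getResults().getNumFound());
        }
    }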

Perhaps someone can provide some insights and directions on how we can tune Lucene/Solr, or change our approach, so that query times become acceptable? Your input would be much appreciated! Thanks.

A:

The problem is that Lucene stores numbers as strings. There are some utilities that split the date into YYYY, MM, and DD and put them in different fields. That gives much better results.
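A rough sketch of that splitting approach with the pre-2.9 Lucene API (the field names here are illustrative, not part of any standard utility):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Index a createDate of 2009-11-23 as three separate keyword fields,
    // so each field has few distinct values (at most ~31 for the day).
    Document doc = new Document();
    doc.add(new Field("createDate_yyyy", "2009",
            Field.Store.NO, Field.Index.NOT_ANALYZED));
    doc.add(new Field("createDate_mm", "11",
            Field.Store.NO, Field.Index.NOT_ANALYZED));
    doc.add(new Field("createDate_dd", "23",
            Field.Store.NO, Field.Index.NOT_ANALYZED));

Sorting can then be expressed over the three fields in sequence (year, then month, then day), which keeps each FieldCache entry small.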

Newer versions of Lucene (2.9 onwards) support numeric fields, and the performance improvements are significant (a couple of orders of magnitude, IIRC). Check this article about numeric queries.
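For illustration, indexing the timestamp with Lucene 2.9's NumericField might look like this (a sketch; it assumes the date is available as a java.util.Date):

    import java.util.Date;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericField;

    // Store createDate as a trie-encoded long (milliseconds since epoch)
    // instead of a string; sorting and range queries then work on the
    // numeric representation, which is far cheaper to load and compare.
    Date createDate = new Date(); // placeholder value
    Document doc = new Document();
    doc.add(new NumericField("createDate", Field.Store.NO, true)
            .setLongValue(createDate.getTime()));

Range filters over such a field go through NumericRangeQuery, and sorting uses a SortField with the matching numeric type (SortField.LONG here).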

Shashikant Kore
Thanks for your input, Shashikant! Indeed, upgrading to Solr 1.4 (which is built on Lucene 2.9) made a great difference. The main advantage for us is that it maintains the FieldCache per segment and does not need to reload it after a commit for segments that haven't changed.
schuilr