tags:
views: 141
answers: 1

I am implementing SOLR for a free text search for a project where the records available to be searched will need to be added and deleted on a large scale every day.

Because of the scale I need to make sure that the size of the index is appropriate.

On my test installation of SOLR, I index a set of 10 documents. Then I make a change in one of the documents and want to replace the document with the same ID in the index. This works correctly and behaves as expected when I search.

I am using this code to update the document:

getSolrServer().deleteById(document.getIndexId());
getSolrServer().add(document.getSolrInputDocument());
getSolrServer().commit();
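(As an aside, and assuming the same `getSolrServer()` helper as in the snippet above: since Solr identifies documents by their uniqueKey field, adding a document whose ID already exists replaces the old version automatically, so the explicit delete step may be redundant. The replaced document is still only logically deleted, which is exactly why maxDocs can exceed numDocs. A sketch:)

```java
// Adding a document with an existing uniqueKey value replaces the
// old version in one step; no separate deleteById() is needed.
// The superseded copy remains logically deleted in the index until
// the segments are merged, so maxDocs may still exceed numDocs.
getSolrServer().add(document.getSolrInputDocument());
getSolrServer().commit();
```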

What I noticed, though, is that the figures on the SOLR server's stats page are not what I expect.

After the initial index, numDocs and maxDocs both equal 10 as expected. When I update the document however, numDocs is still equal to 10 (expected) but maxDocs equals 11 (unexpected).

When reading the documentation I see that

maxDoc may be larger as the maxDoc count includes logically deleted documents that have not yet been removed from the index.

So the question is, how do I remove logically deleted documents from the index?

If these documents still exist in the index do I run the risk of performance penalties when this is run with a very large volume of documents?

Thanks :)

+3  A: 

You have to optimize your index.

Note that an optimize is expensive; you probably should not run it more than once a day.
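With the SolrJ client from your question (assuming the same `getSolrServer()` helper), triggering the optimize looks roughly like this:

```java
// Merge the index segments and physically remove logically deleted
// documents; after this completes, maxDocs should equal numDocs again.
// optimize() blocks until the merge finishes and commits implicitly,
// so it is usually scheduled off-peak (e.g. a nightly job).
getSolrServer().optimize();
```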

Here is some more info on optimize:

http://www.lucidimagination.com/search/document/CDRG_ch06_6.3.1.3

http://wiki.apache.org/solr/SolrPerformanceFactors#Optimization_Considerations

Pascal Dimassimo
Thanks - that was exactly what I needed :)
Rachel