I am implementing SOLR free-text search for a project in which the searchable records will need to be added and deleted on a large scale every day.
Because of the scale, I need to make sure that the index does not grow larger than it has to.
On my test installation of SOLR, I index a set of 10 documents. Then I make a change to one of the documents and replace the document with the same ID in the index. This works correctly and behaves as expected when I search.
I am using this code to update the document:
getSolrServer().deleteById(document.getIndexId());
getSolrServer().add(document.getSolrInputDocument());
getSolrServer().commit();
What I noticed, though, is that the figures on the stats page for the SOLR server are not what I expect.
After the initial index, numDocs and maxDoc both equal 10, as expected. When I update the document, however, numDocs is still 10 (expected) but maxDoc is 11 (unexpected).
In the documentation I see this:
"maxDoc may be larger as the maxDoc count includes logically deleted documents that have not yet been removed from the index."
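To make the numbers concrete, here is a minimal sketch of how I read that sentence: the gap between the two figures should be the count of logically deleted documents still sitting in the index. This is plain Java, not a SOLR API; the class IndexStats and both method names are made up for illustration, and it assumes maxDoc counts live plus deleted documents.

```java
public class IndexStats {

    // Logically deleted documents still held in the index, assuming
    // maxDoc = live documents + deleted-but-not-yet-removed documents.
    static int deletedDocs(int maxDoc, int numDocs) {
        if (numDocs < 0 || maxDoc < numDocs) {
            throw new IllegalArgumentException("expected 0 <= numDocs <= maxDoc");
        }
        return maxDoc - numDocs;
    }

    // Share of the index taken up by deleted documents (0.0 for an empty index).
    static double deletedFraction(int maxDoc, int numDocs) {
        int deleted = deletedDocs(maxDoc, numDocs);
        return maxDoc == 0 ? 0.0 : (double) deleted / maxDoc;
    }

    public static void main(String[] args) {
        // Figures from my test installation: 10/10 after the initial index,
        // then numDocs = 10 and maxDoc = 11 after updating one document.
        System.out.println("deleted after initial index: " + deletedDocs(10, 10));
        System.out.println("deleted after one update: " + deletedDocs(11, 10));
    }
}
```

With my figures, one deleted document is still being counted after the update, which matches the single document I replaced.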
So the question is, how do I remove logically deleted documents from the index?
If these documents remain in the index, do I risk a performance penalty once this is running with a very large volume of documents?
Thanks :)