Our project requires near-real-time search and constant updating. The data is currently stored in a MySQL database, and the Lucene index is updated as the database is modified.
The search capability is currently where we want it. However, we are trying to add the ability to "tag" documents in the index/database. Since the data sets can contain millions of records, we don't want to update the Lucene index for tagging (though if there is a way to mass-update Lucene, that might work too). Instead, we have a table of document IDs in MySQL that we would like to use to determine the tag sets.
The best option I've found so far is to retrieve both lists of IDs as integer arrays, sort them (so I only need to loop through once), then walk through both and look for matches between the two. This isn't ideal, since sorting by ID may lose the original ordering of the search results.
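For reference, the merge step described above looks roughly like this. It is a minimal sketch, not our actual code: the class and method names are placeholders, and it assumes the IDs are plain ints with no duplicates inside either list.

```java
import java.util.Arrays;

public class IdIntersect {
    // Returns the IDs present in both arrays, in ascending ID order.
    // Both inputs are sorted in place, so any relevance ordering
    // the Lucene results had is lost.
    static int[] intersect(int[] luceneIds, int[] tagIds) {
        Arrays.sort(luceneIds);
        Arrays.sort(tagIds);
        int[] out = new int[Math.min(luceneIds.length, tagIds.length)];
        int i = 0, j = 0, n = 0;
        // Single pass: advance whichever side holds the smaller ID.
        while (i < luceneIds.length && j < tagIds.length) {
            if (luceneIds[i] < tagIds[j]) {
                i++;
            } else if (luceneIds[i] > tagIds[j]) {
                j++;
            } else {
                out[n++] = luceneIds[i];
                i++;
                j++;
            }
        }
        return Arrays.copyOf(out, n);
    }
}
```

The cost is dominated by the two sorts; the matching loop itself touches each ID once.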
Attempting to use the list of Lucene IDs in an "IN" query in MySQL fails because the number of documents can be in the millions, and MySQL chokes on a list that size.
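A possible workaround, untested here, would be to split the ID list into batches and issue one smaller "IN" query per batch. The sketch below only builds the parameterized SQL strings; the table name `doc_tags`, the columns `doc_id` and `tag_id`, and the class name are hypothetical, and a suitable batch size is something that would have to be found by experiment.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class InBatches {
    // Builds one parameterized query per batch of IDs.
    // idCount is the total number of Lucene IDs; batchSize must be > 0.
    // doc_tags, doc_id and tag_id are hypothetical names.
    static List<String> buildQueries(int idCount, int batchSize) {
        List<String> queries = new ArrayList<>();
        for (int start = 0; start < idCount; start += batchSize) {
            int n = Math.min(batchSize, idCount - start);
            // One "?" placeholder per ID in this batch.
            String placeholders = String.join(",", Collections.nCopies(n, "?"));
            queries.add("SELECT doc_id FROM doc_tags WHERE tag_id = ? AND doc_id IN ("
                    + placeholders + ")");
        }
        return queries;
    }
}
```

Each string would then be bound through a `PreparedStatement` with the tag ID followed by that batch's document IDs. With millions of IDs this still means many round trips, so it may only move the bottleneck.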
Any insight into how we could optimize this, or a better way to do it altogether?
Another suggestion was a second index combined with a MultiSearcher, but I'm not sure how to go about that, since I would still need to update that index with possibly a million rows whenever a tag set is updated or deleted.