I'm looking for a full-text indexing package that is actively maintained (i.e. not a dead, end-of-life package) and that would ideally support:
- substring matches
- incremental updates
- line level results
Support for the following would also be welcome:
- boolean matches
- adjacency searches ("stringX found near stringY")
A little more detail about the situation: I currently have a 'grep on steroids' that searches through system log files stored in a central location, split by host and day, and updated continuously.
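To make the adjacency requirement concrete, here is a rough sketch of the semantics I have in mind, written as a plain function over a single log line. The name `near` and the `max_gap` parameter (measured in characters) are just illustrative assumptions; a real index would more likely measure distance in tokens.

```python
import re

def near(text: str, x: str, y: str, max_gap: int = 40) -> bool:
    """Return True if some occurrence of x and some occurrence of y
    are separated by at most max_gap characters, in either order."""
    xs = [m.start() for m in re.finditer(re.escape(x), text)]
    ys = [m.start() for m in re.finditer(re.escape(y), text)]
    for xi in xs:
        for yj in ys:
            if xi <= yj:
                gap = yj - (xi + len(x))   # x comes first
            else:
                gap = xi - (yj + len(y))   # y comes first
            if gap <= max_gap:             # overlapping matches give gap < 0
                return True
    return False
```

For example, `near(line, "sda1", "error", max_gap=20)` would match a line where the device name and the word "error" appear close together, regardless of which comes first.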
- approximately 40-80 GB of mixed compressed and raw files
- raw uncompressed data size: 350-500 GB
- 20,000+ files
A solution like Splunk would be ideal, but the price for our data change rate (2-4 GB/day) is outrageously high, even with educational-organization pricing.
I have used freeWAIS-sf in the past, and am currently using namazu for limited indexing of a small document set elsewhere.
I don't require spidering support; I can feed it a list of files to index, and they will all be on local disk.
The problem is that freeWAIS-sf appears to be essentially abandoned, and namazu has no line-level results, only per-file ones.
Any suggestions for products to use? One option I considered was to use something like namazu but split the files into chunks before indexing, then post-process the search results to reassemble them. That seems very hackish, though.
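For reference, the chunking workaround I have in mind looks roughly like the sketch below. It is indexer-agnostic: the helper names (`split_into_chunks`, `chunk_id`, `resolve_hits`), the `#L<n>` ID format, and the chunk size are all my own assumptions, and the actual indexing call is left out.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

def split_into_chunks(lines: Iterable[str],
                      chunk_size: int = 1000) -> Iterator[Tuple[int, List[str]]]:
    """Yield (first_line_number, chunk_lines); line numbers are 1-based."""
    it = iter(lines)
    first = 1
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield first, chunk
        first += len(chunk)

def chunk_id(path: str, first_line: int) -> str:
    """Document ID handed to the indexer; encodes where the chunk starts."""
    return f"{path}#L{first_line}"

def resolve_hits(doc_id: str, chunk_lines: List[str],
                 needle: str) -> List[Tuple[str, int, str]]:
    """Post-process one chunk-level hit into (path, line_number, line) results."""
    path, _, first = doc_id.rpartition("#L")
    base = int(first)
    return [(path, base + i, line)
            for i, line in enumerate(chunk_lines)
            if needle in line]
```

Each chunk would be indexed as its own document under `chunk_id(...)`, and every chunk the indexer returns still has to be re-scanned to recover the line numbers, which is the part that feels like a hack.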
EDIT
I'm also open to building multiple indexes as a way of doing incremental updates, even though I'd then have to aggregate the search results from each of them.
I can also live with a delay in indexing today's results; indexing doesn't have to be real-time.
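Aggregating across several indexes (say, one per day) could be as simple as the sketch below. It assumes each index is wrapped in a callable that takes a query string and returns `(day, host, line_number, text)` tuples; that interface and the `search_all` name are hypothetical, not any particular package's API.

```python
from typing import Callable, Iterable, List, Tuple

Hit = Tuple[str, str, int, str]          # (day, host, line_number, text)
Searcher = Callable[[str], List[Hit]]    # hypothetical per-index search wrapper

def search_all(indexes: Iterable[Searcher], query: str) -> List[Hit]:
    """Run the query against every index and merge the results.

    Duplicates are dropped, and hits are ordered by day, host, then line.
    """
    merged = set()
    for search in indexes:
        merged.update(search(query))
    return sorted(merged)

def make_fake_index(hits: List[Hit]) -> Searcher:
    """Stand-in for a real index: plain substring match over stored hits."""
    return lambda query: [h for h in hits if query in h[3]]
```

With one index per day, yesterday's and older indexes never need to be rebuilt; only the current day's index is refreshed on whatever delay is acceptable.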
EDIT
Solr appears to be quite a useful tool. However, it looks to have the same issue as namazu and the others: if I want file-level positions for the results, I basically have to work them out myself externally, or pre-split each file into chunks as I generate the XML to load into the index server. While this does provide a very structured way of doing it, if I have to do all of that myself, I'm back at the starting point.