I'm looking for a full-text indexing package that is actively maintained (i.e. not a dead, end-of-life package) and that would ideally support:

  • substring matches
  • incremental updates
  • line level results

Also ideal would be support for:

  • boolean matches
  • adjacency searches "stringX found near stringY"

A little more detail about the situation - I currently have a 'grep on steroids' that searches through system log files stored in a central location, split by host and day, updated continuously.

  • approximately 40-80 GB of mixed compressed and raw files
  • raw uncompressed data size - 350 - 500 GB
  • 20,000+ files

A solution like Splunk would be ideal, but pricing for our data change rate (2-4GB/day) - even with educational organization pricing - is outrageously high.

I have used freeWAIS-sf in the past, and am currently using namazu for limited indexing of a small document set elsewhere.

I don't require spidering support, I can feed it a list of files to index and they will all be on local disk.

Problem is - freeWAIS-sf appears to essentially be abandoned, and namazu doesn't have any line-level results - only by-file.

Any suggestions for products to use? One option I did consider was to use something like namazu, but to split the files before indexing into chunks and post-process search results to reassemble, but that seems very hackish.
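To make the chunk-and-reassemble idea concrete, here is a rough sketch of what I have in mind. It is not tied to namazu or any other indexer; the `Chunk` type, the `"<path>#<first_line>"` ID scheme, and the function names are all made up for illustration. Each chunk remembers the line number it starts at, so a chunk-level hit can be turned back into line-level results afterwards.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class Chunk:
    doc_id: str      # hypothetical ID scheme: "<path>#<first_line>"
    path: str
    first_line: int  # 1-based line number of the chunk's first line
    text: str


def split_into_chunks(path: str, lines: Iterable[str],
                      chunk_lines: int = 100) -> Iterator[Chunk]:
    """Yield consecutive chunks of at most chunk_lines lines each."""
    buf: List[str] = []
    first = 1
    for n, line in enumerate(lines, start=1):
        buf.append(line)
        if len(buf) == chunk_lines:
            yield Chunk(f"{path}#{first}", path, first, "".join(buf))
            buf = []
            first = n + 1
    if buf:  # trailing partial chunk
        yield Chunk(f"{path}#{first}", path, first, "".join(buf))


def matching_lines(chunk: Chunk, needle: str) -> List[Tuple[str, int, str]]:
    """Turn a chunk-level hit into (path, line_number, line) results."""
    return [
        (chunk.path, chunk.first_line + i, line)
        for i, line in enumerate(chunk.text.split("\n"))
        if needle in line
    ]
```

The indexer would be fed each chunk as its own document, and `matching_lines` would be run only on the chunks the index reports as hits. A match that spans a chunk boundary would be missed, which is part of why this feels hackish.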

EDIT

I'm open to building multiple indexes as well as a way of doing incremental updates - even though I'd have to aggregate the multiple search results.

I can also live with a delay on indexing for today's results; indexing doesn't have to be real-time.

EDIT

Solr appears to be quite a useful tool. However, it looks to have the same issue as namazu and the others: if I want file-level positions for the results, I basically have to do it myself externally, or pre-split each file into chunks as I generate the XML to load into the index server. While that is a very structured way of doing it, if I have to do all of that myself, I'm back at the starting point.
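For reference, the pre-splitting I would have to do myself looks roughly like the sketch below, taken to the extreme of one index document per log line. It emits Solr's XML update format (`<add><doc><field name="...">`); the field names `id`, `path`, `line_no`, and `text` are my own assumptions and would have to match whatever the Solr schema actually defines.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def lines_to_solr_add_xml(path: str, lines: Iterable[str],
                          first_line: int = 1) -> str:
    """Build a Solr <add> message with one <doc> per input line.

    Field names are illustrative assumptions, not a fixed Solr schema.
    """
    add = ET.Element("add")
    for n, line in enumerate(lines, start=first_line):
        doc = ET.SubElement(add, "doc")
        fields = (
            ("id", f"{path}:{n}"),   # unique key: file path plus line number
            ("path", path),
            ("line_no", str(n)),
            ("text", line.rstrip("\n")),
        )
        for name, value in fields:
            field = ET.SubElement(doc, "field", name=name)
            field.text = value       # ElementTree escapes <, >, & on output
    return ET.tostring(add, encoding="unicode")
```

With something like this, every hit would carry its own `path` and `line_no`, at the cost of a very large number of tiny documents in the index.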

+1  A: 

Check out SWISH-E; I believe it does everything you need.

Joe Casadonte
Swish doesn't appear to have line level results...
Nathan Neulinger
+2  A: 

Have you looked at Lucene and its derivatives like Solr?

Ray
Looking into Lucene now... Had seen it before, but not dug in deeply. Looks like it might fit the bill.
Nathan Neulinger
Same issue with Solr as with Swish and the others: no built-in support for file offsets. To get that, I'll have to index each file as multiple objects in the index.
Nathan Neulinger
A: 

Check out Sphinx. Open source, freely available, and high speed.

gms8994
That site appears to be all Chinese ... not so useful for an English speaker such as myself.
offby1
That should be sphinxsearch.com, not .org. No idea why I would have put that in like that ;)
gms8994