I need to import a large volume of data into a Lucene index in near real time. The data consists of files in various formats (DOC, DOCX, PDF, etc.).
The data arrives as batches of compressed files, so each batch will need to be decompressed, each file indexed individually, and each indexed file somehow related back to the batch as a whole.
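To make the batch relationship concrete, here is a minimal sketch (Python, standard library only; the field names and the Solr URL in the comment are assumptions, not part of any existing schema) that unpacks one compressed batch and builds one indexable document per file, each carrying a shared batch_id:

```python
import json
import uuid
import zipfile
from pathlib import Path

def batch_to_documents(archive_path):
    """Unpack one batch archive and build one document per file.

    Every document carries the same batch_id so search results can be
    grouped or filtered back to the originating batch.
    """
    batch_id = str(uuid.uuid4())  # or derive it from the archive name
    docs = []
    with zipfile.ZipFile(archive_path) as zf:
        for name in zf.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            raw = zf.read(name)
            docs.append({
                "id": f"{batch_id}/{name}",       # unique per file
                "batch_id": batch_id,             # shared across the batch
                "filename": Path(name).name,
                "size_bytes": len(raw),
                # a real pipeline would extract text here (e.g. with
                # Apache Tika) before handing the document to the indexer
            })
    return docs

# The resulting JSON array is roughly what you would POST to a Solr
# update handler, e.g. http://localhost:8983/solr/mail/update (assumed URL):
# payload = json.dumps(batch_to_documents("batch-0001.zip"))
```

Filtering a search by `batch_id` then recovers the whole batch, without storing any batch-level document at all.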
I'm still trying to figure out how to accomplish this, but I think I can use Hadoop for the preprocessing and the import into Lucene, and then use Solr as the web/search interface.
Am I overcomplicating things, since Solr can already ingest data? Since the CPU load for import is very high (due to preprocessing), I believe I need to separate importing from ordinary searching regardless of the implementation.
Q: "Please define a lot of data and realtime"
"A lot" of data is one billion email messages per year (or more), averaging 1 KB each, with attachments ranging from 1 KB to 20 MB and a small tail from 20 MB to 200 MB. The attachments are typically the items that need the format-specific indexing mentioned above.
Realtime means the data is searchable within 30 minutes or less of being ready for import.
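Some rough arithmetic on the volume above helps size the import side (a sketch of the steady average only; real traffic will be bursty, so peak capacity needs headroom over this figure):

```python
# Back-of-the-envelope indexing throughput for one billion messages/year.
MESSAGES_PER_YEAR = 1_000_000_000
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

avg_msgs_per_sec = MESSAGES_PER_YEAR / SECONDS_PER_YEAR
print(f"average: {avg_msgs_per_sec:.0f} messages/second")  # ~32/second

# The 30-minute freshness target means each batch must finish
# decompression + text extraction + indexing within 1800 seconds
# of arrival, on top of sustaining the ~32 msg/s average.
```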
SLA:
I'd like to provide a search SLA of 15 seconds or less for search operations.