I'm looking at the need to import a lot of data in real time into a Lucene index. The data consists of files in various formats (DOC, DOCX, PDF, etc.).

The data arrives as batches of compressed files, so each batch needs to be decompressed, each file indexed individually, and each indexed file somehow related back to the batch it came from.

I'm still trying to figure out how to accomplish this, but I think I can use Hadoop for the processing and the import into Lucene, and then use Solr as the web interface.

Am I overcomplicating things, since Solr can already process data? Because the CPU load for import is very high (due to preprocessing), I believe I need to separate importing from ordinary searching regardless of the implementation.

Q: "Please define a lot of data and realtime"

"A lot" of data is 1 Billion email messages per year (or more), with an average size of 1K, with attachments ranging from 1K to 20 Megs with a small amount of data ranging from 20 Megs to 200 Megs. These are typically attachments that need indexing referenced above.

"Realtime" means that the data is searchable within 30 minutes, or sooner, after it is ready for import.

SLA:

I'd like to provide an SLA of 15 seconds or less for search operations.

+2  A: 

If you need the processing done in real time (or near real time, for that matter), then Hadoop may not be the best choice for you: it is built for high-throughput batch jobs, and its job startup and scheduling overhead work against a tight freshness target.

Solr already handles all aspects of processing and indexing the files. I would stick with a Solr-only solution first. Solr allows you to scale to multiple machines, so if you find that the CPU load is too high because of the processing, then you can easily add more machines to handle the load.
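As a rough illustration, here is a minimal SolrJ sketch that posts a binary file to Solr's ExtractingRequestHandler (Solr Cell, which uses Tika to extract text from DOC/DOCX/PDF). The core name, URL, id values, and the batch_id field are assumptions; adjust them to your own schema:

    import java.io.File;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class ExtractPost {
        public static void main(String[] args) throws Exception {
            // Hypothetical Solr core "mail"; point this at your own instance.
            SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mail").build();

            // /update/extract is Solr Cell's default request handler path.
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File("attachment.pdf"), "application/pdf");
            req.setParam("literal.id", "msg-0001-att-1");   // unique document id
            req.setParam("literal.batch_id", "batch-0001"); // assumed field relating the doc to its batch
            req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

            solr.request(req);
            solr.close();
        }
    }

Note that with this approach the extraction work runs inside the Solr process itself, which is exactly the CPU load you may want to move off the search machines.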

bajafresh4life
+1  A: 

I suggest that you use Solr replication to ease the load: index on one (master) machine and serve searches from replica (slave) machines. As noted above, Hadoop is not suitable for real-time processing.
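For reference, master/slave replication is configured via the ReplicationHandler in each side's solrconfig.xml. This is a minimal sketch; the host name, core name, and poll interval are placeholders:

    <!-- On the indexing (master) machine -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
      </lst>
    </requestHandler>

    <!-- On each search (slave) machine -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://master-host:8983/solr/mail</str>
        <str name="pollInterval">00:05:00</str>
      </lst>
    </requestHandler>

With a poll interval of a few minutes, newly committed documents appear on the slaves well within the 30-minute freshness requirement.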

Yuval F
+1  A: 

1 billion documents per year translates to approximately 32 documents per second (10^9 documents / 31,536,000 seconds in a year ≈ 31.7), assuming a uniform arrival rate.

You could run text extraction on a separate machine and send only the indexable text to Solr. At this scale, I suppose you have to go for a multi-core Solr setup so that you can send indexable content to different cores; that should speed up indexing.
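A minimal sketch of that split, assuming Apache Tika for the extraction step and SolrJ for indexing (the core URL and the field names "id" and "text" are placeholders for your schema):

    import java.io.File;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.Tika;

    public class ExtractThenIndex {
        public static void main(String[] args) throws Exception {
            // Tika auto-detects the format (DOC, DOCX, PDF, ...) and returns plain text.
            // This is the CPU-heavy step and can run on a separate machine.
            String text = new Tika().parseToString(new File("attachment.docx"));

            // Send only the extracted text to Solr.
            SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/core1").build();
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "msg-0001-att-2");
            doc.addField("text", text);
            solr.add(doc);
            solr.commit();
            solr.close();
        }
    }

To spread indexing load, the extraction machines could route documents round-robin to different core URLs.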

I have done indexing of small structured documents in the range of 100 million without much trouble on a single core. You should be able to scale to a few hundred million documents with a single Solr instance. (The text extraction service could use another machine.)

Read about large-scale search on the HathiTrust blog for various challenges and solutions. They use Lucene/Solr.

Shashikant Kore