views: 169

answers: 2

I need some good references on using Hadoop for real-time systems, such as search with low response times. I know Hadoop has the overhead of HDFS, but what's the best way of doing this with Hadoop?

+4  A: 

Hadoop is completely the wrong tool for this kind of requirement. It is explicitly optimised for large batch jobs that run for several minutes up to hours or even days.

FWIW, HDFS has nothing to do with the overhead. It's the fact that Hadoop jobs deploy a jar file onto every node, set up a working area, start each job running, pass information via files between stages of the computation, communicate progress and status with the job runner, etc., etc.

Marcelo Cantos
Hmmm! Then what is the alternative for building a real-time search experience when heavy data processing is required for each query?
Akhil
Use a search engine like Lucene.
Marcelo Cantos
My code does use Lucene in the backend, but my data is very large and I do a lot of processing of documents in Lucene when a query comes in. This processing cannot be done beforehand, so it needs to be done in a distributed fashion.
Akhil
It might help to amend your question with more detail on what you are trying to do. In particular, Google doesn't touch its documents when servicing a query. What does your system do that requires more work than a Google search?
Marcelo Cantos
Alright! My documents are questions (FAQs, millions of them), and the task is to match an incoming noisy query to one of these questions (which I will call FAQs here). So in order to match the noisy query to one of the FAQs, I have to find the similarity between the user query (which is noisy) and the FAQ terms, which requires a lot of processing, like calculating the LCS, Levenshtein distance, synonym dictionary lookups, etc. In short, this is computationally expensive, as observed empirically. The FAQ database spans GBs of data (as it also contains the answers and some other info).
Akhil
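
The similarity computation described in this comment is the expensive per-query step, so here is a minimal Java sketch of one of the measures named, the Levenshtein (edit) distance, using the standard dynamic-programming recurrence. The class and method names are illustrative only; the rest of the matching pipeline (LCS, synonym lookups, scoring) is not shown.

    // Levenshtein (edit) distance via the standard DP recurrence;
    // O(m*n) time, O(n) extra space by keeping only two rows.
    public final class EditDistance {
        public static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j; // j insertions
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i; // i deletions
                for (int j = 1; j <= b.length(); j++) {
                    int subst = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                    curr[j] = Math.min(subst, Math.min(prev[j] + 1, curr[j - 1] + 1));
                }
                int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
            }
            return prev[b.length()];
        }

        public static void main(String[] args) {
            System.out.println(levenshtein("kitten", "sitting")); // prints 3
        }
    }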
+4  A: 

You need to provide a lot more information about the goals and challenges of your system to get good advice. Perhaps Hadoop is not what you need, and you just require some distributed-systems foo? (Oh, and are you totally sure you require a distributed system? There's an awful lot you can do with a replicated database on top of a couple of large-memory machines.)

Knowing nothing about your problem, I'll give you a few shot-in-the-dark attempts at answering.

  1. Take a look at HBase, which provides a structured, queryable datastore on top of HDFS, similar to Google's BigTable (see the sketch after this list). http://hadoop.apache.org/hbase/
  2. It could be that you just need some help with managing replication and sharding of data. Check out Gizzard, a middleware layer that does just that: http://github.com/twitter/gizzard
  3. Processing can always be done beforehand. If that means you materialize too much data, maybe something like Lucandra can help -- Lucene running on top of Cassandra as a backend? http://github.com/tjake/Lucandra
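
As a rough illustration of item 1, here is a minimal sketch of a keyed read against an HBase table using the classic Java client API. The "faq" table, the "info" column family, and the row-key scheme are all hypothetical, and the exact client API varies between HBase versions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FaqLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Hypothetical table layout: one row per FAQ, keyed by FAQ id,
            // with the answer stored in an "info" column family.
            HTable table = new HTable(conf, "faq");
            Get get = new Get(Bytes.toBytes("faq#12345"));
            Result result = table.get(get);
            byte[] answer = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("answer"));
            System.out.println(answer == null ? "(not found)" : Bytes.toString(answer));
            table.close();
        }
    }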

If you really, really need to do serious processing at query time, the way to do that is to run dedicated processes that perform the specific kinds of computation you need, and use something like Thrift to send computation requests and receive results back. Optimize them to keep all the needed data in memory. The process that receives the query itself can then do nothing more than break the problem into pieces, send the pieces to compute nodes, and collect the results. This sounds like Hadoop, but it isn't, because this setup is built for computing specific problems over pre-loaded data rather than serving as a generic computation model for arbitrary jobs. A rough sketch of the pattern follows.
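
Here is a minimal sketch of that scatter/gather pattern in Java, with a local ExecutorService standing in for the Thrift RPC layer. The ComputeNode interface and its scoreBestMatch method are hypothetical; in a real deployment the lambda would call a Thrift client stub against a remote node holding its shard of the data in memory.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;

    public class QueryCoordinator {
        // Hypothetical stand-in for a remote compute node; in a real system
        // this would be a Thrift client stub, and each node would keep its
        // shard of the FAQ data in memory.
        interface ComputeNode {
            double scoreBestMatch(String query) throws Exception;
        }

        private final List<ComputeNode> nodes;
        private final ExecutorService pool = Executors.newCachedThreadPool();

        QueryCoordinator(List<ComputeNode> nodes) { this.nodes = nodes; }

        // Scatter the query to every node in parallel, gather the partial
        // results, and keep the best score within a fixed latency budget.
        double handleQuery(String query) throws Exception {
            List<Future<Double>> futures = new ArrayList<>();
            for (ComputeNode node : nodes) {
                futures.add(pool.submit(() -> node.scoreBestMatch(query)));
            }
            double best = Double.NEGATIVE_INFINITY;
            for (Future<Double> f : futures) {
                best = Math.max(best, f.get(200, TimeUnit.MILLISECONDS));
            }
            return best;
        }
    }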

SquareCog
Cool! That seems like a list of good, new pointers for me. I will have a look at these.
Akhil