views: 169

answers: 2

I need some good references on using Hadoop for real-time systems, such as search with low response times. I know Hadoop has the overhead of HDFS, but what's the best way of doing this with Hadoop?

+4  A: 

Hadoop is completely the wrong tool for this kind of requirement. It is explicitly optimised for large batch jobs that run for several minutes up to hours or even days.

FWIW, HDFS has nothing to do with the overhead. It's the fact that Hadoop jobs deploy a jar file onto every node, set up a working area, start each job running, pass information via files between stages of the computation, communicate progress and status with the job runner, etc., etc.

Marcelo Cantos
Hmmm! Then what is the alternative for building a real-time search experience when heavy data processing is required for each query?
Akhil
Use a search engine like Lucene.
Marcelo Cantos
My code does use Lucene in the backend, but my data is very large and I do a lot of processing of documents in Lucene when a query comes in. This processing cannot be done beforehand, so it needs to be done in a distributed fashion.
Akhil
It might help to amend your question with more detail on what you are trying to do. In particular, Google doesn't touch its documents when servicing a query. What does your system do that requires more work than a Google search?
Marcelo Cantos
Alright! My documents are questions (FAQs, millions of them), and the task is to match an incoming noisy query to one of these questions (which I will call FAQs here). So in order to match the noisy query to one of the FAQs, I have to find the similarity between the user query (which is noisy) and the FAQ terms, which requires a lot of processing, like calculating the LCS, Levenshtein distance, synonym dictionary lookups, etc. In short, this is computationally expensive, as observed empirically. The FAQ database spans GBs of data (as it also contains the answers and some other info).
Akhil
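
The similarity computation described in this comment is the expensive per-query step, so here is a minimal Java sketch of one of the measures named, the Levenshtein (edit) distance, using the standard dynamic-programming recurrence. The class and method names are illustrative only; the rest of the matching pipeline (LCS, synonym lookups, scoring) is not shown.

    // Levenshtein (edit) distance via the standard DP recurrence;
    // O(m*n) time, O(n) extra space by keeping only two rows.
    public final class EditDistance {
        public static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j; // j insertions
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i; // i deletions
                for (int j = 1; j <= b.length(); j++) {
                    int subst = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                    curr[j] = Math.min(subst, Math.min(prev[j] + 1, curr[j - 1] + 1));
                }
                int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
            }
            return prev[b.length()];
        }

        public static void main(String[] args) {
            System.out.println(levenshtein("kitten", "sitting")); // prints 3
        }
    }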
+4  A: 

You need to provide a lot more information about the goals and challenges of your system to get good advice. Perhaps Hadoop is not what you need, and you just require some distributed-systems foo? (Oh, and are you totally sure you require a distributed system? There's an awful lot you can do with a replicated database on top of a couple of large-memory machines.)

Knowing nothing about your problem, I'll give you a few shot-in-the-dark attempts at answering.

  1. Take a look at HBase, which provides a structured, queryable datastore on top of HDFS, similar to Google's BigTable (see the sketch after this list). http://hadoop.apache.org/hbase/
  2. It could be that you just need some help with managing replication and sharding of data. Check out Gizzard, a middleware layer that does just that: http://github.com/twitter/gizzard
  3. Processing can always be done beforehand. If that means you materialize too much data, maybe something like Lucandra can help -- Lucene running on top of Cassandra as a backend? http://github.com/tjake/Lucandra
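
As a rough illustration of item 1, here is a minimal sketch of a keyed read against an HBase table using the classic Java client API. The "faq" table, the "info" column family, and the row-key scheme are all hypothetical, and the exact client API varies between HBase versions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FaqLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Hypothetical table layout: one row per FAQ, keyed by FAQ id,
            // with the answer stored in an "info" column family.
            HTable table = new HTable(conf, "faq");
            Get get = new Get(Bytes.toBytes("faq#12345"));
            Result result = table.get(get);
            byte[] answer = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("answer"));
            System.out.println(answer == null ? "(not found)" : Bytes.toString(answer));
            table.close();
        }
    }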

If you really, really need to do serious processing at query time, the way to do that is to run dedicated processes that perform the specific kinds of computation you need, and use something like Thrift to send computation requests and receive results back. Optimize them to keep all the needed data in memory. The process that receives the query itself can then do nothing more than break the problem into pieces, send the pieces to compute nodes, and collect the results. This sounds like Hadoop, but it isn't, because this setup is built for computing specific problems over pre-loaded data rather than serving as a generic computation model for arbitrary jobs. A rough sketch of the pattern follows.
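
Here is a minimal sketch of that scatter/gather pattern in Java, with a local ExecutorService standing in for the Thrift RPC layer. The ComputeNode interface and its scoreBestMatch method are hypothetical; in a real deployment the lambda would call a Thrift client stub against a remote node holding its shard of the data in memory.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;

    public class QueryCoordinator {
        // Hypothetical stand-in for a remote compute node; in a real system
        // this would be a Thrift client stub, and each node would keep its
        // shard of the FAQ data in memory.
        interface ComputeNode {
            double scoreBestMatch(String query) throws Exception;
        }

        private final List<ComputeNode> nodes;
        private final ExecutorService pool = Executors.newCachedThreadPool();

        QueryCoordinator(List<ComputeNode> nodes) { this.nodes = nodes; }

        // Scatter the query to every node in parallel, gather the partial
        // results, and keep the best score within a fixed latency budget.
        double handleQuery(String query) throws Exception {
            List<Future<Double>> futures = new ArrayList<>();
            for (ComputeNode node : nodes) {
                futures.add(pool.submit(() -> node.scoreBestMatch(query)));
            }
            double best = Double.NEGATIVE_INFINITY;
            for (Future<Double> f : futures) {
                best = Math.max(best, f.get(200, TimeUnit.MILLISECONDS));
            }
            return best;
        }
    }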

SquareCog
Cool! That seems like a list of good, new pointers for me. I will have a look at these.
Akhil