Hey. I am developing a web crawler. Which is better for storing its data: Cassandra, Hadoop Hive, or MySQL, and why? I have 1 TB of data from the past 6 months in my MySQL DB. I need to index it and get results back in my search as quickly as possible. I expect the volume to grow a lot, to something like 10 petabytes, since my crawlers work fast. I need fast read/write operations, and I need to integrate the store into my PHP app.

+2  A: 

That depends on the details of your requirements, but I think that in your case HBase would be the best option.
Using HBase as a web-crawler database is well documented, and storing crawled web pages is the example use case described in Google's Bigtable whitepaper, the design HBase is modeled on.
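
To make the Bigtable example a bit more concrete: the whitepaper keys its table of crawled pages by URL with the hostname components reversed, so that pages from the same domain sort next to each other. Below is a minimal sketch of building such a row key in Python. The function name `row_key` and the exact key layout are my own assumptions for illustration, not an HBase API, and no HBase client is involved.

```python
from urllib.parse import urlsplit


def row_key(url: str) -> str:
    """Build a row key with the hostname reversed (hypothetical helper).

    "http://www.cnn.com/index.html" becomes "com.cnn.www/index.html",
    so keys for one domain are adjacent when sorted lexicographically.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname is returned lowercased
    reversed_host = ".".join(reversed(host.split(".")))
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return reversed_host + path


if __name__ == "__main__":
    urls = [
        "http://www.cnn.com/index.html",
        "http://example.org/about",
        "http://money.cnn.com/markets",
    ]
    # Sorting the keys groups both cnn.com hosts together.
    for key in sorted(row_key(u) for u in urls):
        print(key)
```

Since HBase stores rows sorted by key, a layout like this would let a scan over one key prefix read a whole domain, but the right key design depends on your access patterns.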

Wojtek
A: 

Hi,

You can use Cassandra with Elasticsearch.

sirmak
A: 

You're looking for something that's meant for finding documents based on their content -- it should be based on an inverted index. I think that the most natural fit would be Lucene.
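
In case the term is unfamiliar, here is a toy sketch of what an inverted index does: it maps each term to the set of documents containing it, so a query only touches the entries for its own terms. This is plain in-memory Python, not Lucene's API or its on-disk format, and the names `tokenize`, `build_index` and `search` are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    # Hypothetical crawled pages, keyed by document id.
    docs = {
        1: "Cassandra stores wide rows",
        2: "Lucene builds an inverted index",
        3: "HBase stores crawled pages",
    }
    index = build_index(docs)
    print(search(index, "stores"))
    print(search(index, "inverted index"))
```

A real search library adds ranking, compression and on-disk segments on top of this idea, which is why I would use one rather than build this yourself at terabyte scale.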

See also this article about a Hadoop-Lucene stack for querying terabytes of documents.

Ken Bloom