Hey. I am developing a web crawler. Which is better for storing the data: Cassandra, Hadoop Hive, or MySQL? And why? I have 1 TB of data from the past 6 months in my MySQL DB. I need to index it and get results in my search ASAP. I expect it to store far more data, around 10 petabytes, since my crawlers work fast. I need read/write operations to be fast, and I need to integrate it into my PHP app.
A:
That depends on details of your requirements, but I think that in your case HBase would be the best option.
Using HBase as a web-crawler database is well documented, and it is the use case described in the Bigtable whitepaper, the design HBase is modeled on.
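To make the crawler use case a bit more concrete: the Bigtable paper keys crawled pages by URL with the hostname components reversed, so that pages from the same domain sort next to each other. Below is a minimal sketch of that key scheme in Python. The function name `row_key` is hypothetical, and this only builds the key string; it does not talk to HBase.

```python
from urllib.parse import urlsplit


def row_key(url: str) -> str:
    """Build a row key with the hostname reversed (hypothetical helper).

    Reversing the host groups pages of one domain together in a
    key-sorted store such as HBase.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))
    path = parts.path or "/"
    return reversed_host + path


if __name__ == "__main__":
    urls = [
        "http://www.google.com/about",
        "http://example.org/",
        "http://maps.google.com/index.html",
    ]
    # Sorting by key places both google.com hosts next to each other.
    for key in sorted(row_key(u) for u in urls):
        print(key)
```

With keys like these, a range scan over the `com.google.` prefix would return every crawled page of that domain in one pass.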
Wojtek
2010-08-17 22:32:45
A:
You're looking for something that's meant for finding documents based on their content -- it should be based on an inverted index. I think that the most natural fit would be Lucene.
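For readers unfamiliar with the term, here is a toy sketch of what an inverted index does: it maps each term to the set of documents containing it, so a query only touches the postings for its terms instead of scanning every document. This is an illustration in plain Python, not Lucene's implementation or API; the names `tokenize`, `build_index`, and `search` are made up for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    docs = {
        1: "Web crawler stores pages",
        2: "Crawler index",
        3: "Fast search index",
    }
    index = build_index(docs)
    print(search(index, "crawler index"))
```

A real engine adds ranking, on-disk postings, and incremental updates on top of this idea, which is why a dedicated library is a better fit than hand-rolled SQL `LIKE` queries.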
See also this article about a Hadoop-Lucene stack for querying terabytes of documents.
Ken Bloom
2010-08-20 03:48:07