Hello,

Our company is working on a project that requires a database with 30-50 million rows of product data. These rows contain text that needs to be searched concurrently thousands of times per second. Moreover, each search needs to take less than one second to execute.

So, all in all, we have a 50M row database that needs to be searched thousands of times per second. Keep in mind that these are fulltext searches. I know MySQL, or any relational database alone, cannot handle this type of job. So we're looking for someone who can design the right setup for us and help us implement it, for a price you specify.

First off, we'd like to know what our best options are. I've personally been researching things such as Sphinx, Lucene, Cassandra, MongoDB, CouchDB, Solr, etc., but I really don't know which should be used in conjunction with which to give us the most efficient setup possible.

So, if anyone could just give some advice, or take up our job offer, it would be greatly appreciated.

You can contact me via PM here, and I'll give you my email/IM/phone number to further discuss.

Thanks!

+1  A: 

Paul, welcome to SO. This isn't really the right place to try to get someone to work for you, but here's my advice:

Truthfully, depending on the types of searches you are doing, writing MySQL off may be a bit premature.

Since it's product data, I'd imagine your searches are fulltext searches, in which case writing off MySQL isn't premature. Sphinx is great but a bit of a pain to configure. The benefit is that it can index directly from MySQL, and you can talk to it with whatever MySQL connector/bindings your application already uses, because it speaks MySQL's protocol.

I'd say Cassandra, CouchDB, and MongoDB are not really what you are looking for, since none of them natively index text the way Sphinx does. You could roll your own on top of them, but it would be pretty counterproductive.

I've never worked with Lucene, but I've heard good things. It's a similar solution to Sphinx, afaik.

good luck

anq
Hey, thanks for the response! And yes, I forgot to mention that they are fulltext searches. The reason I'm ruling out MySQL is table locking. Fulltext features require MyISAM, which locks tables and would therefore hurt the thousands of concurrent searches we'd need performed every second. Also, MySQL's fulltext searches are slower than the alternatives. I'm hoping that pairing MySQL with Sphinx can take care of both these issues, but I'm not really sure, which is why I posted here :) Thanks again!
Paul Bakoyiannis
A: 

Hi

Storing data and searching are two different things. If you look at architectures like eBay's, they have separate services and servers for search. 50M rows is nothing; you can store that in any of these datastores. None of them is perfect, so the difference comes down to use cases. For example:

Cassandra has the fastest insert performance at any data size, can scale to petabytes across hundreds of machines easily (no need to shard), has high durability, and has Lucandra (a Cassandra-Lucene integration that scales well with massive data but is a toy compared to elasticsearch).

MongoDB has more query options (it uses B-tree indexes, like a DBMS), recently gained autosharding, and can index all fields, but it has poor durability.

PostgreSQL is the most advanced open source DBMS out there, recently gained built-in master/slave replication, can scale by sharding, and is ACID and SQL compliant.

CouchDB has no advantage over the others in any use case I can think of. It's damn slow, and if I needed ACID I would probably use PostgreSQL.

The built-in fulltext search functionality in these datastores has some problems and is not scalable.

The most advanced open source search engine (massive data, high performance, simple, distributed, fault tolerant, REST API) is elasticsearch; you can think of it as distributed Lucene. Solr is legacy compared to elasticsearch. Using raw Lucene or Sphinx is not scalable.

If I were you, I would probably choose one of the datastores, use elasticsearch for indexing, and synchronize the two in my data access layer (the index needs to be modified on every db insert/update/delete).
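To make the synchronization idea concrete, here is a minimal sketch of such a data access layer. It is only an illustration under stated assumptions: `sqlite3` stands in for whichever datastore you pick, and `InMemorySearchIndex` is a hypothetical toy inverted index standing in for elasticsearch (it is not the elasticsearch client API). `ProductRepository` and its method names are made up for the example as well.

```python
import sqlite3
from collections import defaultdict


class InMemorySearchIndex:
    """Hypothetical stand-in for a real search engine such as elasticsearch."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> ids of documents containing it
        self._docs = {}                    # id -> set of terms currently indexed

    def index(self, doc_id, text):
        # Re-indexing replaces any earlier version of the document.
        self.delete(doc_id)
        terms = set(text.lower().split())
        self._docs[doc_id] = terms
        for term in terms:
            self._postings[term].add(doc_id)

    def delete(self, doc_id):
        for term in self._docs.pop(doc_id, set()):
            self._postings[term].discard(doc_id)

    def search(self, query):
        # AND semantics: a document must contain every query term.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


class ProductRepository:
    """Data access layer: every write goes to the datastore, then to the index."""

    def __init__(self, conn, index):
        self.conn = conn
        self.index = index
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS products "
            "(id INTEGER PRIMARY KEY, description TEXT NOT NULL)"
        )

    def save(self, product_id, description):
        # Insert or update the row, then mirror the change into the index.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO products (id, description) VALUES (?, ?)",
                (product_id, description),
            )
        self.index.index(product_id, description)

    def remove(self, product_id):
        with self.conn:
            self.conn.execute("DELETE FROM products WHERE id = ?", (product_id,))
        self.index.delete(product_id)

    def search(self, query):
        # The index answers the fulltext query; the datastore supplies the rows.
        ids = sorted(self.index.search(query))
        if not ids:
            return []
        placeholders = ",".join("?" * len(ids))
        rows = self.conn.execute(
            f"SELECT id, description FROM products "
            f"WHERE id IN ({placeholders}) ORDER BY id",
            ids,
        )
        return rows.fetchall()
```

Usage would look something like `repo = ProductRepository(sqlite3.connect(":memory:"), InMemorySearchIndex())`, followed by `repo.save(...)`, `repo.remove(...)` and `repo.search(...)`. Note that this sketch does not handle the case where the datastore write succeeds and the index update fails; a real setup would need retries or a periodic reindex to keep the two consistent.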

Regards

sirmak