I need to perform distributed searching across a largish set of small files (~10M), with each file being a set of key: value pairs. I have a set of servers with a total of 56 CPU cores available for this - mostly dual-core and quad-core machines, plus a large DL785 with 16 cores.

The system needs to be designed for online queries; ideally I'm looking to implement a web service that returns JSON output on demand to a front-end.
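For concreteness, a minimal sketch (Python, with a purely hypothetical file layout and path) of the kind of record and JSON output I have in mind:

```python
import json

def parse_kv_file(path):
    """Parse one file of 'key: value' lines into a dict (hypothetical layout)."""
    record = {}
    with open(path) as f:
        for line in f:
            if ':' not in line:
                continue
            key, value = line.split(':', 1)
            record[key.strip()] = value.strip()
    return record

# The web service would ultimately return something like:
# json.dumps({"results": [parse_kv_file("/data/docs/doc-000123")]})
```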

To further complicate matters, for any particular search I'll sometimes only want to look at the latest version of each file, while other searches may only apply to the versions of files that existed on a particular date.

I've looked at Hadoop, but the administration is pretty horrible and the default job submission methods are slow. It appears to be designed for offline, very large-scale processing rather than for online data processing.

CouchDB looks nice as a document store and understands key: value style documents, versioning and MapReduce, but I can't find anything about how it can be used as a distributed MapReduce system. All of the clustering documentation talks about clustering and replicating the entire database for load-balancing, whereas what I need is load-distribution.
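To illustrate what I mean - a single-node CouchDB view is easy enough (sketch below using the python-couchdb client; server URL, database and field names are hypothetical), but the map function only ever runs over the local database, not across a cluster:

```python
import couchdb  # python-couchdb client

server = couchdb.Server('http://localhost:5984/')  # hypothetical node
db = server['documents']                           # hypothetical database

# Design document with a map function over each document's key/value pairs.
db.save({
    '_id': '_design/search',
    'views': {
        'by_field': {
            'map': """
                function(doc) {
                    for (var k in doc) {
                        if (k.charAt(0) != '_') emit([k, doc[k]], null);
                    }
                }
            """
        }
    }
})

# Query the view for documents where status == 'active' (hypothetical field):
for row in db.view('search/by_field', key=['status', 'active']):
    print(row.id)
```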

I've also investigated various DHTs, and whilst they're fine for actually storing and retrieving individual records, they're generally poor at doing the 'map' part of MapReduce. Iterating over the complete document set is crucial.

Hence my ideal system would combine a distributed file system like Hadoop's HDFS with the web-service capabilities of CouchDB.

Can anyone point me in the direction of anything that might help? Implementation language isn't too much of a concern, except that it must run on Linux.

+1  A: 

It seems like the problem domain would be better suited to a solution like Solr. Solr offers HTTP interfaces to other applications and can return JSON. You could partition the search across multiple machines, or distribute a single copy across machines for load balancing (master/slave). It would depend on what works best for your data. But in my experience, for real-time search results Lucene/Solr is going to beat any map/reduce-based system.
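For example (Python; host names and the query field are hypothetical), a distributed search is just an HTTP GET with a shards parameter listing the Solr instances, and wt=json gives you JSON back directly:

```python
import json
import urllib.parse
import urllib.request

# One Solr instance per machine (hypothetical hosts).
SHARDS = 'server1:8983/solr,server2:8983/solr,server3:8983/solr'

params = urllib.parse.urlencode({
    'q': 'status:active',    # hypothetical indexed field
    'shards': SHARDS,        # fan the query out and merge the results
    'wt': 'json',            # JSON response writer
    'rows': 10,
})
with urllib.request.urlopen('http://server1:8983/solr/select?' + params) as resp:
    results = json.load(resp)

print(results['response']['numFound'])
```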

It's very simple to integrate Solr into an application and to do incremental updates. It doesn't really have any notion of versioning, though. If that's really necessary you might have to find another way to tack it on.
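A rough sketch of an incremental update (Python, posting the classic XML format to the /update handler; host and document fields are hypothetical):

```python
import urllib.request

SOLR_UPDATE = 'http://server1:8983/solr/update'  # hypothetical host

def post(xml):
    req = urllib.request.Request(SOLR_UPDATE, data=xml.encode('utf-8'),
                                 headers={'Content-Type': 'text/xml'})
    return urllib.request.urlopen(req).read()

# Add or replace a document (Solr replaces on a matching unique id).
post("""<add>
  <doc>
    <field name="id">doc-000123</field>
    <field name="status">active</field>
  </doc>
</add>""")

post('<commit/>')  # make the change visible to searches
```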

bmccormick
Solr looks interesting - thinking about it more, it's possible that all of our use cases might be doable with "search" rather than "map/reduce". One thing I can't find out is whether Solr will distribute a single query across multiple CPU threads.
Alnitak
and btw I don't mean shards - I have 15 systems available, each with anything between 2 and 16 cores. I don't want to run a separate shard for each core, just one per system.
Alnitak
A: 

I may be a bit confused about what your application needs are. You mention needing to search through key/value pairs, which is where Solr would be a great fit. But you also mention needing to use the map part of map/reduce and needing to scan 10M documents. I'm not sure you're going to find a solution that will scan 10M documents and return results in an online fashion (in the millisecond range). Another option is to look at HBase. It builds on top of HDFS and lets you run map/reduce jobs of the type you want, over millions of smaller items. But a job isn't going to be submittable and finished in anywhere near the time you're looking for.
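As a sketch of the client side (Python via the happybase Thrift client; host, table and column names are hypothetical - a production job would more likely be a Java map/reduce using TableMapper), a full-table scan looks like this:

```python
import happybase  # Python HBase client over Thrift

connection = happybase.Connection('hbase-master')  # hypothetical Thrift host
table = connection.table('documents')              # hypothetical table

# Scan every row, reading only the 'kv' column family.  Each row here is one
# of the small key/value documents; this is effectively the map-side input.
matches = []
for row_key, data in table.scan(columns=[b'kv']):
    if data.get(b'kv:status') == b'active':        # hypothetical filter
        matches.append(row_key)

print(len(matches), 'matching rows')
```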

I currently have a test HBase setup with RSS items (2M items, several KB per item). Total DB size is ~5 GB. There are several jobs that run against this DB, scanning all of the items and then outputting results. The cluster will scan items at ~5,000/second, but it still takes around 10 minutes to complete a job.
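For scale: 2M items at ~5,000 items/second works out to roughly 400 seconds (under 7 minutes) of pure scan time, so the balance of the ~10 minutes is presumably job start-up and output overhead rather than the scanning itself.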

bmccormick