I need to perform distributed searching across a largish set of small files (~10M), where each file is a set of key: value pairs. I have a set of servers with a total of 56 CPU cores available for this - mostly dual-core and quad-core machines, plus a large DL785 with 16 cores.
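For concreteness, each file looks roughly like this (the field names are made up purely for illustration):

    name: some-record
    status: active
    owner: someuser
    created: 2009-05-01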
The system needs to be designed for online queries; ideally I'm looking to implement a web service which returns JSON output on demand to a front-end.
To further complicate matters, for some searches I'll only want to look at the latest version of each file, while other searches will only apply to the versions of the files that existed on a particular date.
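To make that concrete, the sort of request/response I have in mind is roughly the following (the URL, parameter names and field names are purely illustrative):

    GET /search?key=status&value=active&as_of=2009-06-01

    {
      "results": [
        {
          "file": "abc123",
          "version": "2009-05-28T14:02:11Z",
          "fields": {"status": "active", "owner": "someuser"}
        }
      ]
    }

Omitting as_of would mean "search the latest version of each file only".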
I've looked at Hadoop, but the administration is pretty horrible, and the default job submission methods are slow. It appears to be designed for offline, very large-scale processing rather than for online data processing.
CouchDB looks nice as a document store: it knows about key: value style documents, versioning and MapReduce, but I can't find anything about how it can be used as a distributed MapReduce system. All of the clustering documentation talks about clustering and replication of the entire database for load-balancing, whereas what I need is load-distribution.
I've also investigated various DHTs, and whilst they're fine for actually storing and retrieving individual records, they're generally poor at doing the 'map' part of MapReduce. Iterating over the complete document set is crucial.
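To illustrate what I mean by the 'map' part, here's a rough sketch (in Python, purely for illustration) of the kind of step I need each node to run in parallel over its local share of the ~10M files; the function names and the exact-match query are just placeholders:

    def parse_file(path):
        # Parse one small file of 'key: value' lines into a dict.
        doc = {}
        with open(path) as f:
            for line in f:
                if ':' in line:
                    key, value = line.split(':', 1)
                    doc[key.strip()] = value.strip()
        return doc

    def map_step(file_paths, key, value):
        # Emit (path, doc) for every document whose 'key' field equals 'value'.
        # In the real system this would run in parallel on each node, over
        # that node's local subset of the files.
        for path in file_paths:
            doc = parse_file(path)
            if doc.get(key) == value:
                yield path, doc

The hard part isn't writing something like this; it's getting it distributed, scheduled and returning results quickly enough for an online query.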
Hence my ideal system would combine a distributed file system like Hadoop's HDFS with the web-service capabilities of CouchDB.
Can anyone point me in the direction of anything that might help? Implementation language isn't too much of a concern, except that it must run on Linux.