Hello
I'm trying to use map/reduce
to process large amounts of binary data. The application is characterized by the following: the number of records is potentially large, so I don't really want to store each record as a separate file in HDFS (I was planning to concatenate them all into a single binary sequence file), and each record is a large, coherent (i.e. non-splittable) blob, between one and several hundred MB in size. The records will be consumed and processed by a C++ executable. If it weren't for the size of the records, the Hadoop Pipes API would be fine, but it seems to be built around handing each map/reduce task its input as a single contiguous block of bytes in memory, which is impractical for records this large.
I'm not sure of the best way to do this. Does any kind of buffered interface exist that would allow each M/R task to pull multiple blocks of data in manageable chunks? Otherwise I'm thinking of passing file offsets via the API and streaming in the raw data from HDFS on the C++ side.
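Concretely, the second option I have in mind looks something like the sketch below, using the libhdfs C API. The way the path/offset/length triple actually reaches the task, and the processChunk hook, are just placeholders for illustration, not anything Pipes provides:

// Sketch of streaming one record from HDFS in chunks on the C++ side.
// Assumes the map task has somehow been handed the file path, the byte
// offset and the length of one record (e.g. as a "path,offset,length"
// input value) -- that plumbing is not shown here.
#include <cstdint>
#include <fcntl.h>    // O_RDONLY
#include <vector>
#include "hdfs.h"     // libhdfs, ships with Hadoop

// Hypothetical hook where the existing C++ code would consume one chunk.
static void processChunk(const char* data, tSize len) {
    (void)data; (void)len;   // placeholder: feed the real record parser here
}

bool streamRecord(const char* path, int64_t offset, int64_t length) {
    // "default" picks up the filesystem from the local Hadoop configuration.
    hdfsFS fs = hdfsConnect("default", 0);
    if (!fs) return false;

    hdfsFile in = hdfsOpenFile(fs, path, O_RDONLY, 0, 0, 0);
    if (!in) { hdfsDisconnect(fs); return false; }

    bool ok = (hdfsSeek(fs, in, offset) == 0);

    std::vector<char> buf(4 * 1024 * 1024);   // 4 MB read buffer
    int64_t remaining = length;
    while (ok && remaining > 0) {
        tSize want = static_cast<tSize>(
            remaining < static_cast<int64_t>(buf.size()) ? remaining : buf.size());
        tSize got = hdfsRead(fs, in, buf.data(), want);
        if (got <= 0) { ok = false; break; }   // 0 = unexpected EOF, -1 = error
        processChunk(buf.data(), got);
        remaining -= got;
    }

    hdfsCloseFile(fs, in);
    hdfsDisconnect(fs);
    return ok;
}

This would keep memory use bounded by the chunk size rather than the record size, but it obviously bypasses whatever record handling Pipes would normally do for me.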
I'd appreciate opinions from anyone who's tried anything similar - I'm pretty new to Hadoop.