I have about 25k documents (4 GB of raw JSON) that I want to run a few JavaScript operations on to make them more accessible to my end data consumer (R), and I would like to sort of "version control" these changes by writing each transformation to a new collection, but I cannot figure out how to use mapReduce without the reduce step. I want a one-to-one document mapping: I start with 25,356 documents in collection_1, and I want to end up with 25,356 documents in collection_2.
I can hack it with this:
var reducer = function(key, value_array) {
    // the values array always has exactly one element, since my mapper emits once per key
    return {key: value_array[0]};
};
And then call it like this:
db.flat_1.mapReduce(mapper, reducer, {keeptemp: true, out: 'flat_2'})
(My mapper only calls emit once, with a string as the first argument and the final document as the second. It's a collection of those second arguments that I really want.)
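For context, the mapper is along these lines. The transformation itself is simplified here and the field names (someField, otherField) and the key are just placeholders; the relevant part is the single emit of a string key and the finished document:

var mapper = function() {
    // build the cleaned-up document from the current document (`this`)
    var cleaned = {
        "finally": [this.someField],            // placeholder for the real transformation
        "thisIsWhatIWanted": [this.otherField]  // placeholder for the real transformation
    };
    // emit exactly once per document: a string key, and the finished document as the value
    emit(String(this._id), cleaned);
};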
But that seems awkward, and I don't know why it even works, since the arguments I pass to emit in my mapper are not equivalent to the return value of my reducer. Plus, I end up with a document like
{
    "_id": "0xWH4T3V3R",
    "value": {
        "key": {
            "finally": ["here"],
            "thisIsWhatIWanted": ["Yes!"]
        }
    }
}
which seems unnecessary.
Also, a cursor that performs its own inserts isn't even a tenth as fast as mapReduce. I don't know MongoDB well enough to benchmark it properly, but I would guess it's about 50x slower. Is there a way to run through a cursor in parallel? I don't care if the documents in my collection_2 are in a different order than those in collection_1.
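For comparison, the cursor-based version is roughly this: a single-threaded loop that transforms each document and inserts it one at a time, which I assume is why it's so much slower. The transformation is the same placeholder as above:

db.flat_1.find().forEach(function(doc) {
    // same placeholder transformation as in the mapper
    var cleaned = {
        "finally": [doc.someField],
        "thisIsWhatIWanted": [doc.otherField]
    };
    db.flat_2.insert(cleaned);
});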