Hi all, I have a long history with relational databases, but I'm new to MongoDB and MapReduce, so I'm almost positive I must be doing something wrong. I'll jump right into the question. Sorry if it's long.

I have a database table in MySQL that tracks the number of member profile views for each day. For testing it has 10,000,000 rows.

CREATE TABLE `profile_views` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `username` varchar(20) NOT NULL,
  `day` date NOT NULL,
  `hits` int(10) unsigned default '0',
  PRIMARY KEY  (`id`),
  UNIQUE KEY `username` (`username`,`day`),
  KEY `day` (`day`)
) ENGINE=InnoDB;

Typical data might look like this.

+--------+----------+------------+------+
| id     | username | day        | hits |
+--------+----------+------------+------+
| 650001 | Joe      | 2010-07-10 |    1 |
| 650002 | Jane     | 2010-07-10 |    2 |
| 650003 | Jack     | 2010-07-10 |    3 |
| 650004 | Jerry    | 2010-07-10 |    4 |
+--------+----------+------------+------+

I use this query to get the top 5 most viewed profiles since 2010-07-16.

SELECT username, SUM(hits)
FROM profile_views
WHERE day > '2010-07-16'
GROUP BY username
ORDER BY SUM(hits) DESC
LIMIT 5\G

This query completes in under a minute. Not bad!

Now moving on to the world of MongoDB. I set up a sharded environment using three servers: M, S1, and S2. I used the following commands to set the rig up (note: I've obscured the IP addresses).

S1 => 127.20.90.1
./mongod --fork --shardsvr --port 10000 --dbpath=/data/db --logpath=/data/log

S2 => 127.20.90.7
./mongod --fork --shardsvr --port 10000 --dbpath=/data/db --logpath=/data/log

M => 127.20.4.1
./mongod --fork --configsvr --dbpath=/data/db --logpath=/data/log
./mongos --fork --configdb 127.20.4.1 --chunkSize 1 --logpath=/data/slog

Once those were up and running, I hopped onto server M, launched the mongo shell, and issued the following commands:

use admin
db.runCommand( { addshard : "127.20.90.1:10000", name: "M1" } );
db.runCommand( { addshard : "127.20.90.7:10000", name: "M2" } );
db.runCommand( { enablesharding : "profiles" } );
db.runCommand( { shardcollection : "profiles.views", key : {day : 1} } );
use profiles
db.views.ensureIndex({ hits: -1 });
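
(As a sanity check, the shard registration and chunk distribution can be inspected from the mongos shell; a minimal sketch, assuming the setup above:)

// Run against the mongos on server M; standard shell helpers, output format varies by version.
use admin
db.runCommand({ listshards : 1 });   // should list shards M1 and M2
db.printShardingStatus();            // shows databases, shard keys, and chunk ranges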

I then imported the same 10,000,000 rows from MySQL, which gave me documents that look like this:

{
    "_id" : ObjectId("4cb8fc285582125055295600"),
    "username" : "Joe",
    "day" : "Fri May 21 2010 00:00:00 GMT-0400 (EDT)",
    "hits" : 16
}

Now comes the real meat and potatoes: my map and reduce functions. Back on server M, in the shell, I set up the query and execute it like this.

use profiles;
var start = new Date(2010, 7, 16);
var map = function() {
    emit(this.username, this.hits);
}
var reduce = function(key, values) {
    var sum = 0;
    for(var i in values) sum += values[i];
    return sum;
}
res = db.views.mapReduce(
    map,
    reduce,
    {
        query : { day: { $gt: start }}
    }
);

And here's where I run into problems. This query took over 15 minutes to complete! The MySQL query took under a minute. Here's the output:

{
        "result" : "tmp.mr.mapreduce_1287207199_6",
        "shardCounts" : {
                "127.20.90.7:10000" : {
                        "input" : 4917653,
                        "emit" : 4917653,
                        "output" : 1105648
                },
                "127.20.90.1:10000" : {
                        "input" : 5082347,
                        "emit" : 5082347,
                        "output" : 1150547
                }
        },
        "counts" : {
                "emit" : NumberLong(10000000),
                "input" : NumberLong(10000000),
                "output" : NumberLong(2256195)
        },
        "ok" : 1,
        "timeMillis" : 811207,
        "timing" : {
                "shards" : 651467,
                "final" : 159740
        }
}

Not only did it take forever to run, but the results don't even seem to be correct.

db[res.result].find().sort({ hits: -1 }).limit(5);
{ "_id" : "Joe", "value" : 128 }
{ "_id" : "Jane", "value" : 2 }
{ "_id" : "Jerry", "value" : 2 }
{ "_id" : "Jack", "value" : 2 }
{ "_id" : "Jessy", "value" : 3 }

I know those value numbers should be much higher.

My understanding of the whole MapReduce paradigm is that the task of performing this query should be split between all shard members, which should increase performance. After the import, I waited until Mongo had finished distributing the documents between the two shard servers. Each had almost exactly 5,000,000 documents when I started this query.

So I must be doing something wrong. Can anyone give me any pointers?

Edit: Someone on IRC mentioned adding an index on the day field, but as far as I can tell MongoDB created one automatically when the collection was sharded on day.

+1  A: 

You are not doing anything wrong. (Besides sorting on the wrong value as you already noticed in your comments.)

MongoDB map/reduce performance just isn't that great. This is a known issue; see for example http://jira.mongodb.org/browse/SERVER-1197 where a naive approach is ~350x faster than M/R.

One advantage, though, is that you can specify a permanent output collection name with the out argument of the mapReduce call. Once the M/R is complete, the temporary collection is renamed to the permanent name atomically. That way you can schedule your statistics updates and query the M/R output collection in real time.
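
(A minimal sketch of what that looks like, reusing the map, reduce, and start variables from the question; "profile_hits" is a hypothetical output collection name. The result documents have the shape { _id : <username>, value : <sum> }, so the top-5 query sorts on value, not hits.)

// Sketch only: "profile_hits" is a made-up name for the permanent output collection.
db.views.mapReduce(
    map,
    reduce,
    {
        query : { day : { $gt : start } },
        out : "profile_hits"
    }
);
// The M/R output documents look like { _id : "Joe", value : 128 },
// so sort on "value" to get the top 5.
db.profile_hits.find().sort({ value : -1 }).limit(5);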

mischa_u
Thanks for the response. I'm going to leave the question unanswered for just a bit longer to see if anyone else has some input. This is really disappointing though. I wonder where the bottleneck is? Perhaps it's because MongoDB is single threaded, so the server coordinating all the shards can only go so fast? I'm also curious about the results. It appears all 10 million docs were mapped, when most should have been excluded by the query.
mellowsoon
@mellowsoon: Verify your query by doing a count on the collection with the same arguments (and remember that the month for a JS Date object is zero-based).
mischa_u
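
(For reference, a sketch of that check against the question's views collection; note that new Date(2010, 7, 16) is actually August 16, 2010, because JavaScript months are zero-based, so July 16 would be new Date(2010, 6, 16).)

// A sketch of the suggested sanity check, using the same date arguments as the question.
var start = new Date(2010, 7, 16);           // month 7 = August; use 6 for July
db.views.count({ day : { $gt : start } });   // compare with the "input" count in the M/R output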
@mischa_u - Thanks, I'm doing that now. I've done a complete fresh install of Mongo on the 3 servers, and I'm importing the data now. Once that's done, I'll look at how the data is distributed between the shards, and pick a date range that should put half the matching docs on each shard.
mellowsoon
Just wanted to add a P.S.: WTF on months starting at zero?!
mellowsoon
A: 

You can also export the data to an RDBMS like MySQL and do the GROUP BY there. The export takes time, but the whole thing is probably still faster.

TTT
He just exported the data from an RDBMS into Mongo for testing. Clearly that's going the wrong way.
Gates VP
@Gates VP: Why? You can use MongoDB for OLTP and the RDBMS for analytics. See also: stackoverflow.com/questions/2599943/2613106#2613106 One of the ideas of the NoSQL movement is to end the one-database-fits-all thinking.
TTT
So it makes sense to use two DBs simultaneously.
TTT
+3  A: 

Excerpts from MongoDB: The Definitive Guide (O'Reilly):

The price of using MapReduce is speed: group is not particularly speedy, but MapReduce is slower and is not supposed to be used in “real time.” You run MapReduce as a background job, it creates a collection of results, and then you can query that collection in real time.

Options for map/reduce:

"keeptemp" : boolean
If the temporary result collection should be saved when the connection is closed.

"output" : string
Name for the output collection. Setting this option implies keeptemp : true.
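
(A sketch of how those options are passed using the command form of map/reduce; the collection, map, reduce, and start are the ones from the question, and "profile_hits" is a hypothetical output name. "out" is the command-level spelling of the option the excerpt calls "output".)

// Command-form map/reduce with a kept, named output collection (a sketch).
db.runCommand({
    mapreduce : "views",
    map : map,
    reduce : reduce,
    query : { day : { $gt : start } },
    keeptemp : true,
    out : "profile_hits"
});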
動靜能量
I think I misunderstood the purpose of MapReduce. I thought it was used to process a large amount of data faster than alternatives. I think I see now that it's more about the ability to process **huge** amounts of data that would otherwise be impossible to process on a single machine, and speed isn't a factor.
mellowsoon
@mellowsoon, of course the purpose of mapreduce is to process a large or huge amount of data fast. It is just MongoDB's implementation that isn't very fast.
TTT
@TTT - Thank you! Right now I'm thinking MongoDB is still the right choice for the type of data we're trying to save, but it looks like I might have to use some other map/reduce technologies to actually crunch the data.
mellowsoon