I have a super simple map reduce test... that isn't working consistently. In a nutshell, I'm just looking for duplicate records. I have a collection that has:

GiftIdea
- site_id
- site_key

The combination of site_id + site_key should be unique, but currently isn't. So I have the following map reduce code:

var map = function() { 
   print(this.site_key); 
   emit(this.site_id + this.site_key, 1);
};
var reduce = function(key,values) { 
   var sum=0;
   for(var i in values){ 
      print(key + ": " + ++sum); 
   } 
   return sum; 
};

With this input data:

GiftIdea
-site_id: amazon -site_key: 2
-site_id: amazon -site_key: 2
-site_id: amazon -site_key: 1

So I should get:

amazon1 => 2
amazon2 => 1

Here's what happens when I run it

> o = db.gift_ideas.mapReduce(map,reduce)                                                                        
{
    "result" : "tmp.mr.mapreduce_1283015268_136",
    "timeMillis" : 5,
    "counts" : {
        "input" : 3,
        "emit" : 3,
        "output" : 2
    },
    "ok" : 1,
}

OK, so far so good: 3 values were emitted and 2 results were output. But the results I get are:

amazon1 => 1.00000
amazon2 => 1.00000

In my log file, I have:

Sat Aug 28 13:22:50 [conn582] CMD: drop personalizr_test.tmp.mr.mapreduce_1283016170_139
Sat Aug 28 13:22:50 [conn582] CMD: drop personalizr_test.tmp.mr.mapreduce_1283016170_139_inc
1
2
1
Key: amazon1 Values: 2
Sat Aug 28 13:22:50 [conn582] building new index on { 0: 1 } for personalizr_test.tmp.mr.mapreduce_1283016170_139_inc
Sat Aug 28 13:22:50 [conn582] Buildindex personalizr_test.tmp.mr.mapreduce_1283016170_139_inc idxNo:0 { ns: "personalizr_test.tmp.mr.mapreduce_1283016170_139_inc", key: { 0: 1 }, name: "0_1" }
Sat Aug 28 13:22:50 [conn582] done for 2 records 0secs
Sat Aug 28 13:22:50 [conn582] building new index on { _id: 1 } for personalizr_test.tmp.mr.mapreduce_1283016170_139
Sat Aug 28 13:22:50 [conn582] Buildindex personalizr_test.tmp.mr.mapreduce_1283016170_139 idxNo:0 { name: "id", ns: "personalizr_test.tmp.mr.mapreduce_1283016170_139", key: { _id: 1 } }
Sat Aug 28 13:22:50 [conn582] done for 0 records 0secs
Key: amazon1 Values: 1
Key: amazon2 Values: 1
Sat Aug 28 13:22:50 [conn582] CMD: drop personalizr_test.tmp.mr.mapreduce_1283016170_139_inc
Sat Aug 28 13:22:50 [conn582] CMD: drop personalizr_test.All ideas grouped by key
Sat Aug 28 13:22:50 [conn582] end connection 127.0.0.1:56135

The 1, 2, 1 indicates that the map function is working correctly: those are the right items in the right order. The reduce function looks odd, though. Reduce is called for amazon1 twice, and the second time the value is incorrect. It also looks like mongo creates an index after the first call. I'm guessing that it waits for the first data to figure out what the data formats are going to be, so it can generate the index appropriately. But I don't understand why I'm getting the second call, "Key: amazon1 Values: 1".

Any suggestions?

A few other interesting tidbits: mongo 1.6.1, mongoid 2.0.0.beta16, bson 1.0.4, bson_ext 1.0.4

One thing that is REALLY peculiar is that it works on a different database with real data in it!

Here's what one of the records looks like in the populated database:

{ "_id" : ObjectId("4c69b7164914e54d9b007c34"), "avg_score" : null, "category_ids" : [ ], "created_at" : "Thu Aug 19 2010 05:57:25 GMT-0400 (EDT)", "desc" : null, "enabled" : null, "idea_ratings" : [ ], "images" : [
    {
        "url" : "http://ecx.images-amazon.com/images/I/515cLXdLUNL._SL75_.jpg",
        "_id" : ObjectId("4c69b7164914e54d9b007c35"),
        "height" : 61,
        "width" : 75
    }
], "num_ratings" : null, "owner_id" : null, "price" : -1, "rating_stats" : { "_id" : ObjectId("4c7746877719ad0712000dc8"), "total" : -1, "count" : 1, "average" : -1, "sum_of_weights" : 1 }, "ratings" : null, "response_groups" : [ ], "sales_rank" : 40751, "site_id" : "amazon", "site_key" : "B00001OPJE", "title" : "SNK NEOGEO Pocket Color Console in Platinum Silver", "updated_at" : "Fri Aug 27 2010 21:34:40 GMT-0400 (EDT)", "url" : "http://www.amazon.com/NEOGEO-Pocket-Color-Console-Platinum-Silver/dp/B00001OPJE?SubscriptionId=1VHSF1NEXNWHR2A8BA82&tag=gifter-20&linkCode=xm2&camp=2025&creative=165953&creativeASIN=B00001OPJE" }

And here's one of my samples:

{ "_id" : ObjectId("4c7948667719ad410f000005"), "created_at" : "Sat Aug 28 2010 13:33:26 GMT-0400 (EDT)", "enabled" : true, "rating_stats" : { "_id" : ObjectId("4c7948667719ad410f00000d"), "total" : 2, "count" : 2, "average" : 1, "sum_of_weights" : 2 }, "sales_rank" : 10, "site_id" : "amazon", "site_key" : "1", "title" : "title1", "updated_at" : "Sat Aug 28 2010 13:33:26 GMT-0400 (EDT)", "url" : "url1" }

Suggestions?

+1  A: 

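Ok, thanks to Eliot Horowitz on this one. He told me that my reduce function wasn't correct. Reduce can be re-run on its own output for the same key, so it has to add up the values it receives instead of counting them. You need to do something like: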

var reduce = function(key,values) {
   var sum=0;
   for(var i in values){
      sum += values[i];
      print(key + ": " + sum);
   }
   return sum;
};

Jeff D
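
To see why counting breaks when reduce is re-run, here is a minimal sketch in plain JavaScript that runs outside of mongo. It assumes exactly one extra pass, in which the partial result for a key is fed back into reduce as a one-element array. The server decides when and how often that really happens, so treat this as an illustration only. The names countReduce, sumReduce and reduceTwice are hypothetical and are not part of any MongoDB API.

```javascript
// Count-based reduce, as in the question: counts the entries in `values`.
function countReduce(key, values) {
  var sum = 0;
  for (var i in values) {
    ++sum;
  }
  return sum;
}

// Sum-based reduce, as in the answer: adds up the entries in `values`.
function sumReduce(key, values) {
  var sum = 0;
  for (var i in values) {
    sum += values[i];
  }
  return sum;
}

// Hypothetical helper (not a MongoDB API): reduce once, then pass the
// single partial result through reduce again, mimicking a re-reduce.
function reduceTwice(reduce, key, values) {
  var partial = reduce(key, values);
  return reduce(key, [partial]);
}

// The two values emitted for the duplicated key.
var emitted = [1, 1];

var countResult = reduceTwice(countReduce, "amazon1", emitted);
var sumResult = reduceTwice(sumReduce, "amazon1", emitted);

console.log("count-based: " + countResult); // 1 (first pass 2, then [2] has one entry)
console.log("sum-based: " + sumResult);     // 2 (first pass 2, then [2] sums to 2)
```

The count-based version returns 2 on the first pass and 1 on the second, which matches the "Values: 2" followed by "Values: 1" lines in the log above. The sum-based version gives the same answer no matter how many times its output is reduced again.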