I have inherited a MapReduce codebase that mainly calculates the number of unique user IDs seen over time for different ads. It doesn't look to me like it is being done very efficiently, and I would like to know if anyone has tips or suggestions on how to do this kind of calculation as efficiently as possible in MapReduce.

We use Hadoop, but I'll give an example in pseudocode, without all the cruft:

map(key, value):
  ad_id = ...   // extract from value
  user_id = ... // extract from value
  collect(ad_id, user_id)

reduce(ad_id, user_ids):
  unique_user_ids = new Set()
  foreach (user_id in user_ids):
    unique_user_ids.add(user_id)
  collect(ad_id, unique_user_ids.size)

It's not much code, and it's not hard to understand, but it's not very efficient. Every day we get more data, so every day we need to look at all the ad impressions from the beginning to calculate the number of unique user IDs for each ad; each run takes longer and uses more memory than the last. Moreover, without having actually profiled the code (I'm not sure how to do that in Hadoop), I'm pretty certain that almost all of the work is in creating the set of unique IDs. It eats enormous amounts of memory too.
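For concreteness, the set-building reduce step presumably looks roughly like this in Hadoop Java (a sketch under assumptions, not the actual inherited code; the class name and the Text input types are illustrative):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative reconstruction: one in-memory HashSet per ad.
class UniqueUsersReducer extends Reducer<Text, Text, Text, IntWritable> {
  @Override
  protected void reduce(Text adId, Iterable<Text> userIds, Context ctx)
      throws IOException, InterruptedException {
    // This set grows with the number of distinct users for the ad,
    // which is exactly where the memory goes.
    Set<String> uniqueUserIds = new HashSet<>();
    for (Text userId : userIds) {
      uniqueUserIds.add(userId.toString());
    }
    ctx.write(adId, new IntWritable(uniqueUserIds.size()));
  }
}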

I've experimented with non-MapReduce solutions and have gotten much better performance (though the question there is how to scale it the way I can scale with Hadoop), but it feels like there should be a better way of doing it in MapReduce than the code I have. It must be a common enough problem that others have solved it.

How do you count unique IDs efficiently in MapReduce?

+1  A: 

The problem is that the code you inherited was written with the mindset of "I'll determine the unique set myself" instead of "let's leverage the framework to do it for me".

I would do something like this (pseudocode) instead:

// job 1: the shuffle on the composite key deduplicates (ad_id, user_id) pairs
map(key, value):
  ad_id = ...   // extract from value
  user_id = ... // extract from value
  collect(ad_id & user_id, dummy)   // composite key; the value is unused

reduce(ad_id & user_id, dummy_values):
  output(ad_id, 1)   // exactly one record per unique (ad_id, user_id) pair

// job 2: count the deduplicated records per ad_id
map(ad_id, 1):   // identity mapper
  collect(ad_id, 1)

reduce(ad_id, ones):
  unique_user_ids = 0
  foreach (one in ones):
    unique_user_ids += one
  output(ad_id, unique_user_ids)
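In Hadoop Java the two jobs might look roughly like this (a minimal sketch, assuming input lines are tab-separated as ad_id<TAB>user_id and that job 1 writes SequenceFile output so job 2 can use the framework's default identity mapper; all class names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Job 1: emit the composite (ad_id, user_id) key; the shuffle brings all
// duplicates of a pair to the same reduce call, which emits it once.
class DedupMapper extends Mapper<Object, Text, Text, NullWritable> {
  @Override
  protected void map(Object key, Text value, Context ctx)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t"); // assumed: ad_id TAB user_id
    ctx.write(new Text(fields[0] + "\t" + fields[1]), NullWritable.get());
  }
}

class DedupReducer extends Reducer<Text, NullWritable, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  @Override
  protected void reduce(Text adAndUser, Iterable<NullWritable> ignored, Context ctx)
      throws IOException, InterruptedException {
    String adId = adAndUser.toString().split("\t")[0];
    ctx.write(new Text(adId), ONE); // one record per unique (ad_id, user_id)
  }
}

// Job 2: with SequenceFile output from job 1, the default (identity) Mapper
// passes (ad_id, 1) pairs straight through, so only the sum is needed here.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text adId, Iterable<IntWritable> ones, Context ctx)
      throws IOException, InterruptedException {
    int uniqueUserIds = 0;
    for (IntWritable one : ones) {
      uniqueUserIds += one.get();
    }
    ctx.write(adId, new IntWritable(uniqueUserIds));
  }
}

Since SumReducer's input and output types match, it can also be set as job 2's combiner to cut shuffle volume.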
Niels Basjes
+1  A: 

Niels' solution is good, but for an approximate alternative that is closer to the original code and uses only one MapReduce phase, just replace the set with a Bloom filter. Membership queries in a Bloom filter have a small probability of false positives, but the size estimate is very accurate.
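A minimal sketch of that swap, assuming Guava's BloomFilter is on the classpath (the expected-insertions and false-positive-rate parameters are illustrative guesses): the map side stays exactly as in the original code, and the reducer counts an ID only when put() reports the bits changed, i.e. the ID was definitely new, so false positives show up as a slight undercount.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

class ApproxUniqueUsersReducer extends Reducer<Text, Text, Text, IntWritable> {
  @Override
  protected void reduce(Text adId, Iterable<Text> userIds, Context ctx)
      throws IOException, InterruptedException {
    // Fixed-size filter replaces the unbounded Set: memory no longer
    // grows with the number of distinct users.
    BloomFilter<CharSequence> seen = BloomFilter.create(
        Funnels.stringFunnel(StandardCharsets.UTF_8),
        10_000_000,  // expected distinct users per ad (assumption)
        0.01);       // ~1% false-positive rate (assumption)
    int approxUnique = 0;
    for (Text userId : userIds) {
      // put() returns true only when the ID was definitely not seen before.
      if (seen.put(userId.toString())) {
        approxUnique++;
      }
    }
    ctx.write(adId, new IntWritable(approxUnique));
  }
}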

piccolbo