ansaurus

Question

What's the best way to count unique visitors with Hadoop?

Answer 1

+2 A:

You could do it as a 2-stage operation:

First step, emit (username => siteID), and have the reducer collapse multiple occurrences of each siteID using a set. Since you'd typically have far fewer sites than users, the set stays small and this should be fine.

Then in the second step, you can emit (siteID => username) and do a simple count, since the duplicates have been removed.
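As a rough illustration of the two stages, here is a minimal in-memory sketch in Python. It is not Hadoop code: `run_job` is a hypothetical stand-in for one map/shuffle/reduce pass, and the input is assumed to be a list of `(username, site_id)` pairs.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Hypothetical stand-in for one MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

# Stage 1: key by username; the reducer collapses duplicate site IDs with a set.
def stage1_map(record):
    username, site_id = record
    yield username, site_id

def stage1_reduce(username, site_ids):
    for site_id in sorted(set(site_ids)):
        yield username, site_id

# Stage 2: key by site ID; every user now appears at most once per site,
# so counting the values gives the number of unique visitors.
def stage2_map(record):
    username, site_id = record
    yield site_id, username

def stage2_reduce(site_id, usernames):
    yield site_id, len(usernames)

def count_unique_visitors(log):
    deduped = run_job(log, stage1_map, stage1_reduce)
    return dict(run_job(deduped, stage2_map, stage2_reduce))
```

Calling `count_unique_visitors` on a log in which the same user hits the same site several times counts that user only once for that site.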

tzaman 2010-05-21 20:48:35

Answer 2

A:

Use the secondary sort to sort on user id. That way, you don't need to hold anything in memory: just stream the data through, and increment your distinct counter every time the user id changes for a particular site id. Here is some documentation on this: http://hadoop.apache.org/common/docs/current/streaming.html#A+Useful+Partitioner+Class+%28secondary+sort%2C+the+-partitioner+org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner+option%29
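The reducer side of this could look roughly like the sketch below. It assumes the job is already configured so that each reducer receives tab-separated `site_id<TAB>user_id` lines, partitioned by site id and sorted by site id and then user id; the line format and the function name `count_distinct_sorted` are assumptions made for illustration.

```python
def count_distinct_sorted(lines):
    """Yield (site_id, distinct_user_count) from lines sorted by site, then user.

    Only the current site, the previous user id and a counter are kept,
    so memory use does not grow with the number of users.
    """
    current_site = None
    previous_user = None
    distinct = 0
    for line in lines:
        site_id, user_id = line.rstrip("\n").split("\t")
        if site_id != current_site:
            # A new site begins: flush the finished one and reset the state.
            if current_site is not None:
                yield current_site, distinct
            current_site, previous_user, distinct = site_id, None, 0
        if user_id != previous_user:
            # The user id changed, so this is a visitor not yet counted.
            distinct += 1
            previous_user = user_id
    if current_site is not None:
        yield current_site, distinct
```

In a streaming job the same function could be fed from `sys.stdin`, with each result printed as a tab-separated line.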

SquareCog 2010-05-24 19:22:19

Answer 3

A:

My approach is similar to what tzaman gave, with a small twist:

Job 1 map output: (username, siteid) => ("")
Job 1 reduce output: (siteid) => (1)
Job 2 map: identity mapper
Job 2 reduce: LongSumReducer (i.e. simply sum the values)

Note that the first reduce does not need to go over any of the records it gets presented. You can simply examine the key and produce the output.
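A small in-memory sketch of this variant, assuming the input is a list of `(username, site_id)` pairs. The helper `group_by_key` is a hypothetical stand-in for the shuffle phase, and plain `sum` plays the role of the summing reducer; none of this is Hadoop API code.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Hypothetical stand-in for the shuffle: collect values per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def unique_visitors(log):
    # Job 1 map: the composite key (username, site_id) carries all the
    # information; the value is just an empty string.
    job1_groups = group_by_key(
        ((username, site_id), "") for username, site_id in log
    )
    # Job 1 reduce: each distinct key shows up once, so the reducer only
    # inspects the key and never iterates over the grouped values.
    job1_output = [(site_id, 1) for (_username, site_id) in job1_groups]
    # Job 2: identity map, then sum the 1s per site.
    job2_groups = group_by_key(job1_output)
    return {site_id: sum(ones) for site_id, ones in job2_groups.items()}
```

Because duplicates collapse in the shuffle of the first job, no per-key set is needed in this version.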

HTH

Niels Basjes 2010-05-26 07:12:53
