views:

76

answers:

2

I'm trying to count the number of unique users per day on my Java App Engine app. I have decided to use the mapreduce framework (mapreduce.appspot.com) for Java App Engine to do this calculation offline. I've managed to create a mapreduce job that goes through all of my entities, each of which represents a single user's session event. I can use a simple counter as well. I have several questions though:

1) How do I increment a counter only once for each user id? I am currently mapping over entities which contain a user id property, but many of these entities may contain the same user id, so how do I count it only once?

2) Once I have the results of the job stored in these counters, how can I persist them to the datastore? I can see the counter values on the mapreduce status page, but I want these results automatically persisted to the datastore.

Ideas?

A: 

Why on Earth are you using map-reduce for this? Maybe I'm missing something, but this seems like a non-parallelizable problem, so there's no good reason for using a parallel-computing framework.

Why not just do something like this:

SELECT COUNT(DISTINCT user_id_column) FROM user_session_table 

See: http://www.w3schools.com/sql/sql_func_count.asp

Mike Baranczak
What you are missing is that the question is about app-engine, not SQL. You are assuming a SQL database, which is not the case here.
Peter Recore
Uhhh, have you used App Engine? You can't run typical SQL statements like your proposed solution.
aloo
Also, counting records is perfectly suitable for parallel work under many conditions, like when they are stored across a bunch of machines.
Peter Recore
+1  A: 

I haven't actually used the MapReduce functionality yet, but my theoretical understanding is that you can write things to the datastore from within your mapper. You could create an entity kind called something like UniqueCount, and insert one entity every time your mapper sees an ID that it hasn't seen before. Then you can count how many unique IDs you have. In fact, you can just update a counter every time you find a new unique entity. You may want to google "sharded counter" for hints on creating a counter in the datastore that can handle high throughput.
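A minimal sketch of that idea using the low-level datastore API (the UniqueCount kind and the markSeen helper are just illustrative names; you would call it from your map() with the user id property of the session entity):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;

public class UniqueUserMarker {

    private final DatastoreService datastore =
            DatastoreServiceFactory.getDatastoreService();

    /**
     * Records that a user id has been seen. The user id is used as the key
     * name, so the "have I seen this before" check is a simple get, and even
     * if two mappers write the same id concurrently you still end up with
     * exactly one UniqueCount entity per unique user.
     */
    public void markSeen(String userId) {
        Key key = KeyFactory.createKey("UniqueCount", userId);
        try {
            datastore.get(key);        // already recorded, nothing to do
        } catch (EntityNotFoundException e) {
            datastore.put(new Entity(key));   // first time we see this user id
        }
    }
}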

Eventually, when they finish the Reduce functionality, I imagine this whole task will become pretty trivial.

Peter Recore
"every time you find a unique entity" - how do you know if the entity you are looking at (currently mapping) is one you have seen before?
aloo
Let's say your mapper was just given the entity with userid ABC123. The first thing you will do is check to see if there is a UniqueCount entity for ABC123. If there is, you know you've already accounted for it, and you will do nothing. If there is not, you will create a UniqueCount entity for ABC123. After you've done that for all of your entities, you will have exactly one UniqueCount entity for each user. You can then do a more straightforward count of just the UniqueCount entities, as in the sketch below.
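For example, that final counting step could look roughly like this; the DailyUniqueUsers stats kind is purely hypothetical and is just one way to persist the result (your second question) instead of only seeing it on the status page:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Query;

public class UniqueUserCount {

    /** Counts the UniqueCount entities and stores the total under the given day. */
    public static int countAndStore(String day) {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

        // Keys-only query: we only need the number of entities, not their contents.
        Query q = new Query("UniqueCount").setKeysOnly();
        int total = datastore.prepare(q)
                .countEntities(FetchOptions.Builder.withDefaults());

        // Persist the result so it lives in the datastore, not just on the status page.
        Entity stats = new Entity("DailyUniqueUsers", day);  // key name = the day
        stats.setProperty("count", total);
        datastore.put(stats);
        return total;
    }
}

Note that datastore count queries walk an index, so this is fine for modest numbers of unique users but gets slow for very large counts.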
Peter Recore
Ahh, so this involves creating another entity kind in the datastore... and running two passes. Seems reasonable, but I was hoping for a simpler solution.
aloo
Well, right now we only have access to half of the map/reduce pair. The reduce phase is normally where you would do the final counting and account for duplicates. It would still be doing multiple passes, but they would be hidden behind the scenes. Right now, you're stuck doing the reduce phase yourself.
Peter Recore
If you want to do any operation that involves more data than you can process in one request (like counting distinct user ids in your session data), you'll be forced to write intermediate values to the datastore between requests. The steps I describe can be done in parallel, rather than as two passes. I described them as separate steps to make the process more transparent. Each time you add an entity to the "new" entity type, you could update some global counter, as sketched below.
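A rough sketch of that global counter using the usual sharded-counter pattern (the shard count and the CounterShard kind name are placeholder assumptions; a production version would also retry the transaction on contention):

import java.util.Random;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Transaction;

public class ShardedCounter {

    private static final int NUM_SHARDS = 20;   // assumption: 20 shards is enough here
    private static final Random RANDOM = new Random();

    private final DatastoreService datastore =
            DatastoreServiceFactory.getDatastoreService();

    /** Adds 1 to a randomly chosen shard inside a transaction. */
    public void increment(String counterName) {
        int shard = RANDOM.nextInt(NUM_SHARDS);
        Key key = KeyFactory.createKey("CounterShard", counterName + "-" + shard);
        Transaction tx = datastore.beginTransaction();
        try {
            Entity e;
            try {
                e = datastore.get(tx, key);
            } catch (EntityNotFoundException notFound) {
                e = new Entity(key);
                e.setProperty("count", 0L);
            }
            long count = (Long) e.getProperty("count");
            e.setProperty("count", count + 1);
            datastore.put(tx, e);
            tx.commit();
        } finally {
            if (tx.isActive()) {
                tx.rollback();
            }
        }
    }
}

Reading the total is then just a matter of summing the count property across all shards for that counter name.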
Peter Recore
Alright, that makes sense, thanks!
aloo