views:

76

answers:

2

I'm trying to count the number of unique users per day on my Java App Engine app. I have decided to use the mapreduce framework (mapreduce.appspot.com) for Java App Engine to do this calculation offline. I've managed to create a mapreduce job that goes through all of my entities, each of which represents a single user's session event. I can use a simple counter as well. I have several questions though:

1) How do I increment a counter only once for each user id? I am currently mapping over entities which contain a user id property, but many of these entities may contain the same user id, so how do I count it only once?

2) Once I have the results of the job stored in these counters, how can I persist them to the datastore? I can see the counter values on the mapreduce status page, but I want these results automatically persisted to the datastore.

Ideas?

A: 

Why on Earth are you using map-reduce for this? Maybe I'm missing something, but this seems like a non-parallelizable problem, so there's no good reason for using a parallel-computing framework.

Why not just do something like this:

SELECT COUNT(DISTINCT user_id_column) FROM user_session_table 

See: http://www.w3schools.com/sql/sql_func_count.asp

Mike Baranczak
What you are missing is that the question is about app-engine, not SQL. You are assuming a SQL database, which is not the case here.
Peter Recore
Uhhh, have you used App Engine? You can't run typical SQL statements like your proposed solution.
aloo
Also, counting records is perfectly suitable for parallel work under many conditions, like when they are stored across a bunch of machines.
Peter Recore
+1  A: 

I haven't actually used the MapReduce functionality yet, but my theoretical understanding is that you can write things to the datastore from within your mapper. You could create an entity kind called something like UniqueCount, and insert one entity every time your mapper sees an ID that it hasn't seen before. Then you can count how many unique IDs you have. In fact, you can just update a counter every time you find a new unique entity. You may want to google "sharded counter" for hints on creating a counter in the datastore that can handle high throughput.
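A minimal sketch of that idea using the low-level datastore API (the UniqueCount kind and the markSeen helper are just illustrative names; you would call it from your map() with the user id property of the session entity):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;

public class UniqueUserMarker {

    private final DatastoreService datastore =
            DatastoreServiceFactory.getDatastoreService();

    /**
     * Records that a user id has been seen. The user id is used as the key
     * name, so the "have I seen this before" check is a simple get, and even
     * if two mappers write the same id concurrently you still end up with
     * exactly one UniqueCount entity per unique user.
     */
    public void markSeen(String userId) {
        Key key = KeyFactory.createKey("UniqueCount", userId);
        try {
            datastore.get(key);        // already recorded, nothing to do
        } catch (EntityNotFoundException e) {
            datastore.put(new Entity(key));   // first time we see this user id
        }
    }
}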

Eventually, when they finish the Reduce functionality, I imagine this whole task will become pretty trivial.

Peter Recore
"every time you find a unique entity" - how do you know if the entity you are looking at (currently mapping) is one you have seen before?
aloo
Let's say your mapper was just given the entity with userid ABC123. The first thing you will do is check to see if there is a UniqueCount entity for ABC123. If there is, you know you've already accounted for it, and you will do nothing. If there is not, you will create a UniqueCount entity for ABC123. After you've done that for all of your entities, you will have exactly one UniqueCount entity for each user. You can then do a more straightforward count of just the UniqueCount entities, as in the sketch below.
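For example, that final counting step could look roughly like this; the DailyUniqueUsers stats kind is purely hypothetical and is just one way to persist the result (your second question) instead of only seeing it on the status page:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Query;

public class UniqueUserCount {

    /** Counts the UniqueCount entities and stores the total under the given day. */
    public static int countAndStore(String day) {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

        // Keys-only query: we only need the number of entities, not their contents.
        Query q = new Query("UniqueCount").setKeysOnly();
        int total = datastore.prepare(q)
                .countEntities(FetchOptions.Builder.withDefaults());

        // Persist the result so it lives in the datastore, not just on the status page.
        Entity stats = new Entity("DailyUniqueUsers", day);  // key name = the day
        stats.setProperty("count", total);
        datastore.put(stats);
        return total;
    }
}

Note that datastore count queries walk an index, so this is fine for modest numbers of unique users but gets slow for very large counts.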
Peter Recore
Ahh, so this involves creating another entity kind in the datastore... and running two passes. Seems reasonable, but I was hoping for a simpler solution.
aloo
Well, right now we only have access to half of the map/reduce pair. The reduce phase is normally where you would do the final counting and account for duplicates. It would still be doing multiple passes, but they would be hidden behind the scenes. Right now, you're stuck doing the reduce phase yourself.
Peter Recore
If you want to do any operation that involves more data than you can process in one request (like counting distinct user ids in your session data), you'll be forced to write intermediate values to the datastore between requests. The steps I describe can be done in parallel, rather than as two passes. I described them as separate steps to make the process more transparent. Each time you add an entity to the "new" entity type, you could update some global counter, as sketched below.
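A rough sketch of that global counter using the usual sharded-counter pattern (the shard count and the CounterShard kind name are placeholder assumptions; a production version would also retry the transaction on contention):

import java.util.Random;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Transaction;

public class ShardedCounter {

    private static final int NUM_SHARDS = 20;   // assumption: 20 shards is enough here
    private static final Random RANDOM = new Random();

    private final DatastoreService datastore =
            DatastoreServiceFactory.getDatastoreService();

    /** Adds 1 to a randomly chosen shard inside a transaction. */
    public void increment(String counterName) {
        int shard = RANDOM.nextInt(NUM_SHARDS);
        Key key = KeyFactory.createKey("CounterShard", counterName + "-" + shard);
        Transaction tx = datastore.beginTransaction();
        try {
            Entity e;
            try {
                e = datastore.get(tx, key);
            } catch (EntityNotFoundException notFound) {
                e = new Entity(key);
                e.setProperty("count", 0L);
            }
            long count = (Long) e.getProperty("count");
            e.setProperty("count", count + 1);
            datastore.put(tx, e);
            tx.commit();
        } finally {
            if (tx.isActive()) {
                tx.rollback();
            }
        }
    }
}

Reading the total is then just a matter of summing the count property across all shards for that counter name.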
Peter Recore
Alright, that makes sense, thanks!
aloo