views:

90

answers:

2

I have a list of users that only administrators can see (= few reads). This list also displays a count of the number of users in the datastore. Because the list could grow larger than 1000 my first thought was to avoid a normal count() and instead use a sharded counter.

However, the problem is that the admins also have access to various search filters (in the GUI), such as only viewing male/female users and so on. It's important that the count reflects these filters, so that they can get the number of female users, male users and a myriad of other combinations.

Because of this, sharded counters and high concurrency counters without sharding don't seem like a good idea, because I would need to create a counter for every combination of search filters.

Should I simply create a loop of count() methods, such as described here or is this very bad practice? How would I do it otherwise?

Note that this counter is for an admin interface and would have a very limited number of reads. This is really a case of when I would like to sacrifice some read performance for flexibility and accuracy. Although it should be able to grow beyond 1000, it's not expected to grow larger than 10 000.

+2  A: 

"Loop of counts" is slow, but these days you can make it a bit better with cursors. Normally I would recommend denormalizing into all the "filtered" counters you need, but that slows down user addition and deletion (and probably demographic changes as well), so, given your particular use case with a very low volume of reads, you can probably get away with the "loop of counts" approach (plus cursors;-).

Alex Martelli
Thanks for your answer! Yes, I'm tempted by this approach considering I'll have very few reads and I'm not even sure the list will exceed 1000. When you talk about cursors, do you mean I should use cursors to decide the next position of count()?
Aneon
+2  A: 

I've tried two approaches:

1) Write my own task that queries the data store (the query is a key descending query) with a fixed limit of entities (say 50). It then enqueues the next task to start querying where it left off. Each task enqueues the next one passing it two parameters (where it last left off like a cursor and a running total of the number of entities it has seen).

2) This approach is much easier - and that is to use the mapreduce library provided by google for appengine. It runs totally in user space so you just have to download and build the library and include it in your project. Basically, it will handle iterating through all the entities you specify and lets you write a handler for what to do with each one (like incrementing a counter). See the details here: mapreduce.appspot.com - they even have a sample app that does just what you are asking for. THe only problem with this is that the results will appear in your browser and not necessarily stored in the datastore unless you do that yourself.

aloo
The second approach described here, using a mapreduce to recalculate, on a regular basis, all the important statistics, seems like the best approach.
Nick Johnson
Oh, I've never heard of MapReduce before, will have to look into that. Would this approach give me full accuracy or will it need to be updated periodically (like high concurrency counters without sharding that uses the task queue)? And does it require me to set up all possible filter combinations I want to be able to count manually?
Aneon
Well if the number of entities you have is changing during the map reduce, those entities will not be counted. The map reduce basically takes a snapshot at a certain point in time. In will NOT give you a real time count of the number of entities you have at any given moment. I use it to produce stats at the end of every day.
aloo
With respect to your filters - you will have to have a count for every possible combination - BUT you can have multiple counters for a single map reduce. A single map reduce can only iterate over one entity type. So say you have a person entity. For each entity that you map over you will increment a counter called "total". You will also increment a counter called "male" if the person is a male. You would also increment a counter called "male-birthday-in-june" to be able to filter entities by their sex and birthday month.
aloo
You can have many of these counters and their results will be displayed at the end of the map reduce. One thing I havent yet figured out is how to store the results of these map reduces to the datastore. Here's where I ask that and another question - both of which are not possible yet.... http://stackoverflow.com/questions/3129430/counting-unique-users-using-mapreduce-for-java-appengine
aloo