views:

77

answers:

3

This is no doubt another noobish question, but I'll ask it anyways:

I have a data set of events with exact datetime in UTC. I'd like to create a line chart showing total number of events by day (date) in the specified date range. Right now I can retrieve the total data set for the needed date range, but then I need to go through it and count up for each date.

The app is running on google app engine and is using python.

What is the best way to create a new data set showing date and corresponding counts (including if there were no events on that date) that I can then use to pass this info to a django template?

Data set for this example looks like this:

class Event(db.Model):
    event_name = db.StringProperty()
    doe = db.DateTimeProperty()
    dlu = db.DateTimeProperty()
    user = db.UserProperty()

Ideally, I want something with date and count for that date.

Thanks and please let me know if something else is needed to answer this question!

A: 

I can't see how that would be possible with single query as GQL has no support for GROUP BY or aggregation generally.

jholster
This can't be done in GQL, but I'm hoping someone can help me do this in memory
Sologoub
+1  A: 

You'll have to do the binning in-memory (i.e. after the datastore fetch).

The .date() method of a datetime instance will facilitate your binning; it chops off the time element. Then you can use a dictionary to hold the bins:

bins = {}
for event in Event.all().fetch(1000):
    bins.setdefault(event.doe.date(), []).append( event )

Then do what you wish with (e.g. count) the bins. For a direct count:

counts = collections.defaultdict(int)
for event in Event.all().fetch(1000):
    counts[event.doe.date()] += 1
kostmo
sorry, can you elaborate a little or maybe include a link to a tutorial about "binning"?How do I retrieve a count for say 01/01/2010 from "counts"?
Sologoub
Ummmm ... you are asking how to retrieve a value from a dictionary? Something like this: `key = datetime.date(2010, 1, 1); the_answer = counts[key]` ... by the way, @kostmo's answer is incomplete; as you want ultimately a table that includes zero counts, you need to iterate over your whole date range either before or after counting.
John Machin
@John Machin: Actually, in my second snippet (the direct count), the defaultdict will automatically return "0" for any key that hasn't been explicitly set.
kostmo
@Sologoub: "Binning" is what you are asking for--that is, placing all events that occur on the same day into a single bin for that day... aka "histogramming".
kostmo
@kostmo: Yes of course it will automatically return etc etc but you didn't tell him to iterate over his dictionary to build the table he wants.
John Machin
thanks everyone! I'll give this a try and be back if I mess something up.
Sologoub
Just a note for whoever reads this in the future - here's the tutorial on dictionaries: http://docs.python.org/tutorial/datastructures.html
Sologoub
A: 

In order to minimize the amount of work you do, you'll probably want to write a task that sums up the per-day totals once, so you can reuse them. I'd suggest using the bulkupdate library to run a once-a-day task that counts events for the previous day, and creates a new model instance, with a key name based on the date, containing the count. Then, you can get all needed data points by doing a query (or better, a batch get) for the set of summary entities you need.

Nick Johnson
Thanks Nick. One problem is that I'm putting together a basic reporting engine with a number of filters. I did think of building aggregation tables (sort of data warehouse?) but that's going to be at a later stage. Also, google app engine offers decent cronjob interface, so running nightly aggregations should not be hard.
Sologoub
Bear in mind that cronjobs alone won't cut it for nightly aggregations if the number of records to aggregate becomes too high - hence my suggestion of the bulkupdate library.
Nick Johnson