How to do this on Google App Engine (Python):

SELECT COUNT(DISTINCT user) FROM event WHERE event_type = "PAGEVIEW" 
AND t >= start_time AND t <= end_time

Long version:

I have a Python Google App Engine application with users that generate events, such as pageviews. I would like to know in a given timespan how many unique users generated a pageview event. The timespan I am most interested in is one week, and there are about a million such events in a given week. I want to run this in a cron job.

My event entities look like this:

from google.appengine.ext import db

class Event(db.Model):
    t = db.DateTimeProperty(auto_now_add=True)
    user = db.StringProperty(required=True)
    event_type = db.StringProperty(required=True)

With an SQL database, I would do something like

SELECT COUNT(DISTINCT user) FROM event WHERE event_type = "PAGEVIEW" 
AND t >= start_time AND t <= end_time

First thought that occurs is to get all PAGEVIEW events and filter out duplicate users. Something like:

query = Event.all()
query.filter("event_type =", "PAGEVIEW")
query.filter("t >=", start_time)
query.filter("t <=", end_time)
usernames = []
for event in query:
    usernames.append(event.user)
answer = len(set(usernames))

But this won't work, because a single query will only return up to 1000 entities. The next thing that occurs to me is to get 1000 events, then when those run out get the next thousand, and so on. But that won't work either, because going through a thousand queries and retrieving a million entities would take over 30 seconds, which is the request time limit.

Then I thought I should ORDER BY user so I could skip over duplicates faster. But that is not allowed, because I am already using the inequality filter t >= start_time AND t <= end_time, and the datastore requires the first sort order to be on the inequality property.

It seems clear this cannot be accomplished in under 30 seconds, so it needs to be fragmented. But finding distinct items doesn't seem to split well into subtasks. The best I can think of is, on every cron job call, to find 1000 pageview events, extract the distinct usernames from those, and put them into an entity like Chard. It could look something like

class Chard(db.Model):
    usernames = db.StringListProperty(required=True)
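
For concreteness, a rough sketch of what each cron job call might do, assuming the query cursor is carried over between calls (memcache is used here for brevity; a small datastore entity would survive eviction better). The function and key names are just illustrative:

from google.appengine.api import memcache
from google.appengine.ext import db

def build_one_chard(start_time, end_time, batch_size=1000):
    # One cron call: fetch the next batch of pageview events and
    # store the distinct usernames it contains as a Chard.
    query = Event.all()
    query.filter("event_type =", "PAGEVIEW")
    query.filter("t >=", start_time)
    query.filter("t <=", end_time)
    cursor = memcache.get("chard_cursor")
    if cursor:
        query.with_cursor(cursor)  # resume where the previous call left off
    events = query.fetch(batch_size)
    if events:
        Chard(usernames=list(set(e.user for e in events))).put()
        memcache.set("chard_cursor", query.cursor())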

So each chard would have up to 1000 usernames in it, fewer if there were duplicates that got removed. After about 16 hours (which is fine) I would have all the chards and could do something like:

chards = Chard.all()
all_usernames = set()
for chard in chards:
    all_usernames = all_usernames.union(chard.usernames)
answer = len(all_usernames)

It seems like it might work, but it's hardly a beautiful solution. And with enough unique users this loop might take too long. I haven't tested it in hopes that someone will come up with a better suggestion, so I don't know whether this loop would turn out to be fast enough.

Is there any prettier solution to my problem?

Of course all of this unique user counting could be accomplished easily with Google Analytics, but I am constructing a dashboard of application specific metrics, and intend this to be the first of many stats.

+1  A: 

Google App Engine, and more particularly GQL, does not support a DISTINCT function.

But you can use Python's set function as described in this blog and in this SO question.

tomlog
Thanks. I was aware of that SO question and the blog post, but they do not apply to this situation because of the size of the task.
Bemmu
+1  A: 

Here is a possibly-workable solution. It relies to an extent on using memcache, so there is always the possibility that your data would get evicted in an unpredictable fashion. Caveat emptor.

You would have a memcache variable called unique_visits_today or something similar. Every time a user has their first pageview of the day, you use memcache's incr() function to increment that counter.

Determining that this is the user's first visit is accomplished by looking at a last_activity_day field attached to the user. When the user visits, you look at that field, and if it is not today, you update it to today and increment your memcache counter.
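
In code, that check might look something like the sketch below; the UserInfo model and the counter key name are illustrative names, not part of the answer:

import datetime

from google.appengine.api import memcache
from google.appengine.ext import db

class UserInfo(db.Model):
    # Hypothetical per-user entity, keyed by username, holding the last day seen.
    last_activity_day = db.DateProperty()

def record_pageview(username):
    # Count each user at most once per day. Not transactional, so two
    # simultaneous first visits could double-count; acceptable for rough stats.
    today = datetime.date.today()
    info = UserInfo.get_or_insert(username)
    if info.last_activity_day != today:
        info.last_activity_day = today
        info.put()
        memcache.add("unique_visits_today", 0)   # create the counter if it is missing
        memcache.incr("unique_visits_today")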

At midnight each day, a cron job would take the current value in the memcache counter and write it to the datastore while setting the counter to zero. You would have a model like this:

class UniqueVisitsRecord(db.Model):
    # be careful setting date correctly if processing at midnight
    activity_date = db.DateProperty()
    event_count = db.IntegerProperty()
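
A sketch of the midnight cron handler described above, assuming it runs shortly after midnight and records the day that just ended (the counter key name is illustrative):

import datetime

from google.appengine.api import memcache

def midnight_rollup():
    # Persist yesterday's unique-visit count, then reset the counter for the new day.
    count = memcache.get("unique_visits_today") or 0
    UniqueVisitsRecord(
        activity_date=datetime.date.today() - datetime.timedelta(days=1),
        event_count=int(count)).put()
    memcache.set("unique_visits_today", 0)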

You could then quickly get all of the UniqueVisitsRecord entities that match any date range and add up the numbers in their event_count fields.
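
For example, a minimal sketch of that rollup query (note it sums per-day uniques, so a user who visits on several days is counted once per day):

def unique_visits_between(start_date, end_date):
    # Sum the per-day unique-visit counts over an inclusive date range.
    query = UniqueVisitsRecord.all()
    query.filter("activity_date >=", start_date)
    query.filter("activity_date <=", end_date)
    return sum(record.event_count for record in query)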

Adam Crossland
This relies on your value staying in memcache for an entire day. memcache is a cache, not reliable storage; this is only a good answer if you're happy to lose your count all the time.
Wooble