How to do this on Google App Engine (Python):
SELECT COUNT(DISTINCT user) FROM event WHERE event_type = "PAGEVIEW"
AND t >= start_time AND t <= end_time
Long version:
I have a Python Google App Engine application with users that generate events, such as pageviews. I would like to know how many unique users generated a pageview event in a given timespan. The timespan I am most interested in is one week, and there are about a million such events in a given week. I want to run this in a cron job.
My event entities look like this:
class Event(db.Model):
    t = db.DateTimeProperty(auto_now_add=True)
    user = db.StringProperty(required=True)
    event_type = db.StringProperty(required=True)
With an SQL database, I would do something like
SELECT COUNT(DISTINCT user) FROM event WHERE event_type = "PAGEVIEW"
AND t >= start_time AND t <= end_time
The first thought that occurs is to fetch all PAGEVIEW events and filter out duplicate users. Something like:
query = Event.all()
query.filter("event_type =", "PAGEVIEW")
query.filter("t >=", start_time)
query.filter("t <=", end_time)
usernames = []
for event in query:
    usernames.append(event.user)
answer = len(set(usernames))
But this won't work, because the datastore returns at most 1000 results per query. The next thing that occurs to me is to fetch 1000 events, then when those run out fetch the next thousand, and so on. But that won't work either, because paging through a million entities in thousand-entity batches would take well over 30 seconds, which is the request time limit.
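For reference, the batched version would look roughly like this, using query cursors (an untested sketch; it is still doomed by the 30-second deadline):

def count_distinct_users(start_time, end_time):
    # Pages through all matching events with query cursors.
    # Correct datastore usage, but with ~1M events this loop
    # will blow past the 30-second request deadline.
    usernames = set()
    query = Event.all()
    query.filter("event_type =", "PAGEVIEW")
    query.filter("t >=", start_time)
    query.filter("t <=", end_time)
    cursor = None
    while True:
        if cursor:
            query.with_cursor(cursor)
        batch = query.fetch(1000)
        if not batch:
            break
        usernames.update(event.user for event in batch)
        cursor = query.cursor()
    return len(usernames)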
Then I thought I could ORDER BY user to skip over duplicates faster. But that is not allowed, because the datastore requires the first sort order to be on the same property as the inequality filter, and I am already using the inequality "t >= start_time AND t <= end_time".
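Concretely, this is the combination the datastore rejects (a sketch; the error surfaces when the query runs):

query = Event.all()
query.filter("t >=", start_time)  # inequality filter on t...
query.filter("t <=", end_time)
query.order("user")               # ...but the first sort property must also be t
results = query.fetch(1000)       # the datastore rejects the query here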
It seems clear this cannot be accomplished in under 30 seconds, so it needs to be fragmented. But finding distinct items does not seem to split well into subtasks. The best I can think of is, on every cron job call, to fetch 1000 pageview events, extract the distinct usernames from those, and put them in an entity like Chard. It could look something like
class Chard(db.Model):
    usernames = db.StringListProperty(required=True)
So each chard would have up to 1000 usernames in it, fewer if there were duplicates that got removed. After about 16 hours (which is fine) I would have all the chards and could do something like:
chards = Chard.all()
all_usernames = set()
for chard in chards:
    all_usernames.update(chard.usernames)
answer = len(all_usernames)
It seems like it might work, but it is hardly a beautiful solution. And with enough unique users this loop might take too long. I haven't tested it, in the hope that someone will come up with a better suggestion, so I don't know whether this loop would turn out to be fast enough.
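For reference, each cron run might look roughly like this (an untested sketch; the Checkpoint entity for remembering the cursor between runs is my own invention, not something I have built):

class Checkpoint(db.Model):
    # Hypothetical helper entity that remembers where the last run stopped.
    cursor = db.TextProperty()

def cron_build_chard(start_time, end_time):
    query = Event.all()
    query.filter("event_type =", "PAGEVIEW")
    query.filter("t >=", start_time)
    query.filter("t <=", end_time)

    checkpoint = Checkpoint.get_by_key_name("pageview_chards")
    if checkpoint and checkpoint.cursor:
        query.with_cursor(checkpoint.cursor)

    events = query.fetch(1000)
    if events:
        # Deduplicate within this batch and store as one chard.
        Chard(usernames=list(set(e.user for e in events))).put()

    if checkpoint is None:
        checkpoint = Checkpoint(key_name="pageview_chards")
    checkpoint.cursor = query.cursor()
    checkpoint.put()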
Is there any prettier solution to my problem?
Of course all of this unique user counting could be accomplished easily with Google Analytics, but I am building a dashboard of application-specific metrics, and intend this to be the first of many stats.