This question applies to Google App Engine, but isn't necessarily specific to it.

On Google App Engine, the database isn't relational, so aggregate functions (such as sum, average, etc.) can't be computed with a query. Each row is independent of the others. To calculate a sum or average, the app has to amortize the calculation by updating a running result on each individual write to the database, so that it's always up to date.
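For example, here is a minimal sketch of keeping a running sum and average up to date on every write, assuming the old google.appengine.ext.db API; the entity and function names are hypothetical:

    from google.appengine.ext import db

    class RunningStats(db.Model):
        # Single entity keyed by a well-known key_name. Under heavy write
        # load one entity becomes a contention bottleneck (see the usual
        # "sharded counters" pattern for that case).
        count = db.IntegerProperty(default=0)
        total = db.FloatProperty(default=0.0)

    def record_value(value):
        def txn():
            stats = RunningStats.get_by_key_name('global')
            if stats is None:
                stats = RunningStats(key_name='global')
            stats.count += 1
            stats.total += float(value)
            stats.put()
            return stats
        stats = db.run_in_transaction(txn)
        return stats.total / stats.count  # running average, always up to date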

How would one go about calculating percentile and frequency distribution (i.e. density)? I'd like to make a graph of the density of a field of values, and this set of values is probably on the order of millions. It may be feasible to loop through the whole dataset (the limit for each query is 1000 rows returned) and calculate from that, but I'd rather use a smarter approach.

Is there an algorithm to calculate or approximate the density/frequency/percentile distribution incrementally over time?

By the way, the data is unbounded in the sense that the maximum and minimum may be all over the place, so the distribution would have to take approximately the central 95% of the data and compute the density based only on that.

A: 

It may be feasible to loop through the whole dataset (the limit for each query is 1000 rows returned) and calculate from that, but I'd rather use a smarter approach.

This is the most obvious approach to me; why are you trying to avoid it?

hhafez
GAE puts limits on how long an operation can take and how much datastore CPU time it can use. Everything is done as an HTTP request, so there's only so much data you can churn through per request. Dividing a big job into multiple operations and combining the results might be more trouble than it's worth if there's a simpler approach.
Steve Jessop
+2  A: 

Getting the whole row (with that limit of 1000 at a time...) over and over again just to extract a single number per row is surely unappealing. So denormalize the data by recording that single number in a separate entity that holds a list of numbers (up to a limit of, I believe, 1 MB per entity, so with 4-byte numbers no more than about 250,000 numbers per list).

So when adding a number, also fetch the latest "added data values" list entity; if it's full, make a new one instead. Append the new number and save it. There's probably no need to make this transactional if a tiny error in the statistics is no killer, as you appear to imply.

If the value for an item can change, keep separate entities of the same kind recording the "deleted" data values: to change one item's value from 23 to 45, add 23 to the latest "deleted values" list and 45 to the latest "added values" list. This covers item deletion as well.
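A minimal sketch of this denormalization, again assuming the old google.appengine.ext.db API; the model, constant, and function names are hypothetical, not from the answer:

    from google.appengine.ext import db

    MAX_PER_CHUNK = 50000  # assumption: keep each entity well under the 1 MB limit

    class ValueChunk(db.Model):
        kind_tag = db.StringProperty(choices=('added', 'deleted'))
        values = db.ListProperty(float)
        created = db.DateTimeProperty(auto_now_add=True)

    def append_value(value, kind_tag='added'):
        # An equality filter plus a sort on another property needs a
        # composite index entry in index.yaml.
        chunk = (ValueChunk.all()
                 .filter('kind_tag =', kind_tag)
                 .order('-created')
                 .get())
        if chunk is None or len(chunk.values) >= MAX_PER_CHUNK:
            chunk = ValueChunk(kind_tag=kind_tag, values=[])
        chunk.values.append(float(value))
        chunk.put()

    def change_value(old, new):
        # Changing an item: record the old value as deleted, the new one as added.
        append_value(old, 'deleted')
        append_value(new, 'added')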

Alex Martelli
But what does that do? Instead of a row for each number, I now have a row for 250,000 numbers. How can that be used? Your answer made me think that it might work if I cherry-pick a number for every 1000 numbers, so that I obtain a statistically relevant sample that's also small enough to perform calculations on...
MTsoul
You have one entity per 250k numbers, so within 1000 rows you can have 250 million numbers. Since you say "this set of values is probably on the order of millions", as opposed to hundreds of millions, you should be able to fetch the relevant data back in a single query and perform whatever processing you want (if that exceeds a slice of CPU time, slice the work itself up in reasonable increments, of course); a sketch of this read-and-process step follows these comments.
Alex Martelli
Ah that makes sense. I will need to play around with this possibility then. Thanks.
MTsoul
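For completeness, a sketch of the read side discussed in the comments above, reusing the hypothetical ValueChunk model from the earlier sketch: fetch the chunks back and compute percentiles and a histogram trimmed to the central 95% in memory.

    def load_values():
        # Millions of values fit in a few hundred chunk entities, well within
        # the 1000-entity query limit. Subtracting "deleted" chunks is left
        # out for brevity.
        values = []
        for chunk in ValueChunk.all().filter('kind_tag =', 'added').fetch(1000):
            values.extend(chunk.values)
        values.sort()
        return values

    def percentile(sorted_values, p):
        # Nearest-rank style percentile on an already-sorted list.
        idx = int(p / 100.0 * (len(sorted_values) - 1))
        return sorted_values[idx]

    def histogram(sorted_values, bins=50):
        # Trim to the central 95% so stray minima/maxima don't stretch the axis.
        lo = percentile(sorted_values, 2.5)
        hi = percentile(sorted_values, 97.5)
        width = (hi - lo) / bins
        if width == 0:
            width = 1.0
        counts = [0] * bins
        for v in sorted_values:
            if lo <= v <= hi:
                counts[min(int((v - lo) / width), bins - 1)] += 1
        return counts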