I need to compute the quantiles of a large set of data.

Let's assume we can only get the data in portions (e.g. one row of a large matrix at a time). To compute the Q3 quantile, one needs to collect all the portions of the data, store them somewhere, sort them, and then read off the quantile:

List<double> allData = new List<double>();
// this is only an example; in fact the portions of data are not rows of a matrix
foreach (var row in matrix)
{
    allData.AddRange(row);
}

allData.Sort();
double p = 0.75 * allData.Count;
int idQ3 = (int)Math.Ceiling(p) - 1;
double Q3 = allData[idQ3];

Now, I would like to find a way of computing this without storing all the data in a separate variable.
The best solution would be to compute some intermediate parameters from the first row and then adjust them step by step for the following rows.

Note:

  • These datasets are really big (ca 5000 elements in each row)
  • The Q3 can be estimated, it doesn't have to be an exact value.
  • I call the portions of data "rows", but they can have different lengths! Usually the variation is not large (+/- a few hundred samples), but it varies!

This question is similar to this one: http://stackoverflow.com/questions/1058813/on-line-iterator-algorithms-for-estimating-statistical-median-mode-skewness but I need to compute quantiles. Also, there are a few articles on this topic, e.g.:
http://web.cs.wpi.edu/~hofri/medsel.pdf
http://portal.acm.org/citation.cfm?id=347195&dl

But before I try to implement these, I wanted to ask whether there are maybe any other, quicker ways of computing the 0.25/0.75 quantiles?

A: 
  1. Only retrieve the data you really need -- i.e., whatever value(s) is/are being used as the key for sorting, not everything else associated with it.
  2. You can probably use Tony Hoare's Select algorithm to find your quantile more quickly than sorting all the data.
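As a reference point, Hoare's selection algorithm ("quickselect") can be sketched as follows. This is a Python sketch rather than C#, and the function name is mine; it uses the same Q3 index rule as the question's code and finds the element in expected linear time without fully sorting:

```python
import math
import random

def quickselect(data, k):
    """Return the k-th smallest element (0-based) of data in expected
    O(n) time, without fully sorting the list (Hoare's selection)."""
    data = list(data)  # work on a copy; the caller's data stays intact
    lo, hi = 0, len(data) - 1
    while lo < hi:
        # random pivot guards against the O(n^2) worst case on sorted input
        pivot = data[random.randint(lo, hi)]
        left, right = lo, hi
        while left <= right:
            while data[left] < pivot:
                left += 1
            while data[right] > pivot:
                right -= 1
            if left <= right:
                data[left], data[right] = data[right], data[left]
                left += 1
                right -= 1
        # now data[lo..right] <= pivot <= data[left..hi]
        if k <= right:
            hi = right
        elif k >= left:
            lo = left
        else:
            return data[k]   # everything between right and left equals pivot
    return data[lo]

values = [5.0, 1.0, 9.0, 3.0, 7.0, 2.0, 8.0]
idQ3 = math.ceil(0.75 * len(values)) - 1   # same index rule as in the question
Q3 = quickselect(values, idQ3)             # Q3 == 8.0
```

Note this still requires all the data in memory at once, so it addresses the speed concern but not the storage one.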
Jerry Coffin
A: 

If your data has a Gaussian distribution, you can estimate the quantiles from the standard deviation. I assume your data isn't Gaussian distributed or you'd just be using the SD anyway.
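As a side note on the Gaussian shortcut: for (approximately) normal data the q-quantile is simply mean + z_q * SD, where z_q is the standard normal inverse CDF. A quick sketch using Python's standard library (the helper name is mine):

```python
from statistics import NormalDist

def gaussian_quantile(mean, sd, q):
    """For (approximately) Gaussian data, the q-quantile can be read
    straight off the mean and SD: mean + inv_cdf(q) * sd."""
    z = NormalDist().inv_cdf(q)   # z for q = 0.75 is about 0.6745
    return mean + z * sd
```

So for Gaussian data, Q1 and Q3 sit about 0.6745 standard deviations below and above the mean.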

If you can pass through your data twice, I'd do the following:

  • First pass, compute the max, min, SD and mean.
  • Second pass, divide the range [min, max] into some number of buckets (e.g. 100); do the same for (mean - 2*SD, mean + 2*SD), with extra buckets for outliers. Then run through the data again, tossing numbers into these buckets.
  • Count buckets until you are at 25% and 75% of the data. If you want to get extra-fancy, you can interpolate between bucket values. (I.e., if you need 10% of a bucket to hit your 25th percentile, assume the value is 10% of the way from the lower bound to the upper bound.)

This should give you a pretty good linear-time algorithm that works okay for most sets of not-entirely-perverse data.
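The two-pass bucketing idea above might be sketched like this (Python rather than C#, function name mine, using a plain [min, max] range without the extra outlier buckets):

```python
def two_pass_quantiles(data, qs=(0.25, 0.75), n_buckets=100):
    """Two passes over the data: first find min/max, then histogram
    everything and read the quantiles off cumulative bucket counts,
    interpolating inside the bucket that crosses each target."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_buckets or 1.0   # guard against all-equal data
    counts = [0] * n_buckets
    for x in data:
        # clamp so that x == hi lands in the last bucket
        i = min(int((x - lo) / width), n_buckets - 1)
        counts[i] += 1
    n, results = len(data), []
    for q in qs:
        target, cum = q * n, 0
        for i, c in enumerate(counts):
            if cum + c >= target:
                # interpolate within the bucket that crosses the target
                frac = (target - cum) / c if c else 0.0
                results.append(lo + (i + frac) * width)
                break
            cum += c
    return results
```

The error of each estimate is bounded by the bucket width, which is why the bucket ranges (and counts) matter so much.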

Rex Kerr
A: 

I second the idea of using buckets. Don't limit yourself to 100 buckets - might as well use 1 million. The tricky part is to pick your bucket ranges so that everything doesn't end up in a single bucket. Probably the best way to estimate your bucket ranges is to take a reasonable random sample of your data, compute the 10% and 90% quantiles using the simple sort algorithm, then generate equal-sized buckets to fill that range. It isn't perfect, but if your data isn't from a super-weird distribution, it should work.

If you can't do random samples, you're in more trouble. You can pick an initial bucketing guess based on your expected data distribution, then while working through your data if any bucket (typically the first or last bucket) gets overfull, start over again with a new bucket range.
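The sampling variant above could look like this (again a Python sketch with an invented function name; the sample's 10%/90% quantiles are computed exactly by sorting, and anything outside that range falls into underflow/overflow buckets):

```python
import random

def sampled_bucket_quantile(data, q, sample_size=1000, n_buckets=100_000):
    """Pick bucket ranges from a random sample, then histogram the
    full data; values outside the sampled 10%-90% range are counted
    in dedicated underflow/overflow buckets."""
    sample = sorted(random.sample(data, min(sample_size, len(data))))
    lo = sample[int(0.10 * len(sample))]
    hi = sample[int(0.90 * len(sample))]
    width = (hi - lo) / n_buckets or 1.0
    counts = [0] * (n_buckets + 2)   # counts[0]/counts[-1]: under/overflow
    for x in data:
        if x < lo:
            counts[0] += 1
        elif x >= hi:
            counts[-1] += 1
        else:
            counts[1 + min(int((x - lo) / width), n_buckets - 1)] += 1
    target, cum = q * len(data), 0
    for i, c in enumerate(counts):
        cum += c
        if cum >= target:
            if i == 0:
                return lo            # target fell inside the underflow bucket
            if i > n_buckets:
                return hi            # ...or inside the overflow bucket
            return lo + (i - 0.5) * width   # bucket midpoint estimate
    return hi
```

Because the under/overflow counts are exact, the estimate stays accurate as long as the target quantile actually falls between the sampled 10% and 90% marks, which is exactly the case for Q1/Q3.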

Keith Randall
A: 

Inspired by this answer I created a method that estimates the quantiles quite well. The approximation is close enough for my purposes.

The idea is the following: the 0.75 quantile is in fact the median of all values that lie above the global median. And respectively, the 0.25 quantile is the median of all values below the global median.

So if we can approximate the median, we can in similar way approximate the quantiles.

double median = 0;
double q1 = 0;
double q3 = 0;
double eta = 0.005;

// First pass: estimate the median.
foreach (var value in listOfValues) // or stream, or any other large set of data...
{
    median += eta * Math.Sign(value - median);
}

// Second pass. We know the median, so we can compute the quantiles.
foreach (var value in listOfValues)
{
    if (value < median)
        q1 += eta * Math.Sign(value - q1);
    else
        q3 += eta * Math.Sign(value - q3);
}

Remarks:

  • If the distribution of your data is strange, you will need a bigger eta to fit it, but the accuracy will be worse.
  • If the distribution is strange but you know the total size of your collection (i.e. N), you can adjust the eta parameter as follows: at the beginning, set eta to some fairly large value (e.g. 0.2). As the loop progresses, lower eta so that by the time you reach the end of the collection it is almost 0 (for example, compute it in the loop as eta = 0.2 - 0.2 * (i / (double)N);).
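The decaying-eta variant from the last remark might be sketched like this (Python instead of the C#, function name mine; the sign-update rule is the same as in the code above):

```python
def running_quantiles(values, eta0=0.2):
    """Two-pass streaming estimate: first approximate the median with a
    sign-based update whose step size decays linearly from eta0 to ~0,
    then do the same for Q1/Q3 on the values below/above the median."""
    def sign(d):                         # Math.Sign equivalent
        return (d > 0) - (d < 0)
    n = len(values)
    median = 0.0
    for i, x in enumerate(values):
        eta = eta0 - eta0 * (i / n)      # eta = 0.2 - 0.2*(i/N)
        median += eta * sign(x - median)
    q1 = q3 = median
    for i, x in enumerate(values):
        eta = eta0 - eta0 * (i / n)
        if x < median:
            q1 += eta * sign(x - q1)
        else:
            q3 += eta * sign(x - q3)
    return q1, median, q3
```

Note that eta0 must be on the scale of the data: 0.2 is reasonable for values in [0, 1] but far too small if the values span thousands.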
Gacek