I need to compute the quantiles of a large set of data.

Let's assume we can only get the data in portions (e.g. one row of a large matrix at a time). To compute the Q3 quantile, one needs to collect all the portions of the data, store them somewhere, sort them, and then read off the quantile:

List<double> allData = new List<double>();
// this is only an example; in fact the portions of data are not rows of a matrix
foreach (var row in matrix)
{
    allData.AddRange(row);
}

allData.Sort();
double p = 0.75 * allData.Count;
int idQ3 = (int)Math.Ceiling(p) - 1;
double Q3 = allData[idQ3];

Now, I would like to find a way of computing this without storing all the data in a separate variable.
The best solution would be to compute some intermediate parameters from the first row and then adjust them step by step for the following rows.

Note:

  • These datasets are really big (ca 5000 elements in each row)
  • The Q3 can be estimated, it doesn't have to be an exact value.
  • I call the portions of data "rows", but they can have different lengths! Usually the variation is not large (+/- a few hundred samples), but it varies!

This question is similar to this one: http://stackoverflow.com/questions/1058813/on-line-iterator-algorithms-for-estimating-statistical-median-mode-skewness but I need to compute quantiles. Also, there are a few articles on this topic, e.g.:
http://web.cs.wpi.edu/~hofri/medsel.pdf
http://portal.acm.org/citation.cfm?id=347195&dl

But before I try to implement these, I wanted to ask whether there are maybe any other, quicker ways of computing the 0.25/0.75 quantiles?

A: 
  1. Only retrieve the data you really need -- i.e., whatever value(s) is/are being used as the key for sorting, not everything else associated with it.
  2. You can probably use Tony Hoare's Select algorithm to find your quantile more quickly than sorting all the data.
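As a reference point, Hoare's selection algorithm ("quickselect") can be sketched as follows. This is a Python sketch rather than C#, and the function name is mine; it uses the same Q3 index rule as the question's code and finds the element in expected linear time without fully sorting:

```python
import math
import random

def quickselect(data, k):
    """Return the k-th smallest element (0-based) of data in expected
    O(n) time, without fully sorting the list (Hoare's selection)."""
    data = list(data)  # work on a copy; the caller's data stays intact
    lo, hi = 0, len(data) - 1
    while lo < hi:
        # random pivot guards against the O(n^2) worst case on sorted input
        pivot = data[random.randint(lo, hi)]
        left, right = lo, hi
        while left <= right:
            while data[left] < pivot:
                left += 1
            while data[right] > pivot:
                right -= 1
            if left <= right:
                data[left], data[right] = data[right], data[left]
                left += 1
                right -= 1
        # now data[lo..right] <= pivot <= data[left..hi]
        if k <= right:
            hi = right
        elif k >= left:
            lo = left
        else:
            return data[k]   # everything between right and left equals pivot
    return data[lo]

values = [5.0, 1.0, 9.0, 3.0, 7.0, 2.0, 8.0]
idQ3 = math.ceil(0.75 * len(values)) - 1   # same index rule as in the question
Q3 = quickselect(values, idQ3)             # Q3 == 8.0
```

Note this still requires all the data in memory at once, so it addresses the speed concern but not the storage one.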
Jerry Coffin
A: 

If your data has a Gaussian distribution, you can estimate the quantiles from the standard deviation. I assume your data isn't Gaussian distributed or you'd just be using the SD anyway.
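As a side note on the Gaussian shortcut: for (approximately) normal data the q-quantile is simply mean + z_q * SD, where z_q is the standard normal inverse CDF. A quick sketch using Python's standard library (the helper name is mine):

```python
from statistics import NormalDist

def gaussian_quantile(mean, sd, q):
    """For (approximately) Gaussian data, the q-quantile can be read
    straight off the mean and SD: mean + inv_cdf(q) * sd."""
    z = NormalDist().inv_cdf(q)   # z for q = 0.75 is about 0.6745
    return mean + z * sd
```

So for Gaussian data, Q1 and Q3 sit about 0.6745 standard deviations below and above the mean.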

If you can pass through your data twice, I'd do the following:

  • First pass, compute the max, min, SD and mean.
  • Second pass, divide the range [min, max] into some number of buckets (e.g. 100); do the same for (mean - 2*SD, mean + 2*SD), with extra buckets for outliers. Then run through the data again, tossing numbers into these buckets.
  • Count buckets until you are at 25% and 75% of the data. If you want to get extra-fancy, you can interpolate between bucket values. (I.e., if you need 10% of a bucket to hit your 25th percentile, assume the value is 10% of the way from the lower bound to the upper bound.)

This should give you a pretty good linear-time algorithm that works okay for most sets of not-entirely-perverse data.
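The two-pass bucketing idea above might be sketched like this (Python rather than C#, function name mine, using a plain [min, max] range without the extra outlier buckets):

```python
def two_pass_quantiles(data, qs=(0.25, 0.75), n_buckets=100):
    """Two passes over the data: first find min/max, then histogram
    everything and read the quantiles off cumulative bucket counts,
    interpolating inside the bucket that crosses each target."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_buckets or 1.0   # guard against all-equal data
    counts = [0] * n_buckets
    for x in data:
        # clamp so that x == hi lands in the last bucket
        i = min(int((x - lo) / width), n_buckets - 1)
        counts[i] += 1
    n, results = len(data), []
    for q in qs:
        target, cum = q * n, 0
        for i, c in enumerate(counts):
            if cum + c >= target:
                # interpolate within the bucket that crosses the target
                frac = (target - cum) / c if c else 0.0
                results.append(lo + (i + frac) * width)
                break
            cum += c
    return results
```

The error of each estimate is bounded by the bucket width, which is why the bucket ranges (and counts) matter so much.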

Rex Kerr
A: 

I second the idea of using buckets. Don't limit yourself to 100 buckets - might as well use 1 million. The tricky part is to pick your bucket ranges so that everything doesn't end up in a single bucket. Probably the best way to estimate your bucket ranges is to take a reasonable random sample of your data, compute the 10% and 90% quantiles using the simple sort algorithm, then generate equal-sized buckets to fill that range. It isn't perfect, but if your data isn't from a super-weird distribution, it should work.

If you can't do random samples, you're in more trouble. You can pick an initial bucketing guess based on your expected data distribution, then while working through your data if any bucket (typically the first or last bucket) gets overfull, start over again with a new bucket range.
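The sampling variant above could look like this (again a Python sketch with an invented function name; the sample's 10%/90% quantiles are computed exactly by sorting, and anything outside that range falls into underflow/overflow buckets):

```python
import random

def sampled_bucket_quantile(data, q, sample_size=1000, n_buckets=100_000):
    """Pick bucket ranges from a random sample, then histogram the
    full data; values outside the sampled 10%-90% range are counted
    in dedicated underflow/overflow buckets."""
    sample = sorted(random.sample(data, min(sample_size, len(data))))
    lo = sample[int(0.10 * len(sample))]
    hi = sample[int(0.90 * len(sample))]
    width = (hi - lo) / n_buckets or 1.0
    counts = [0] * (n_buckets + 2)   # counts[0]/counts[-1]: under/overflow
    for x in data:
        if x < lo:
            counts[0] += 1
        elif x >= hi:
            counts[-1] += 1
        else:
            counts[1 + min(int((x - lo) / width), n_buckets - 1)] += 1
    target, cum = q * len(data), 0
    for i, c in enumerate(counts):
        cum += c
        if cum >= target:
            if i == 0:
                return lo            # target fell inside the underflow bucket
            if i > n_buckets:
                return hi            # ...or inside the overflow bucket
            return lo + (i - 0.5) * width   # bucket midpoint estimate
    return hi
```

Because the under/overflow counts are exact, the estimate stays accurate as long as the target quantile actually falls between the sampled 10% and 90% marks, which is exactly the case for Q1/Q3.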

Keith Randall
A: 

Inspired by this answer I created a method that estimates the quantiles quite well. The approximation is close enough for my purposes.

The idea is the following: the 0.75 quantile is in fact the median of all values that lie above the global median. And respectively, the 0.25 quantile is the median of all values below the global median.

So if we can approximate the median, we can in similar way approximate the quantiles.

double median = 0;
double q1 = 0;
double q3 = 0;
double eta = 0.005;

// First pass: estimate the median.
foreach (var value in listOfValues) // or stream, or any other large set of data...
{
    median += eta * Math.Sign(value - median);
}

// Second pass. We know the median, so we can compute the quantiles.
foreach (var value in listOfValues)
{
    if (value < median)
        q1 += eta * Math.Sign(value - q1);
    else
        q3 += eta * Math.Sign(value - q3);
}

Remarks:

  • If the distribution of your data is strange, you will need a bigger eta to fit it, but the accuracy will be worse.
  • If the distribution is strange but you know the total size of your collection (i.e. N), you can adjust the eta parameter as follows: at the beginning, set eta to some fairly large value (e.g. 0.2). As the loop progresses, lower eta so that by the time you reach the end of the collection it is almost 0 (for example, compute it in the loop as eta = 0.2 - 0.2 * (i / (double)N);).
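The decaying-eta variant from the last remark might be sketched like this (Python instead of the C#, function name mine; the sign-update rule is the same as in the code above):

```python
def running_quantiles(values, eta0=0.2):
    """Two-pass streaming estimate: first approximate the median with a
    sign-based update whose step size decays linearly from eta0 to ~0,
    then do the same for Q1/Q3 on the values below/above the median."""
    def sign(d):                         # Math.Sign equivalent
        return (d > 0) - (d < 0)
    n = len(values)
    median = 0.0
    for i, x in enumerate(values):
        eta = eta0 - eta0 * (i / n)      # eta = 0.2 - 0.2*(i/N)
        median += eta * sign(x - median)
    q1 = q3 = median
    for i, x in enumerate(values):
        eta = eta0 - eta0 * (i / n)
        if x < median:
            q1 += eta * sign(x - q1)
        else:
            q3 += eta * sign(x - q3)
    return q1, median, q3
```

Note that eta0 must be on the scale of the data: 0.2 is reasonable for values in [0, 1] but far too small if the values span thousands.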
Gacek