views:

93

answers:

4

An rpc server is given which receives millions of requests a day. Each request i takes processing time Ti to get processed. We want to find the 65th percentile processing time (when processing times are sorted according to their values in increasing order) at any moment. We cannot store processing times of all the requests of the past as the number of requests is very large. And so the answer need not be exact 65th percentile, you can give some approximate answer i.e. processing time which will be around the exact 65th percentile number.

Hint: Its something to do how a histogram (i.e. an overview) is stored for a very large data without storing all of data.

A: 

you will need to store a running sum and a total count.

then check out standard deviation calculations.

Randy
-1 really? i do exactly this - and it works like a charm...
Randy
This approach will only work if the data is normally (i.e. gaussian) distributed.
eglaser
A: 

Take one day's data. Use it to figure out what size to make your buckets (say one day's data shows that the vast majority (95%?) of your data is within 0.5 seconds of 1 second (ridiculous values, but hang in)

To get 65th percentile, you'll want at least 20 buckets in that range, but be generous, and make it 80. So you divide your 1 second window (-0.5 seconds to +0.5 seconds) into 80 buckets by making each 1/80th of a second wide.

Each bucket is 1/80th of 1 second. Make bucket 0 be (center - deviation) = (1 - 0.5) = 0.5 to itself + 1/80th of a second. Bucket 1 is 0.5+1/80th - 0.5 + 2/80ths. Etc.

For every value, find out which bucket it falls in, and increment a counter for that bucket.

To find 65th percentile, get the total count, and walk the buckets from zero until you get to 65% of that total.

Whenever you want to reset, set the counters all to zero.

If you always want to have good data available, keep two of these, and alternate resetting them, using the one you reset least recently as having more useful data.

Slartibartfast
A: 

Use an updown filter:

if q < x:
    q += .01 * (x - q)  # up a little
else:
    q += .005 * (x - q)  # down a little

Here a quantile estimator q tracks the x stream, moving a little towards each x. If both factors were .01, it would move up as often as down, tracking the 50 th percentile. With .01 up, .005 down, it floats up, 67 th percentile; in general, it tracks the up / (up + down) th percentile. Bigger up/down factors track faster but noisier -- you'll have to experiment on your real data.

(I have no idea how to analyze updowns, would appreciate a link.)

The updown() below works on long vectors X, Q in order to plot them: alt text

#!/usr/bin/env python
from __future__ import division
import sys
import numpy as np
import pylab as pl

def updown( X, Q, up=.01, down=.01 ):
    """ updown filter: running ~ up / (up + down) th percentile
        here vecs X in, Q out to plot
    """
    q = X[0]
    for j, x in np.ndenumerate(X):
        if q < x:
            q += up * (x - q)  # up a little
        else:
            q += down * (x - q)  # down a little
        Q[j] = q
    return q

#...............................................................................
if __name__ == "__main__":

    N = 1000
    up = .01
    down = .005
    plot = 0
    seed = 1
    exec "\n".join( sys.argv[1:] )  # python this.py N= up= down=
    np.random.seed(seed)
    np.set_printoptions( 2, threshold=100, suppress=True )  # .2f

    title = "updown random.exponential: N %d  up %.2g  down %.2g" % (N, up, down)
    print title
    X = np.random.exponential( size=N )
    Q = np.zeros(N)
    updown( X, Q, up=up, down=down )
        # M = np.zeros(N)
        # updown( X, M, up=up, down=up )
    print "last 10 Q:", Q[-10:]
    if plot:
        fig = pl.figure( figsize=(8,3) )
        pl.title(title)
        x = np.arange(N)
        pl.plot( x, X, "," )
        pl.plot( x, Q )
        pl.ylim( 0, 2 )
        png = "updown.png"
        print >>sys.stderr, "writing", png
        pl.savefig( png )
        pl.show()
Denis
Looks to me like updown is tracking some biased variant of the mean, not the median.
Eamon Nerbonne
A: 

An easier way to get the value that represents a given percentile of a list or array is the scoreatpercentile function in the scipy.stats module.

>>>import scipy.stats as ss
>>>ss.scoreatpercentile(v,65)

there's a sibling percentileofscore to return the percentile given the value

Richard Careaga