views:

115

answers:

4

Ok, so here's my problem. We are looking at purchasing a data set from a company to augment our existing data set. For the purposes of this question, let's say that this data set ranks places with an organic number (meaning that the number assigned to one place has no bearing on the number assigned to another). The technical range is 0 to infinity, but from sample sets that I've seen, it's 0 to 70. Based on the sample, it's most definitely not a uniform distribution (out of 10,000 there are maybe 5 places with a score over 40, 50 with a score over 10, and 1000 with a score over 1). Before we decide to purchase this set, we would like to simulate it so that we can see how useful it may be.

So, to simulate it, I've been thinking about generating a random number for each place (about 150,000 random numbers). But, I also want to keep to the spirit of the data, and keep the distribution relatively the same (or at least reasonably close). I've been racking my brain all day trying to think of a way to do it, and have come up empty.

One thought I had was to square the random number (between 0 and sqrt(70)). But that would favor both less than 1 and larger numbers.

I'm thinking that he real distribution should be hyperbolic in the first quadrant... I'm just blanking on how to turn a linear, even distribution of random numbers into a hyperbolic distribution (If hyperbolic is even what I want in the first place).

Any thoughts?

So, to sum, here's the distribution I would like (approximately):

  • 40 - 70: 0.02% - 0.05%
  • 10 - 40: 0.5% - 1%
  • 1 - 10: 10% - 20%
  • 0 - 1 : Remainder (78.95% - 89.48%)
+2  A: 

Written years ago for PHP4, simply pick your distribution:

<?php

define( 'RandomGaussian',           'gaussian' ) ;          //  gaussianWeightedRandom()
define( 'RandomBell',               'bell' ) ;              //  bellWeightedRandom()
define( 'RandomGaussianRising',     'gaussianRising' ) ;    //  gaussianWeightedRisingRandom()
define( 'RandomGaussianFalling',    'gaussianFalling' ) ;   //  gaussianWeightedFallingRandom()
define( 'RandomGamma',              'gamma' ) ;             //  gammaWeightedRandom()
define( 'RandomGammaQaD',           'gammaQaD' ) ;          //  QaDgammaWeightedRandom()
define( 'RandomLogarithmic10',      'log10' ) ;             //  logarithmic10WeightedRandom()
define( 'RandomLogarithmic',        'log' ) ;               //  logarithmicWeightedRandom()
define( 'RandomPoisson',            'poisson' ) ;           //  poissonWeightedRandom()
define( 'RandomDome',               'dome' ) ;              //  domeWeightedRandom()
define( 'RandomSaw',                'saw' ) ;               //  sawWeightedRandom()
define( 'RandomPyramid',            'pyramid' ) ;           //  pyramidWeightedRandom()
define( 'RandomLinear',             'linear' ) ;            //  linearWeightedRandom()
define( 'RandomUnweighted',         'non' ) ;               //  nonWeightedRandom()



function mkseed()
{
    srand(hexdec(substr(md5(microtime()), -8)) & 0x7fffffff) ;
}   //  function mkseed()




/*
function factorial($in) {
    if ($in == 1) {
        return $in ;
    }
    return ($in * factorial($in - 1.0)) ;
}   //  function factorial()


function factorial($in) {
    $out = 1 ;
    for ($i = 2; $i <= $in; $i++) {
        $out *= $i ;
    }

    return $out ;
}   //  function factorial()
*/




function random_0_1()
{
    //  returns random number using mt_rand() with a flat distribution from 0 to 1 inclusive
    //
    return (float) mt_rand() / (float) mt_getrandmax() ;
}   //  random_0_1()


function random_PN()
{
    //  returns random number using mt_rand() with a flat distribution from -1 to 1 inclusive
    //
    return (2.0 * random_0_1()) - 1.0 ;
}   //  function random_PN()




function gauss()
{
    static $useExists = false ;
    static $useValue ;

    if ($useExists) {
        //  Use value from a previous call to this function
        //
        $useExists = false ;
        return $useValue ;
    } else {
        //  Polar form of the Box-Muller transformation
        //
        $w = 2.0 ;
        while (($w >= 1.0) || ($w == 0.0)) {
            $x = random_PN() ;
            $y = random_PN() ;
            $w = ($x * $x) + ($y * $y) ;
        }
        $w = sqrt((-2.0 * log($w)) / $w) ;

        //  Set value for next call to this function
        //
        $useValue = $y * $w ;
        $useExists = true ;

        return $x * $w ;
    }
}   //  function gauss()


function gauss_ms( $mean,
                   $stddev )
{
    //  Adjust our gaussian random to fit the mean and standard deviation
    //  The division by 4 is an arbitrary value to help fit the distribution
    //      within our required range, and gives a best fit for $stddev = 1.0
    //
    return gauss() * ($stddev/4) + $mean;
}   //  function gauss_ms()


function gaussianWeightedRandom( $LowValue,
                                 $maxRand,
                                 $mean=0.0,
                                 $stddev=2.0 )
{
    //  Adjust a gaussian random value to fit within our specified range
    //      by 'trimming' the extreme values as the distribution curve
    //      approaches +/- infinity
    $rand_val = $LowValue + $maxRand ;
    while (($rand_val < $LowValue) || ($rand_val >= ($LowValue + $maxRand))) {
        $rand_val = floor(gauss_ms($mean,$stddev) * $maxRand) + $LowValue ;
        $rand_val = ($rand_val + $maxRand) / 2 ;
    }

    return $rand_val ;
}   //  function gaussianWeightedRandom()


function bellWeightedRandom( $LowValue,
                             $maxRand )
{
    return gaussianWeightedRandom( $LowValue, $maxRand, 0.0, 1.0 ) ;
}   //  function bellWeightedRandom()


function gaussianWeightedRisingRandom( $LowValue,
                                       $maxRand )
{
    //  Adjust a gaussian random value to fit within our specified range
    //      by 'trimming' the extreme values as the distribution curve
    //      approaches +/- infinity
    //  The division by 4 is an arbitrary value to help fit the distribution
    //      within our required range
    $rand_val = $LowValue + $maxRand ;
    while (($rand_val < $LowValue) || ($rand_val >= ($LowValue + $maxRand))) {
        $rand_val = $maxRand - round((abs(gauss()) / 4) * $maxRand) + $LowValue ;
    }

    return $rand_val ;
}   //  function gaussianWeightedRisingRandom()


function gaussianWeightedFallingRandom( $LowValue,
                                        $maxRand )
{
    //  Adjust a gaussian random value to fit within our specified range
    //      by 'trimming' the extreme values as the distribution curve
    //      approaches +/- infinity
    //  The division by 4 is an arbitrary value to help fit the distribution
    //      within our required range
    $rand_val = $LowValue + $maxRand ;
    while (($rand_val < $LowValue) || ($rand_val >= ($LowValue + $maxRand))) {
        $rand_val = floor((abs(gauss()) / 4) * $maxRand) + $LowValue ;
    }

    return $rand_val ;
}   //  function gaussianWeightedFallingRandom()


function logarithmic($mean=1.0, $lambda=5.0)
{
    return ($mean * -log(random_0_1())) / $lambda ;
}   //  function logarithmic()


function logarithmicWeightedRandom( $LowValue,
                                    $maxRand )
{
    do {
        $rand_val = logarithmic() ;
    } while ($rand_val > 1) ;

    return floor($rand_val * $maxRand) + $LowValue ;
}   //  function logarithmicWeightedRandom()


function logarithmic10( $lambda=0.5 )
{
    return abs(-log10(random_0_1()) / $lambda) ;
}   //  function logarithmic10()


function logarithmic10WeightedRandom( $LowValue,
                                      $maxRand )
{
    do {
        $rand_val = logarithmic10() ;
    } while ($rand_val > 1) ;

    return floor($rand_val * $maxRand) + $LowValue ;
}   //  function logarithmic10WeightedRandom()


function gamma( $lambda=3.0 )
{
    $wLambda = $lambda + 1.0 ;
    if ($lambda <= 8.0) {
        //  Use direct method, adding waiting times
        $x = 1.0 ;
        for ($j = 1; $j <= $wLambda; $j++) {
            $x *= random_0_1() ;
        }
        $x = -log($x) ;
    } else {
        //  Use rejection method
        do {
            do {
                //  Generate the tangent of a random angle, the equivalent of
                //      $y = tan(pi * random_0_1())
                do {
                    $v1 = random_0_1() ;
                    $v2 = random_PN() ;
                } while (($v1 * $v1 + $v2 * $v2) > 1.0) ;
                $y = $v2 / $v1 ;
                $s = sqrt(2.0 * $lambda + 1.0) ;
                $x = $s * $y + $lambda ;
            //  Reject in the region of zero probability
            } while ($x <= 0.0) ;
            //  Ratio of probability function to comparison function
            $e = (1.0 + $y * $y) * exp($lambda * log($x / $lambda) - $s * $y) ;
        //  Reject on the basis of a second uniform deviate
        } while (random_0_1() > $e) ;
    }

    return $x ;
}   //  function gamma()


function gammaWeightedRandom( $LowValue,
                              $maxRand )
{
    do {
        $rand_val = gamma() / 12 ;
    } while ($rand_val > 1) ;

    return floor($rand_val * $maxRand) + $LowValue ;
}   //  function gammaWeightedRandom()


function QaDgammaWeightedRandom( $LowValue,
                                 $maxRand )
{
    return round((asin(random_0_1()) + (asin(random_0_1()))) * $maxRand / pi()) + $LowValue ;
}   //  function QaDgammaWeightedRandom()


function gammaln($in)
{
    $tmp = $in + 4.5 ;
    $tmp -= ($in - 0.5) * log($tmp) ;

    $ser = 1.000000000190015
            + (76.18009172947146 / $in)
            - (86.50532032941677 / ($in + 1.0))
            + (24.01409824083091 / ($in + 2.0))
            - (1.231739572450155 / ($in + 3.0))
            + (0.1208650973866179e-2 / ($in + 4.0))
            - (0.5395239384953e-5 / ($in + 5.0)) ;

    return (log(2.5066282746310005 * $ser) - $tmp) ;
}   //  function gammaln()


function poisson( $lambda=1.0 )
{
    static $oldLambda ;
    static $g, $sq, $alxm ;

    if ($lambda <= 12.0) {
        //  Use direct method
        if ($lambda <> $oldLambda) {
            $oldLambda = $lambda ;
            $g = exp(-$lambda) ;
        }
        $x = -1 ;
        $t = 1.0 ;
        do {
            ++$x ;
            $t *= random_0_1() ;
        } while ($t > $g) ;
    } else {
        //  Use rejection method
        if ($lambda <> $oldLambda) {
            $oldLambda = $lambda ;
            $sq = sqrt(2.0 * $lambda) ;
            $alxm = log($lambda) ;
            $g = $lambda * $alxm - gammaln($lambda + 1.0) ;
        }
        do {
            do {
                //  $y is a deviate from a Lorentzian comparison function
                $y = tan(pi() * random_0_1()) ;
                $x = $sq * $y + $lambda ;
            //  Reject if close to zero probability
            } while ($x < 0.0) ;
            $x = floor($x) ;
            //  Ratio of the desired distribution to the comparison function
            //  We accept or reject by comparing it to another uniform deviate
            //  The factor 0.9 is used so that $t never exceeds 1
            $t = 0.9 * (1.0 + $y * $y) * exp($x * $alxm - gammaln($x + 1.0) - $g) ;
        } while (random_0_1() > $t) ;
    }

    return $x ;
}   //  function poisson()


function poissonWeightedRandom( $LowValue,
                                $maxRand )
{
    do {
        $rand_val = poisson() / $maxRand ;
    } while ($rand_val > 1) ;

    return floor($x * $maxRand) + $LowValue ;
}   //  function poissonWeightedRandom()


function binomial( $lambda=6.0 )
{
}


function domeWeightedRandom( $LowValue,
                             $maxRand )
{
    return floor(sin(random_0_1() * (pi() / 2)) * $maxRand) + $LowValue ;
}   //  function bellWeightedRandom()


function sawWeightedRandom( $LowValue,
                            $maxRand )
{
    return floor((atan(random_0_1()) + atan(random_0_1())) * $maxRand / (pi()/2)) + $LowValue ;
}   //  function sawWeightedRandom()


function pyramidWeightedRandom( $LowValue,
                               $maxRand )
{
    return floor((random_0_1() + random_0_1()) / 2 * $maxRand) + $LowValue ;
}   //  function pyramidWeightedRandom()


function linearWeightedRandom( $LowValue,
                               $maxRand )
{
    return floor(random_0_1() * ($maxRand)) + $LowValue ;
}   //  function linearWeightedRandom()


function nonWeightedRandom( $LowValue,
                            $maxRand )
{
    return rand($LowValue,$maxRand+$LowValue-1) ;
}   //  function nonWeightedRandom()




function weightedRandom( $Method,
                         $LowValue,
                         $maxRand )
{
    switch($Method) {
        case RandomGaussian         :
            $rVal = gaussianWeightedRandom( $LowValue, $maxRand ) ;
            break ;
        case RandomBell             :
            $rVal = bellWeightedRandom( $LowValue, $maxRand ) ;
            break ;
        case RandomGaussianRising   :
            $rVal = gaussianWeightedRisingRandom( $LowValue, $maxRand ) ;
            break ;
        case RandomGaussianFalling  :
            $rVal = gaussianWeightedFallingRandom( $LowValue, $maxRand ) ;
            break ;
        case RandomGamma            :
            $rVal = gammaWeightedRandom( $LowValue, $maxRand ) ;
            break ;
        case RandomGammaQaD         :
            $rVal = QaDgammaWeightedRandom( $LowValue, $maxRand ) ;
            break ;
        case RandomLogarithmic10    :
            $rVal = logarithmic10WeightedRandom( $LowValue, $maxRand ) ;
            break ;
        case RandomLogarithmic      :
            $rVal = logarithmicWeightedRandom( $LowValue, $maxRand ) ;
            break ;
        case RandomPoisson          :
            $rVal = poissonWeightedRandom( $LowValue, $maxRand ) ;
            break ;
        case RandomDome             :
            $rVal = domeWeightedRandom( $LowValue, $maxRand ) ;
            break ;
        case RandomSaw              :
            $rVal = sawWeightedRandom( $LowValue, $maxRand ) ;
            break ;
        case RandomPyramid          :
            $rVal = pyramidWeightedRandom( $LowValue, $maxRand ) ;
            break ;
        case RandomLinear           :
            $rVal = linearWeightedRandom( $LowValue, $maxRand ) ;
            break ;
        default                     :
            $rVal = nonWeightedRandom( $LowValue, $maxRand ) ;
            break ;
    }

    return $rVal;
}

?>
Mark Baker
Thanks for the code. However, I tried looking up all of the methods you provided, and I failed to see any that seemed to fit my model. Stats was never my strong point. If you could point out a model that you think may fit, I'd be all ears... Thanks...
ircmaxell
One option would be to try generating a series of values and plotting them on a graph using each of the different pre-defined distributions to see what the curves look like. Wikipedia also has extensive entries on a many of those distributions..... though for what you describe (if I've interpreted it correctly) try gaussianWeightedRisingRandom if you want more upper range values, or gaussianWeightedFallingRandom if you want more lower range values... though poisson is often a useful method for many real-world situations
Mark Baker
Ok, I tried each. The GaussianWeightedFallingRandom is closest, but it still isn't falling off nearly fast enough (200 instead of 5 over 40, 5700 instead of 50 over 10, and 9500 instead of 1000 over 1. I've tried csch and it looks much closer (as it matches the high ranges), but falls off too fast in the middle. Thoughts?
ircmaxell
In that case, look to use gaussianWeightedRandom( $LowValue, $maxRand, $mean, $stddev ) but setting your own values for mean and standard deviation, or modify the call to gauss() in GaussianWeightedFallingRandom() to call gauss_ms( $mean, $stddev ) with your own values for mean and standard deviation. It may take some experimentation.... but look at the wikipedia page to see how changes to these parameters affect the shape of the curve
Mark Baker
+3  A: 

Look at distributions used in reliability analysis - they tend to have these long tails. A relatively simply possibility is the Weibull distribution with P(X>x)=exp[-(x/b)^a].

Fitting your values as P(X>1)=0.1 and P(X>10)=0.005, I get a=0.36 and b=0.1. This would imply that P(X>40)*10000=1.6, which is a bit too low, but P(X>70)*10000=0.2 which is reasonable.

EDIT Oh, and to generate a Weibull-distributed random variable from a uniform(0,1) value U, just calculate b*[-log(1-u)]^(1/a). This is the inverse function of 1-P(X>x) in case I miscalculated something.

Aniko
Wow, that looks almost identical to the result set I'm after (4 > 40, 60 > 10, 1030 > 1). Excellent! Thanks!
ircmaxell
A: 

This naive way of doing it will most probably skew the distribution in some way I can't see right now. The idea is simply to iterate over your first dataset, sorted and as pairs. Then randomize 15 new numbers inbetween each pair to get the new array.

Ruby example, since I don't speak much PHP. Hopefully such a simple idea should be easy to translate into PHP.

numbers=[0.1,0.1,0.12,0.13,0.15,0.17,0.3,0.4,0.42,0.6,1,3,5,7,13,19,27,42,69]
more_numbers=[]
numbers.each_cons(2) { |a,b| 15.times { more_numbers << a+rand()*(b-a) } }
more_numbers.sort!
Jonas Elfström
A: 

The easiest (but not very efficient) way to generate random numbers that follow a given distribution is a technique called Von Neumann Rejection.

The simple explination of the technique is this. Create a box that completely encloses your distribution. (lets call your distribution f) Then pick a random point (x,y) in the box. If y < f(x), then use x as a random number. If y > f(x), then discard both x and y and pick another point. Continue until you have a sufficient amount of values to use. The values of x that you don't reject will be distributed according to f.

demeteloaf
Unless I'm mistaken, isn't that just getting random points under a curve defined by `f(x)`? Considering that my curve looks hyperbolic, the greatest density of points would be around the origin, so wouldn't the generated numbers be skewed towards the middle of the bounded box that's created between the origin and the vertex (and hence not favor lower numbers as I need it to)?
ircmaxell