views:

565

answers:

6

Here is a fun one: I need to generate random x/y pairs that are correlated at a given value of Pearson product moment correlation coefficient, or Pearson r. You can imagine this as two arrays, array X and array Y, where the values of array X and array Y must be re-generated, re-ordered or transformed until they are correlated with each other at a given level of Pearson r. Here is the kicker: Array X and Array Y must be uniform distributions.

I can do this with a normal distribution, but transforming the values without skewing the distribution has me stumped. I tried re-ordering the values in the arrays to increase the correlation, but I will never get arrays correlated at 1.00 or -1.00 just by sorting.

Any ideas?

--

here is the AS3 code for random correlated gaussians, to get the wheels turning:

public static function nextCorrelatedGaussians(r:Number):Array{             
         var d1:Number;
         var d2:Number;
         var n1:Number;
         var n2:Number;
         var lambda:Number;
         var r:Number;
         var arr:Array = new Array();
         var isNeg:Boolean; 

        if (r<0){
            r *= -1;
              isNeg=true;
        }            
        lambda= (   (r*r)  -  Math.sqrt(  (r*r) - (r*r*r*r)  )     )   /   ((  2*r*r ) - 1  );

        n1 = nextGaussian();
        n2 = nextGaussian();           
        d1 = n1;            
        d2 = ((lambda*n1) + ((1-lambda)*n2)) / Math.sqrt( (lambda*lambda) + (1-lambda)*(1-lambda));

        if (isNeg) {d2*= -1}           
        arr.push(d1);
        arr.push(d2);
        return arr;
    }
+1  A: 

start with the model y = x + e where e is the error (a normal random variable). e should have a mean of 0 and variance k.

long story short, you can write a formula for the expected value of the Pearson in terms of k, and solve for k. note, you cannot randomly generate data with the Pearson exactly equal to a specific value, only with the expected Pearson of a specific value.

i'll try to come back and edit this post to include a closed form solution when i have access to some paper.

EDIT: ok, i have a hand-wavy solution that is probably correct (but will require testing to confirm). for now, assume desired Pearson = p > 0 (you can figure out the p < 0 case). like i mentioned earlier, set your model for Y = X + E (X is uniform, E is normal).

  1. sample to get your x's
  2. compute var(x)
  3. the variance of E should be: (1/(rsd(x)))^2 - var(x)
  4. generate your y's based on your x's and sample from your normal random variable E

for p < 0, set Y = -X + E. proceed accordingly.

basically, this follows from the definition of Pearson: cov(x,y)/var(x)*var(y). when you add noise to the x's (Y = X + E), the expected covariance cov(x,y) should not change from that with no noise. the var(x) does not change. the var(y) is the sum of var(x) and var(e), hence my solution.

SECOND EDIT: ok, i need to read definitions better. the definition of Pearson is cov(x, y)/(sd(x)sd(y)). from that, i think the true value of var(E) should be (1/(rsd(x)))^2 - var(x). see if that works.

twolfe18
Good point about error -- so far I've been defining an acceptable error range for Pearson (usually +- 0.005) and regenerating values until the Pearson of the entire set of x/y values falls within the error range.
Gideon
Brilliant idea, I will try this now and see if it works.
Gideon
maybe i am misunderstanding, but it seems to me that Y is not uniformly distributed. it's the sum of two values, one of which (X) is uniformly distributed and the other of which (E) is normally distributed. this is not easy to fix - it's hard to think of a distribution when, when added to a uniform distribution, gives another uniform distribution (the only example i can think of is a constant).
andrew cooke
@andrew cookie, Y is uniformly distributed! you can have two uniformly distributed variables that are correlated (generalization of Pearson), it all depends on the conditional distribution Y|X.
twolfe18
Are you sure about step 3? 1/(p*var(x)) - var(x) is yielding very large values for the variance of E
Gideon
If Y = X + E, where X is uniform and E is normal, there is no way Y is uniform.
Mathias
@twolfe - i agree you can have two correlated uniform variables. that is not my objection. my objection is that you are trying to generate a uniformly distributed value by adding a normally distributed value to a uniformly distributed value.
andrew cooke
@mathias - you're right, Y will not really be uniform, it will be somewhere between uniform and normal, depending on ratio of variance in X versus in E. i was thinking of the case where Pearson is close to 1 and the variance in E is small compared to that in X (which would make Y approximately uniform).
twolfe18
@andrew cooke - my goal is not to get Y to be uniform per se. my only goal is to satisfy the expected Pearson requirement. the only part of my method that might require Y to be uniform is the assumption that cov(x, y) = 1 (this is the numerator in 1/(p*var(x)) - var(x)). i think when Pearson is close to 1 this is an ok assumption to make. if you don't make this assumption, i think the complexity of the math would skyrocket, but thats just my 2 cents.
twolfe18
not sure i follow, but the math is pretty easy if x and y can both be *normally* distributed. i can try to explain if you want.
andrew cooke
oh, sorry, you already have that. i don't understand what you want.
andrew cooke
@ twolfe, quoted from the original question: "Here is the kicker: Array X and Array Y must be uniform distributions."
Mathias
+1  A: 

Here is an implementation of of twolfe18's algorithm written in Actionscript 3:

for (var j:int=0; j < size; j++) {
 xValues[i]=Math.random());
}
var varX:Number = Util.variance(xValues);
var varianceE:Number = 1/(r*varX) - varX;

for (var i:int=0; i < size; i++) {
 yValues[i] = xValues[i] + boxMuller(0, Math.sqrt(varianceE));
}

boxMuller is just a method that generates a random Gaussian with the arguments (mean, stdDev). size is the size of the distribution.

Sample output

Target p: 0.8
Generated p: 0.04846346291280387
variance of x distribution: 0.0707786253165176
varianceE: 17.589920412141158

As you can see I'm still a ways off. Any suggestions?

Gideon
you said that y had to be uniformly distributed. can you not see that y in the code above cannot be uniform?the reason this is easy for gaussians is because you can generate two independent values and then introduce an arbitrary correlation by creating two new values from a linear combination - effectively a change of coordinate system. that is possible with normally distributed values because adding two normal distributions gives another normal distribution. that is not the case for uniformly distributed values.
andrew cooke
see my second edit...
twolfe18
+1  A: 

This apparently simple question has been messing up with my mind since yesterday evening! I looked for the topic of simulating distributions with a dependency, and the best I found is this: simulate dependent random variables. The gist of it is, you can easily simulate 2 normals with given correlation, and they outline a method to transform these non-independent normals, but this won't preserve correlation. The correlation of the transform will be correlated, so to speak, but not identical. See the paragraph "Rank correlation coefficents".

Edit: from what I gather from the second part of the article, the copula method would allow you to simulate / generate random variables with rank correlation.

Mathias
+1  A: 

To get a correlation of 1 both X and Y should be the same, so copy X to Y and you have a correlation of 1. To get a -1 correlation, make Y = 1 - X. (assuming X values are [0,1])

Peter Lawrey
A: 

A strange problem demands a strange solution -- here is how I solved it.

-Generate array X

-Clone array X to Create array Y

-Sort array X (you can use whatever method you want to sort array X -- quicksort, heapsort anything stable.)

-Measure the starting level of pearson's R with array X sorted and array Y unsorted.

WHILE the correlation is outside of the range you are hoping for

   IF the correlation is to low
         run one iteration of CombSort11 on array Y then recheck correlation
   ELSE IF the correlation is too high
         randomly swap two values and recheck correlation

And thats it! Combsort is the real key, it has the effect of increasing the correlation slowly and steadily. Check out Jason Harrison's demo to see what I mean. To get a negative correlation you can invert the sort or invert one of the arrays after the whole process is complete.

Here is my implementation in AS3:

public static function nextReliableCorrelatedUniforms(r:Number, size:int, error:Number):Array {
  var yValues:Array = new Array;
  var xValues:Array = new Array;
  var coVar:Number = 0;
  for (var e:int=0; e < size; e++) { //create x values     
   xValues.push(Math.random());
 }
 yValues = xValues.concat();
 if(r != 1.0){
  xValues.sort(Array.NUMERIC);
 }
 var trueR:Number = Util.getPearson(xValues, yValues);

  while(Math.abs(trueR-r)>error){
   if (trueR < r-error){ // combsort11 for y  
    var gap:int = yValues.length;
       var swapped:Boolean = true; 
       while (trueR <= r-error) {
           if (gap > 1) {
               gap = Math.round(gap / 1.3);
           }
           var i:int = 0;
           swapped = false;
           while (i + gap < yValues.length  &&  trueR <= r-error) {
               if (yValues[i] > yValues[i + gap]) {
                   var t:Number = yValues[i];
                   yValues[i] = yValues[i + gap];
                   yValues[i + gap] = t;
                   trueR = Util.getPearson(xValues, yValues)
                   swapped = true;
               }
               i++;
           }
       }
   }

   else { // decorrelate
    while (trueR >= r+error) {
     var a:int = Random.randomUniformIntegerBetween(0, size-1);
     var b:int = Random.randomUniformIntegerBetween(0, size-1);
                 var temp:Number = yValues[a];
                 yValues[a] = yValues[b];
                 yValues[b] = temp;
     trueR = Util.getPearson(xValues, yValues)
      }    
   }
  }
  var correlates:Array = new Array;
  for (var h:int=0; h < size; h++) {
   var pair:Array = new Array(xValues[h], yValues[h]);
   correlates.push(pair);}
     return correlates;
 }
Gideon
+1  A: 

I ended up writing a short paper on this

It doesn't include your sorting method (although in practice I think it's similar to my first method, in a roundabout way), but does describe two ways that don't require iteration.

andrew cooke
Wow, this is excellent work! One of the limitations for my use is that the distribution must not have a distinct upper or lower boundary, so I was unable to use any geometric transformation. I do wonder if there is a non iterative solution...
Gideon