ansaurus

Question

Answer 1

+1 A:

start with the model y = x + e where e is the error (a normal random variable). e should have a mean of 0 and variance k.

long story short, you can write a formula for the expected value of the Pearson in terms of k, and solve for k. note, you cannot randomly generate data with the Pearson exactly equal to a specific value, only with the expected Pearson of a specific value.

i'll try to come back and edit this post to include a closed form solution when i have access to some paper.

EDIT: ok, i have a hand-wavy solution that is probably correct (but will require testing to confirm). for now, assume desired Pearson = p > 0 (you can figure out the p < 0 case). like i mentioned earlier, set your model for Y = X + E (X is uniform, E is normal).

sample to get your x's
compute var(x)
the variance of E should be: (1/(rsd(x)))^2 - var(x)
generate your y's based on your x's and sample from your normal random variable E

for p < 0, set Y = -X + E. proceed accordingly.

basically, this follows from the definition of Pearson: cov(x,y)/var(x)*var(y). when you add noise to the x's (Y = X + E), the expected covariance cov(x,y) should not change from that with no noise. the var(x) does not change. the var(y) is the sum of var(x) and var(e), hence my solution.

SECOND EDIT: ok, i need to read definitions better. the definition of Pearson is cov(x, y)/(sd(x)sd(y)). from that, i think the true value of var(E) should be (1/(rsd(x)))^2 - var(x). see if that works.

twolfe18 2009-11-11 21:19:53

Good point about error -- so far I've been defining an acceptable error range for Pearson (usually +- 0.005) and regenerating values until the Pearson of the entire set of x/y values falls within the error range.

Gideon 2009-11-11 21:35:18

Brilliant idea, I will try this now and see if it works.

Gideon 2009-11-11 22:52:38

maybe i am misunderstanding, but it seems to me that Y is not uniformly distributed. it's the sum of two values, one of which (X) is uniformly distributed and the other of which (E) is normally distributed. this is not easy to fix - it's hard to think of a distribution when, when added to a uniform distribution, gives another uniform distribution (the only example i can think of is a constant).

andrew cooke 2009-11-11 23:34:30

@andrew cookie, Y is uniformly distributed! you can have two uniformly distributed variables that are correlated (generalization of Pearson), it all depends on the conditional distribution Y|X.

twolfe18 2009-11-12 02:29:05

Are you sure about step 3? 1/(p*var(x)) - var(x) is yielding very large values for the variance of E

Gideon 2009-11-12 05:36:52

If Y = X + E, where X is uniform and E is normal, there is no way Y is uniform.

Mathias 2009-11-12 07:59:56

@twolfe - i agree you can have two correlated uniform variables. that is not my objection. my objection is that you are trying to generate a uniformly distributed value by adding a normally distributed value to a uniformly distributed value.

andrew cooke 2009-11-12 08:50:23

@mathias - you're right, Y will not really be uniform, it will be somewhere between uniform and normal, depending on ratio of variance in X versus in E. i was thinking of the case where Pearson is close to 1 and the variance in E is small compared to that in X (which would make Y approximately uniform).

twolfe18 2009-11-12 15:27:31

@andrew cooke - my goal is not to get Y to be uniform per se. my only goal is to satisfy the expected Pearson requirement. the only part of my method that might require Y to be uniform is the assumption that cov(x, y) = 1 (this is the numerator in 1/(p*var(x)) - var(x)). i think when Pearson is close to 1 this is an ok assumption to make. if you don't make this assumption, i think the complexity of the math would skyrocket, but thats just my 2 cents.

twolfe18 2009-11-12 15:37:49

not sure i follow, but the math is pretty easy if x and y can both be *normally* distributed. i can try to explain if you want.

andrew cooke 2009-11-12 18:33:52

oh, sorry, you already have that. i don't understand what you want.

andrew cooke 2009-11-12 18:40:19

@ twolfe, quoted from the original question: "Here is the kicker: Array X and Array Y must be uniform distributions."

Mathias 2009-11-12 19:13:48

Answer 2

+1 A:

Here is an implementation of of twolfe18's algorithm written in Actionscript 3:

for (var j:int=0; j < size; j++) {
 xValues[i]=Math.random());
}
var varX:Number = Util.variance(xValues);
var varianceE:Number = 1/(r*varX) - varX;

for (var i:int=0; i < size; i++) {
 yValues[i] = xValues[i] + boxMuller(0, Math.sqrt(varianceE));
}

boxMuller is just a method that generates a random Gaussian with the arguments (mean, stdDev). size is the size of the distribution.

Sample output

Target p: 0.8
Generated p: 0.04846346291280387
variance of x distribution: 0.0707786253165176
varianceE: 17.589920412141158

As you can see I'm still a ways off. Any suggestions?

Gideon 2009-11-12 05:51:42

you said that y had to be uniformly distributed. can you not see that y in the code above cannot be uniform?the reason this is easy for gaussians is because you can generate two independent values and then introduce an arbitrary correlation by creating two new values from a linear combination - effectively a change of coordinate system. that is possible with normally distributed values because adding two normal distributions gives another normal distribution. that is not the case for uniformly distributed values.

andrew cooke 2009-11-12 08:48:25

see my second edit...

twolfe18 2009-11-12 15:48:36

Answer 3

+1 A:

This apparently simple question has been messing up with my mind since yesterday evening! I looked for the topic of simulating distributions with a dependency, and the best I found is this: simulate dependent random variables. The gist of it is, you can easily simulate 2 normals with given correlation, and they outline a method to transform these non-independent normals, but this won't preserve correlation. The correlation of the transform will be correlated, so to speak, but not identical. See the paragraph "Rank correlation coefficents".

Edit: from what I gather from the second part of the article, the copula method would allow you to simulate / generate random variables with rank correlation.

Mathias 2009-11-12 18:15:53

Answer 4

+1 A:

To get a correlation of 1 both X and Y should be the same, so copy X to Y and you have a correlation of 1. To get a -1 correlation, make Y = 1 - X. (assuming X values are [0,1])

Peter Lawrey 2009-11-17 20:12:08

Answer 5

A:

A strange problem demands a strange solution -- here is how I solved it.

-Generate array X

-Clone array X to Create array Y

-Sort array X (you can use whatever method you want to sort array X -- quicksort, heapsort anything stable.)

-Measure the starting level of pearson's R with array X sorted and array Y unsorted.

WHILE the correlation is outside of the range you are hoping for

   IF the correlation is to low
         run one iteration of CombSort11 on array Y then recheck correlation
   ELSE IF the correlation is too high
         randomly swap two values and recheck correlation

And thats it! Combsort is the real key, it has the effect of increasing the correlation slowly and steadily. Check out Jason Harrison's demo to see what I mean. To get a negative correlation you can invert the sort or invert one of the arrays after the whole process is complete.

Here is my implementation in AS3:

public static function nextReliableCorrelatedUniforms(r:Number, size:int, error:Number):Array {
  var yValues:Array = new Array;
  var xValues:Array = new Array;
  var coVar:Number = 0;
  for (var e:int=0; e < size; e++) { //create x values     
   xValues.push(Math.random());
 }
 yValues = xValues.concat();
 if(r != 1.0){
  xValues.sort(Array.NUMERIC);
 }
 var trueR:Number = Util.getPearson(xValues, yValues);

  while(Math.abs(trueR-r)>error){
   if (trueR < r-error){ // combsort11 for y  
    var gap:int = yValues.length;
       var swapped:Boolean = true; 
       while (trueR <= r-error) {
           if (gap > 1) {
               gap = Math.round(gap / 1.3);
           }
           var i:int = 0;
           swapped = false;
           while (i + gap < yValues.length  &&  trueR <= r-error) {
               if (yValues[i] > yValues[i + gap]) {
                   var t:Number = yValues[i];
                   yValues[i] = yValues[i + gap];
                   yValues[i + gap] = t;
                   trueR = Util.getPearson(xValues, yValues)
                   swapped = true;
               }
               i++;
           }
       }
   }

   else { // decorrelate
    while (trueR >= r+error) {
     var a:int = Random.randomUniformIntegerBetween(0, size-1);
     var b:int = Random.randomUniformIntegerBetween(0, size-1);
                 var temp:Number = yValues[a];
                 yValues[a] = yValues[b];
                 yValues[b] = temp;
     trueR = Util.getPearson(xValues, yValues)
      }    
   }
  }
  var correlates:Array = new Array;
  for (var h:int=0; h < size; h++) {
   var pair:Array = new Array(xValues[h], yValues[h]);
   correlates.push(pair);}
     return correlates;
 }

Gideon 2009-11-19 20:35:33

Answer 6

+1 A:

I ended up writing a short paper on this

It doesn't include your sorting method (although in practice I think it's similar to my first method, in a roundabout way), but does describe two ways that don't require iteration.

andrew cooke 2009-11-22 22:33:23

Wow, this is excellent work! One of the limitations for my use is that the distribution must not have a distinct upper or lower boundary, so I was unable to use any geometric transformation. I do wonder if there is a non iterative solution...

Gideon 2009-11-24 23:00:47

ansaurus

tags:

views:

answers:

Generating correlated numbers

related questions