A: 

Express the distribution x_i as a linear combination of some independent basis distributions f_j: x_i = a_i1*f_1 + a_i2*f_2 + ... . Let us constrain the f_j to be independent variables, uniformly distributed either on 0..1 (continuous case) or on {0,1} (discrete case). Let us now express everything we know in matrix form:

Let X be the vector (x_1, x_2, ..., x_n)
Let A be the matrix (a_ij) of dimension (n,k) (n rows, k columns)
Let F be the vector (f_1, f_2, ..., f_k)
Let P be the vector (p_1, p_2, ..., p_n)
Let R be the matrix (E[x_i x_j]) for i,j = 1..n
Definition of the X distribution: X = A * F
Constraint on the means of the individual x_i (E[f_j] = 1/2 in both cases): 2P = A * (1 ..k times.. 1)
Correlation constraint: A*AT = 3R, or 2R in the discrete case, because E[x_i x_j] =
  E[(a_i1*f_1 + a_i2*f_2 + ...)*(a_j1*f_1 + a_j2*f_2 + ...)] =
  E[sum over p,q: a_ip*a_jq*f_p*f_q] =
  (for p /= q, independence gives E[f_p*f_q] = E[f_p]*E[f_q]; these cross terms vanish when the f_p are centered to mean zero)
  E[sum over p: a_ip*a_jp*f_p^2] =
  sum over p: a_ip*a_jp*E[f_p^2] = (since E[f_p^2] = 1/3 for uniform 0..1 and 1/2 in the discrete case)
  (1/3 or 1/2) * sum over p: a_ip*a_jp
And the matrix whose (i,j) entry is that sum over p of a_ip*a_jp is precisely A*AT.

Now you need to solve the two equations:

A*AT      = 3R (or 2R in the discrete case)
A*(1...1) = 2P

Solving the first equation amounts to finding a square root of the matrix 3R (or 2R). See for example http://en.wikipedia.org/wiki/Cholesky_factorization and, more generally, http://en.wikipedia.org/wiki/Square_root_of_a_matrix . Something also should be done about the second one :)
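
For illustration, here is a minimal NumPy sketch of that square-root step, assuming the target matrix 3R is symmetric positive definite (the R below is a made-up example, not from the question):

    import numpy as np

    # Made-up example of a second-moment matrix R; 3R must be symmetric
    # positive definite for the Cholesky route to apply.
    R = np.array([[0.50, 0.30, 0.20, 0.10],
                  [0.30, 0.50, 0.30, 0.20],
                  [0.20, 0.30, 0.50, 0.30],
                  [0.10, 0.20, 0.30, 0.50]])

    A = np.linalg.cholesky(3 * R)       # lower-triangular A with A @ A.T == 3*R
    assert np.allclose(A @ A.T, 3 * R)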

I ask mathematicians around to correct me, because I may very well have mixed AT*A with A*AT or done something even more wrong.

To generate a value of x_i as a linear mixture of the basis distributions, use a two-step process: 1) use a uniform random variable to choose one of the basis distributions, weighted by the corresponding probability; 2) generate the result using the chosen basis distribution.
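
A sketch of that two-step process in Python; the weights and the basis samplers here are placeholders of my own choosing, not part of the scheme above:

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_mixture(weights, samplers):
        # Step 1: pick a basis distribution with the corresponding probability.
        j = rng.choice(len(samplers), p=weights)
        # Step 2: draw the result from the chosen basis distribution.
        return samplers[j]()

    # Placeholder basis: a U(0,1) draw and a fair coin flip, mixed 70/30.
    weights = [0.7, 0.3]
    samplers = [lambda: rng.random(), lambda: rng.integers(0, 2)]
    value = sample_mixture(weights, samplers)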

jkff
Unfortunately, the continuous -> discrete transition is often the hardest part. For example, the problem of finding a Hadamard matrix gets a _lot_ easier if complex entries are allowed. I don't see any way to discretize your solution within the given framework.
Why should it be hard? The solution just depends on the resulting distribution being a linear mixture of basis distributions but I don't see how it depends on their continuity. Is it that discrete distributions can't easily be linearly mixed?
jkff
In this case it's the fact that your continuous distributions aren't convex combinations of Bernoulli trials.
I'm sorry, I don't quite understand. I know what a convex combination is and what Bernoulli trials are, but still: I've edited my post; doesn't the process described in the last paragraph give a correct result? If so, could you point me to some sources expanding on your point? (Anyway, I should probably just implement the stuff I described and see if it works.)
jkff
The problem is that A may have entries that are not between 0 and 1.
A: 

The brute force solution is to express the constraints of the problem as a linear program with 2^N variables pr(w), where w ranges over all binary strings of length N. First, the constraint that pr be a probability distribution:

for all w: 0 <= pr(w) <= 1
sum_w pr(w) = 1

Second, the constraint that the expectation of each variable be p:

for all i: sum_{w such that w[i] = 1} pr(w) = p

Third, the covariance constraints:

for all i < j: sum_{w such that w[i] = w[j] = 1} pr(w) = const * |j - i|^-alpha + p^2

(the left-hand side is E[x_i x_j], i.e. the target covariance plus p^2)

This is very slow, but a cursory literature search turned up nothing better. If you decide to implement it, here are some LP solvers with Python bindings: http://wiki.python.org/moin/NumericAndScientific/Libraries
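
As a toy-sized illustration, the LP above could be set up with scipy.optimize.linprog roughly like this (N, p, const, and alpha are placeholder values of mine):

    import itertools
    import numpy as np
    from scipy.optimize import linprog

    N, p = 3, 0.5                        # placeholder problem size and marginal
    const, alpha = 0.1, 1.0              # placeholder correlation parameters
    cov = lambda d: const * d ** -alpha  # target Cov(x_i, x_j) for |j - i| = d

    words = list(itertools.product([0, 1], repeat=N))  # all 2^N outcomes w
    A_eq, b_eq = [], []

    A_eq.append([1.0] * len(words)); b_eq.append(1.0)  # sum_w pr(w) = 1
    for i in range(N):                                 # E[x_i] = p
        A_eq.append([float(w[i]) for w in words]); b_eq.append(p)
    for i in range(N):                                 # E[x_i x_j] = cov + p^2
        for j in range(i + 1, N):
            A_eq.append([float(w[i] & w[j]) for w in words])
            b_eq.append(cov(j - i) + p * p)

    res = linprog(np.zeros(len(words)), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, 1)] * len(words))
    print(res.status, res.x)  # status 0 means a feasible pr(w) was found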

I don't know linear programming, but I do not see how this will work. Any configuration of the binary series will have a non-zero probability. Is it possible to calculate the probability of any configuration?
jonalm
Yes, if the problem is solvable, then the LP routine will give you the probability of each of the 2^N configurations.
+1  A: 

A quick search at RSeek reveals that R has packages to do this.

Dirk Eddelbuettel
That was my first idea too, but I doubt they could handle N=1000-10000 per jonalm's statement in the comments.
Aniko
I don't know R, but this seems like a reason to learn it :) Thanks.
jonalm
+2  A: 
Jason S
It's actually |i-j|^-alpha; the solution for alpha^|i-j| is in the literature.
hmmm... |i-j|^-alpha has no solution for i=j. Are we sure the OP did not mis-state?
Jason S
It can be Corr[x_i, x_j] = const * |i-j|^-alpha for i != j, or Corr[x_i, x_j] = (|i-j|+1)^-alpha (whichever is easiest). Either way, I'm not claiming that they are equal; I'm only interested in the tail behavior (|i-j| >> 1), so it should not matter.
jonalm
Thank you so much Jason. Although it was a different correlation, the solutions were really interesting.
jonalm
A: 

Here's an intuitive / experimental approach that seems to work.

If b is a binary r.v., m is the mean of the binary r.v., c is the correlation you want, rand() generates a U(0,1) r.v., and d is the correlated binary r.v. you want:

d = if(rand() < c, b, if(rand() < m, 1, 0))

That is, if a uniform r.v. is less than the desired correlation, d = b. Otherwise d is a fresh random binary number with the same mean m.
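
In NumPy the procedure might look like this (my sketch, not the original code):

    import numpy as np

    rng = np.random.default_rng(1)
    m, c, n = 0.5, 0.4, 2000            # mean, target correlation, column length

    b = (rng.random(n) < m).astype(float)       # base binary column
    fresh = (rng.random(n) < m).astype(float)   # independent binary column
    d = np.where(rng.random(n) < c, b, fresh)   # copy b w.p. c, else use fresh

    print(np.corrcoef(b, d)[0, 1])      # should come out near c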

I ran this 1000 times for a column of 2000 binary r.v.s with m = .5 and with c = .4 and c = .5. The mean of the resulting correlations was exactly as specified, and their distribution appeared to be normal. For a correlation of 0.4, the standard deviation of the correlation was 0.02.

Sorry - I can't prove that this works all the time, but you have to admit, it sure is easy.

Grembo
Reread the question: that's not the right correlation structure.
+2  A: 

Thanks for all your input. I found an answer to my question in the cute little article by Chul Gyu Park et al., so in case anyone runs into the same problem, look up:

"A simple method for Generating Correlated Binary Variates" (jstor.org.stable/2684925)

for a simple algorithm. The algorithm works if all the elements in the correlation matrix are positive and for a general marginal distribution Pr(x_i = 1) = p_i.

j

jonalm