Having a dataset and calculating statistics from it is easy. How about the other way around?

Let's say I know some variable has an average X and a standard deviation Y, and I assume it has a normal (Gaussian) distribution. What would be the best way to generate a "random" dataset (of arbitrary size) that fits this distribution?

EDIT: This kind of develops from this question; I could make something based on that method, but I am wondering if there's a more efficient way to do it.

+6  A: 

You can generate standard normal random variables with the Box-Muller method. Then, to give them mean mu and standard deviation sigma, multiply your samples by sigma and add mu. That is, for each z from the standard normal, return mu + sigma*z.
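To make the two steps concrete, here is a minimal Python sketch of the Box-Muller transform followed by the mu + sigma*z scaling. The function names `box_muller` and `normal_samples` and the `seed` parameter are illustrative choices, not part of the answer above.

```python
import math
import random


def box_muller(rng):
    """Return one pair of independent standard normal samples."""
    u1 = 1.0 - rng.random()  # lies in (0, 1], so log(u1) is finite
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)


def normal_samples(n, mu, sigma, seed=None):
    """Return n samples from N(mu, sigma^2), each computed as mu + sigma * z."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        z0, z1 = box_muller(rng)
        out.append(mu + sigma * z0)
        if len(out) < n:  # keep the second value only if it is still needed
            out.append(mu + sigma * z1)
    return out


samples = normal_samples(100000, mu=4.0, sigma=2.0, seed=1)
```

In practice Python's standard library already offers `random.gauss(mu, sigma)`, so the hand-written transform is mainly useful for seeing how the method works.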

John D. Cook
A: 

You could make it a kind of Monte Carlo simulation. Start with a wide random "acceptable range" and generate a few truly random values. Check your statistics and see if the average and variance are off. Adjust the "acceptable range" for the random values and add a few more values. Repeat until you have hit both your requirements and your population sample size.

Just off the top of my head, let me know what you think. :-)

eruciform
A: 

It is easy to generate a dataset with a normal distribution (see http://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform ).
Remember that the generated sample will not have a mean of exactly 0 and a standard deviation of exactly 1! You need to standardize it: subtract the sample mean and then divide by the sample standard deviation. Then you are free to transform this sample to a normal distribution with the given parameters: multiply by the desired standard deviation and then add the desired mean.
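A rough Python sketch of this standardize-then-rescale step, assuming the sample standard deviation (n - 1 denominator) is the one to match; the name `exact_moment_sample` and the `seed` parameter are made up for illustration.

```python
import random
import statistics


def exact_moment_sample(n, mu, sigma, seed=None):
    """Return n values whose sample mean is mu and sample std is sigma."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]  # raw draws, needs n >= 2
    m = statistics.fmean(z)
    s = statistics.stdev(z)  # sample standard deviation (n - 1 denominator)
    # Standardize to mean 0 and std 1, then rescale to the target parameters.
    return [mu + sigma * (v - m) / s for v in z]


data = exact_moment_sample(1000, mu=100.0, sigma=15.0, seed=42)
```

Note that forcing the sample moments to match exactly means the values are no longer independent draws from the target distribution, which may or may not matter for your use.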

Tomek Tarczynski
+2  A: 

There are several methods to generate Gaussian random variables. The standard method is Box-Muller, which was mentioned earlier. A somewhat faster method is the Ziggurat algorithm:

http://en.wikipedia.org/wiki/Ziggurat_algorithm

Here's the Wikipedia reference on generating Gaussian variables:

http://en.wikipedia.org/wiki/Normal_distribution#Generating_values_from_normal_distribution

Joel
+2  A: 

I'll give an example using R and the 2nd algorithm in the list here.

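X <- 4; Y <- 2  # target mean and standard deviation
# The sum of 12 uniform(0,1) draws minus 6 is approximately N(0,1); scale by Y and shift by X
z <- replicate(100000, (sum(runif(12)) - 6) * Y + X)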

plot(density(z))
> mean(z)
[1] 4.002347

> sd(z)
[1] 2.005114

> library(fUtilities)

> skewness(z,method ="moment")
[1] -0.003924771
attr(,"method")
[1] "moment"

> kurtosis(z,method ="moment")
[1] 2.882696
attr(,"method")
[1] "moment"
gd047
+1  A: 

This is really easy to do in Excel with the norminv() function. Example:

=norminv(rand(), 100, 15)

would generate a value from a normal distribution with mean of 100 and stdev of 15 (human IQs). Drag this formula down a column and you have as many values as you want.
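For comparison, the same inverse-CDF idea (feed a uniform random number into the inverse normal CDF) can be sketched in Python with the standard library's `statistics.NormalDist`, available from Python 3.8. The function name `norminv_samples` and the `seed` parameter are illustrative assumptions.

```python
import random
from statistics import NormalDist


def norminv_samples(n, mu, sigma, seed=None):
    """Return n normal samples via the inverse CDF, like =norminv(rand(), mu, sigma)."""
    rng = random.Random(seed)
    dist = NormalDist(mu, sigma)
    out = []
    for _ in range(n):
        u = rng.random()
        while u == 0.0:  # inv_cdf requires 0 < u < 1
            u = rng.random()
        out.append(dist.inv_cdf(u))
    return out


iq_scores = norminv_samples(50000, mu=100.0, sigma=15.0, seed=7)
```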

el chief
+1 for no programming required
quantumSoup