I'm trying to normalize some data which I have in a data frame. I want to take each value and run it through the pnorm function along with the mean and standard deviation of the column the value lives in. Using loops, here's how I would write out what I want to do:

#example data
hist_data<-data.frame(matrix(rnorm(200,mean=5,sd=.5),nrow=20))

n<-dim(hist_data)[2] #columns=10
k<-dim(hist_data)[1] #rows   =20

#set up the data frame which we will populate with a loop
normalized<-data.frame(matrix(nrow=k,ncol=n))

#hot loop in loop action
for (i in 1:n){
   for (j in 1:k){
      normalized[j,i]<-pnorm(hist_data[j,i],mean=mean(hist_data[,i]),sd=sd(hist_data[,i]))
   }  
}
normalized

It seems that in R there should be a handy dandy vectorized way of doing this. I thought I was smart, so I tried using the apply function:

#trouble ahead
hist_data<-data.frame(matrix(rnorm(200,mean=5,sd=.5),nrow=10))
normalized<-apply(hist_data,2,pnorm,mean=mean(hist_data),sd=sd(hist_data))
normalized

Much to my chagrin, that does NOT produce what I expected. The upper left and bottom right elements of the output are correct, but that's it. So how can I de-loopify my life?

Bonus points if you can tell me what my second code block is actually doing. Kind of a mystery to me still. :)

+3  A: 

You want:

normalize <- apply(hist_data, 2, function(x) pnorm(x, mean=mean(x), sd=sd(x)))

The problem is that you're passing each individual column to pnorm, but the entire hist_data to both mean and sd.
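As a quick sanity check, here is a minimal sketch that runs the loop from the question next to the apply version and compares them. The seed and the names looped and vectorised are just illustrative choices, not anything from the question.

```r
set.seed(42)  # arbitrary seed so the comparison is reproducible
hist_data <- data.frame(matrix(rnorm(200, mean = 5, sd = 0.5), nrow = 20))

# Loop version from the question: one pnorm call per cell,
# using the mean and sd of the column that cell lives in
looped <- matrix(NA_real_, nrow = nrow(hist_data), ncol = ncol(hist_data))
for (i in seq_len(ncol(hist_data))) {
  for (j in seq_len(nrow(hist_data))) {
    looped[j, i] <- pnorm(hist_data[j, i],
                          mean = mean(hist_data[, i]),
                          sd   = sd(hist_data[, i]))
  }
}

# Vectorised version: x is one whole column, so mean(x) and sd(x)
# are computed per column
vectorised <- apply(hist_data, 2,
                    function(x) pnorm(x, mean = mean(x), sd = sd(x)))

# apply() keeps the column names, so drop them before comparing values
all.equal(looped, unname(vectorised))
```

Note that apply returns a matrix rather than a data frame; wrap the result in as.data.frame() if you need one.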

As I mentioned on twitter, I'm no stats guy so I can't answer anything about what you're actually trying to do :)

geoffjentry
I think there is an extra comma in your example. I think there needn't be a comma after function(x). This is exactly what I wanted to do. And an example of how much more compact vector code is than looping code. Thanks so much for helping me with this. And for following #rstats in Twitter!
JD Long
Oops, yeah. I typed that in by hand, didn't c+p it. This is my exact line: normalize <- apply(hist_data, 2, function(x) pnorm(x, mean=mean(x), sd=sd(x)))
geoffjentry
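On the bonus question of what the second code block was actually doing, here is my best reconstruction. As far as I know, in the R versions of that time mean() and sd() applied to a data frame returned one value per column, so pnorm received a whole vector of means and a whole vector of sds and recycled them down the rows of each column. Current R no longer behaves this way for data frames, so the sketch below uses colMeans() and sapply(hist_data, sd) as stand-ins (that substitution is my assumption), and it uses the 20 x 10 data from the first example.

```r
set.seed(1)  # arbitrary seed
hist_data <- data.frame(matrix(rnorm(200, mean = 5, sd = 0.5), nrow = 20))

# Assumption: stand-ins for what mean(hist_data) and sd(hist_data) used to return
col_means <- colMeans(hist_data)
col_sds   <- sapply(hist_data, sd)

# What the original call effectively computed: inside each column, row j is
# paired with the mean and sd of column j (recycled again from row 11 on)
mystery <- apply(hist_data, 2, pnorm, mean = col_means, sd = col_sds)

# The intended result: every row of a column uses that column's own mean and sd
correct <- apply(hist_data, 2,
                 function(x) pnorm(x, mean = mean(x), sd = sd(x)))

# Cells where the two agree: only where the recycled index matches the column
agree <- abs(mystery - correct) < 1e-12
which(agree, arr.ind = TRUE)
```

Under that reading, the matching cells are the ones where the row index (modulo the number of columns) equals the column index, which includes the upper left and bottom right corners mentioned in the question.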
A: 

I'm just curious what your goal is. With the pnorm function, you are computing the percentile each data point would occupy in a normal distribution with the specified mean and sd. For example, if your data is -2,-1,0,1,2, which has mean 0 and sd 1.58, the results of your function would be 0.10 0.26 0.50 0.74 0.90, rounded to 2 digits. This means that your data would correspond to the 10th, 26th, 50th, 74th and 90th percentiles of the normal distribution with mean 0 and sd 1.58, if the data were truly from that distribution. I'm not sure why this is useful, so I hope to be enlightened.
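The numbers in that example can be reproduced directly; the variable names x and p are only for illustration.

```r
x <- c(-2, -1, 0, 1, 2)

round(sd(x), 2)   # sample standard deviation: 1.58

# Percentile of each value under a normal with the sample mean and sd
p <- round(pnorm(x, mean = mean(x), sd = sd(x)), 2)
p                 # 0.10 0.26 0.50 0.74 0.90
```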

Abhijit
Well, it has been a month since I asked the question and I don't recall _exactly_ what I was doing, but here's the general idea: I was building a Monte Carlo model of non-normal correlated distributions. In my real application the distributions were not normal. They were either Johnson or they were non-parametric (probably kernels), but I had a p function like pnorm or pjohnson. After taking the percentile I would then use the correlation matrix and fit a copula to the percentiles (now uniform between 0 and 1). I would then simulate correlated deviates, then map those deviates back to 'real' values.
JD Long
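For anyone curious what that workflow might look like, here is a rough base-R sketch and not the original model. Everything specific in it is an assumption for illustration: the log-normal toy data, the empirical CDF standing in for pjohnson or a kernel CDF, the choice of a Gaussian copula, and all variable names.

```r
set.seed(123)  # arbitrary seed

# Hypothetical stand-in data: two correlated, non-normal (log-normal) series
true_rho <- matrix(c(1, 0.7, 0.7, 1), nrow = 2)
obs <- exp(matrix(rnorm(2000), ncol = 2) %*% chol(true_rho))

# 1. Map each column to percentiles in (0, 1); the empirical CDF here plays
#    the role of the p function
u <- apply(obs, 2, function(x) rank(x) / (length(x) + 1))

# 2. Gaussian copula (an assumption): estimate the correlation of the
#    normal scores of those percentiles
rho <- cor(qnorm(u))

# 3. Simulate correlated normals, then convert them to correlated uniforms
n_sim <- 5000
sim_u <- pnorm(matrix(rnorm(n_sim * 2), ncol = 2) %*% chol(rho))

# 4. Map the uniforms back to 'real' values with each column's empirical quantiles
sim <- sapply(1:2, function(i) quantile(obs[, i], probs = sim_u[, i], names = FALSE))

# The simulated values should keep roughly the rank correlation of the data
cor(sim, method = "spearman")
```

With a real Johnson or kernel fit, steps 1 and 4 would use the fitted p and q functions instead of rank() and quantile().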