I wan to do a linear regression in R using the lm()
function. My data is an annual time series with one field for year (22 years) and another for state (50 states). I want to fit a regression for each state so that at the end I have a vector of lm responses. I can imagine doing for loop for each state then doing the regression inside the loop and adding the results of each regression to a vector. That does not seem very R-like, however. In SAS I would do a 'by' statement and in SQL I would do a 'group by'. What's the R way of doing this?
views:
913answers:
4
+2
A:
## make fake data
> ngroups <- 2
> group <- 1:ngroups
> nobs <- 100
> dta <- data.frame(group=rep(group,each=nobs),y=rnorm(nobs*ngroups),x=runif(nobs*ngroups))
> head(dta)
group y x
1 1 0.6482007 0.5429575
2 1 -0.4637118 0.7052843
3 1 -0.5129840 0.7312955
4 1 -0.6612649 0.9028034
5 1 -0.5197448 0.1661308
6 1 0.4240346 0.8944253
>
> ## function to extract the results of one model
> foo <- function(z) {
+ ## coef and se in a data frame
+ mr <- data.frame(coef(summary(lm(y~x,data=z))))
+ ## put row names (predictors/indep variables)
+ mr$predictor <- rownames(mr)
+ mr
+ }
> ## see that it works
> foo(subset(dta,group==1))
Estimate Std..Error t.value Pr...t.. predictor
(Intercept) 0.2176477 0.1919140 1.134090 0.2595235 (Intercept)
x -0.3669890 0.3321875 -1.104765 0.2719666 x
> ## one option: use command by
> res <- by(dta,dta$group,foo)
> res
dta$group: 1
Estimate Std..Error t.value Pr...t.. predictor
(Intercept) 0.2176477 0.1919140 1.134090 0.2595235 (Intercept)
x -0.3669890 0.3321875 -1.104765 0.2719666 x
------------------------------------------------------------
dta$group: 2
Estimate Std..Error t.value Pr...t.. predictor
(Intercept) -0.04039422 0.1682335 -0.2401081 0.8107480 (Intercept)
x 0.06286456 0.3020321 0.2081387 0.8355526 x
> ## using package plyr is better
> library(plyr)
> res <- ddply(dta,"group",foo)
> res
group Estimate Std..Error t.value Pr...t.. predictor
1 1 0.21764767 0.1919140 1.1340897 0.2595235 (Intercept)
2 1 -0.36698898 0.3321875 -1.1047647 0.2719666 x
3 2 -0.04039422 0.1682335 -0.2401081 0.8107480 (Intercept)
4 2 0.06286456 0.3020321 0.2081387 0.8355526 x
>
Eduardo Leoni
2009-07-23 04:22:58
+4
A:
Here's one way using the lme4
package.
> library(lme4)
> d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
+ year=rep(1:10, 2),
+ response=c(rnorm(10), rnorm(10)))
> xyplot(response ~ year, groups=state, data=d, type='l')
> fits <- lmList(response ~ year | state, data=d)
> fits
Call: lmList(formula = response ~ year | state, data = d)
Coefficients:
(Intercept) year
CA -1.34420990 0.17139963
NY 0.00196176 -0.01852429
Degrees of freedom: 20 total; 16 residual
Residual standard error: 0.8201316
ars
2009-07-23 04:55:41
+3
A:
In my opinion is a mixed linear model a better approach for this kind of data. The code below given in the fixed effect the overall trend. The random effects indicate how the trend for each individual state differ from the global trend. The correlation structure takes the temporal autocorrelation into account. Have a look at Pinheiro & Bates (Mixed Effects Models in S and S-Plus).
library(nlme)
lme(response ~ year, random = ~year|state, correlation = corAR1(~year))
Thierry
2009-07-24 11:32:35
This is a really good general stats theory answer which makes me think about some things I had not considered. The application which caused me to ask the question would not be applicable to this solution, but I'm glad you brought it up. Thank you.
JD Long
2009-07-24 14:11:32
It's not a good idea to start with a mixed model - how do you know that any of the assumptions are warranted?
hadley
2009-07-31 19:26:41
One should check the assumption by model validation (and knowledge of the data). BTW you cannot warrant the assumption on the individual lm's either. You would have to validate all models seperately.
Thierry
2009-07-31 22:59:39
+9
A:
Here's an approach using the plyr package:
d <- data.frame(
state = rep(c('NY', 'CA'), 10),
year = rep(1:10, 2),
response= rnorm(20)
)
library(plyr)
# Break up d by state, then fit the specified model to each piece and
# return a list
models <- dlply(d, "state", function(df)
lm(response ~ year, data = df))
# Apply coef to each model and return a data frame
ldply(models, coef)
# Print the summary of each model
l_ply(models, summary, .print = TRUE)
hadley
2009-07-31 19:30:45