views:

528

answers:

3

Given data of the following form

myDat = structure(list(Score = c(1.84, 2.24, 3.8, 2.3, 3.8, 4.55, 1.13, 
2.49, 3.74, 2.84, 3.3, 4.82, 1.74, 2.89, 3.39, 2.08, 3.99, 4.07, 
1.93, 2.39, 3.63, 2.55, 3.09, 4.76), Subject = c(1L, 1L, 1L, 
2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 
7L, 7L, 8L, 8L, 8L), Condition = c(0L, 0L, 0L, 1L, 1L, 1L, 0L, 
0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 
1L), Time = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 
1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L)), .Names = c("Score", 
"Subject", "Condition", "Time"), class = "data.frame", row.names = c(NA, 
-24L))

I would like to model Score as a function of Subject, Condition and Time. Each (human) Subject's score was measured three times, indicated by the variable Time, so I have repeated measures.

How can I build in R a random effects model with Subject effects fitted as random?

ADDENDUM: It's been asked how I generated these data. You guessed it, the data are fake as the day is long. Score is time plus random noise and being in Condition 1 adds a point to Score. It's instructive as a typical Psych setup. You have a task where people's score gets better with practice (time) and a drug (condition==1) that enhances score.

Here are some more realistic data for the purposes of this discussion. Now simulated participants have a random "skill" level that is added to their scores. Also, the factors are now strings.

myDat = structure(list(Score = c(1.62, 2.18, 2.3, 3.46, 3.85, 4.7, 1.41, 
2.21, 3.32, 2.73, 3.34, 3.27, 2.14, 2.73, 2.74, 3.39, 3.59, 4.01, 
1.81, 1.83, 3.22, 3.64, 3.51, 4.26), Subject = structure(c(1L, 
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 
6L, 7L, 7L, 7L, 8L, 8L, 8L), .Label = c("A", "B", "C", "D", "E", 
"F", "G", "H"), class = "factor"), Condition = structure(c(1L, 
1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 
2L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("No", "Yes"), class = "factor"), 
    Time = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 
    2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1PM", 
    "2PM", "3PM"), class = "factor")), .Names = c("Score", "Subject", 
"Condition", "Time"), class = "data.frame", row.names = c(NA, 
-24L))

See it:

library(ggplot2)
qplot(Time, Score, data = myDat, geom = "line", group = Subject, colour = factor(Condition))
+2  A: 

How is this as a start:

R> suppressMessages(library(lme4))
R> lmer(Score ~ Subject + Condition + Time + (Subject + Condition + Time|Subject) - 1, data=myDat)
Linear mixed model fit by REML 
Formula: Score ~ Subject + Condition + Time + (Subject + Condition + Time |      Subject) - 1 
   Data: myDat 
  AIC  BIC logLik deviance REMLdev
 53.3 69.8  -12.6     14.6    25.3
Random effects:
 Groups   Name        Variance Std.Dev. Corr                 
 Subject  (Intercept) 0.11796  0.3434                        
          Subject     0.02583  0.1607   -1.000               
          Condition   0.01291  0.1136    1.000 -1.000        
          Time        0.00254  0.0504    1.000 -1.000  1.000 
 Residual             0.10228  0.3198                        
Number of obs: 24, groups: Subject, 4

Fixed effects:
          Estimate Std. Error t value
Subject     0.1432     0.0657    2.18
Condition   0.9063     0.1423    6.37
Time        1.0586     0.0799   13.24

Correlation of Fixed Effects:
          Subjct Condtn
Condition -0.452       
Time      -0.820  0.134
R>

but you may need to set the repeated measures part up differently. This isn't really an area in which I play much, so buyer beware...

Dirk Eddelbuettel
Beat me to it =). But I think having subject as a regressor is a bit suspect because, while it's encoded as a number, it seems that it's semantically really just a factor.
Jonathan Chang
+1  A: 

using the nlme library...

Answering your stated question, you can create a random intecept mixed effect model using the following code:

> library(nlme)
> m1 <- lme(Score ~ Condition + Time + Condition*Time,
+ data = myDat, random = ~ 1 | Subject)
> summary(m1)
Linear mixed-effects model fit by REML
 Data: myDat 
       AIC      BIC    logLik
  31.69207 37.66646 -9.846036

Random effects:
 Formula: ~1 | Subject
         (Intercept)  Residual
StdDev: 5.214638e-06 0.3151035

Fixed effects: Score ~ Condition + Time + Condition * Time 
                   Value Std.Error DF  t-value p-value
(Intercept)    0.6208333 0.2406643 14 2.579666  0.0218
Condition      0.7841667 0.3403507  6 2.303996  0.0608
Time           0.9900000 0.1114059 14 8.886423  0.0000
Condition:Time 0.0637500 0.1575517 14 0.404629  0.6919
 Correlation: 
               (Intr) Condtn Time  
Condition      -0.707              
Time           -0.926  0.655       
Condition:Time  0.655 -0.926 -0.707

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max 
-1.5748794 -0.6704147  0.2069426  0.7467785  1.5153752 

Number of Observations: 24
Number of Groups: 8

The intercept variance is basically 0, indicating no within subject effect, so this model is not capturing the between time relationship well. A random intercept model is rarely the type of model you want for a repeated measures design. A random intercept model assumes that the correlations between all time points are equal. i.e. the correlation between time 1 and time 2 is the same as between time 1 and time 3. Under normal circumstances (perhaps not those generating your fake data) we would expect the later to be less than the former. An auto regressive structure is usually a better way to go.

> m2<-gls(Score ~ Condition + Time + Condition*Time,
+ data = myDat, correlation = corAR1(form = ~ Time | Subject))
> summary(m2)
Generalized least squares fit by REML
  Model: Score ~ Condition + Time + Condition * Time 
  Data: myDat 
       AIC      BIC    logLik
  25.45446 31.42886 -6.727232

Correlation Structure: AR(1)
 Formula: ~Time | Subject 
 Parameter estimate(s):
       Phi 
-0.5957973 

Coefficients:
                   Value Std.Error   t-value p-value
(Intercept)    0.6045402 0.1762743  3.429543  0.0027
Condition      0.8058448 0.2492895  3.232566  0.0042
Time           0.9900000 0.0845312 11.711652  0.0000
Condition:Time 0.0637500 0.1195452  0.533271  0.5997

 Correlation: 
               (Intr) Condtn Time  
Condition      -0.707              
Time           -0.959  0.678       
Condition:Time  0.678 -0.959 -0.707

Standardized residuals:
       Min         Q1        Med         Q3        Max 
-1.6850557 -0.6730898  0.2373639  0.8269703  1.5858942 

Residual standard error: 0.2976964 
Degrees of freedom: 24 total; 20 residual

Your data is showing a -.596 between time point correlation, which seems odd. normally there should, at a minimum be a positive correlation between time points. How was this data generated?

addendum:

With your new data we know that the data generating process is equivalent to a random intercept model (though that is not the most realistic for a longitudinal study. The visualization shows that the effect of time seems to be fairly linear, so we should feel comfortable treating it as a numeric variable.

> library(nlme)
> m1 <- lme(Score ~ Condition + as.numeric(Time) + Condition*as.numeric(Time),
+ data = myDat, random = ~ 1 | Subject)
> summary(m1)
Linear mixed-effects model fit by REML
 Data: myDat 
       AIC      BIC    logLik
  38.15055 44.12494 -13.07527

Random effects:
 Formula: ~1 | Subject
        (Intercept)  Residual
StdDev:   0.2457355 0.3173421

Fixed effects: Score ~ Condition + as.numeric(Time) + Condition * as.numeric(Time) 
                                  Value Std.Error DF   t-value p-value
(Intercept)                    1.142500 0.2717382 14  4.204415  0.0009
ConditionYes                   1.748333 0.3842958  6  4.549447  0.0039
as.numeric(Time)               0.575000 0.1121974 14  5.124898  0.0002
ConditionYes:as.numeric(Time) -0.197500 0.1586710 14 -1.244714  0.2337
 Correlation: 
                              (Intr) CndtnY as.(T)
ConditionYes                  -0.707              
as.numeric(Time)              -0.826  0.584       
ConditionYes:as.numeric(Time)  0.584 -0.826 -0.707

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-1.44560367 -0.65018585  0.01864079  0.52930925  1.40824838 

Number of Observations: 24
Number of Groups: 8

We see a significant Condition effect, indicating that the 'yes' condition tends to have higher scores (by about 1.7), and a significant time effect, indicating that both groups go up over time. Supporting the plot, we find no differential effect of time between the two groups (the interaction). i.e. the slopes are the same.

Ian Fellows
@Ian - see the note I've added to the question
Dan Goldstein
+2  A: 
hadley
That is lovely!
Dan Goldstein