tags:

views:

165

answers:

3

Hello Everybody !

I want to use R to perform a stepwise linear Regression using p-values as a selection criterion e.g. at each step dropping variables that have the highest i.e. the most insignificant p-values, stopping when all values are significant defined by some treshold alpha.

I am totally aware that I should use the AIC (e.g. command step or stepAIC) or some other criterion instead, but my boss has no grasp of statistics and insist on using p-values.

If nesseary I could program my own routine, but I am wondering if there is an already implemented version of this.

Thanks a lot, Dainis

+3  A: 

Here is an example. Start with the most complicated model: this includes interactions between all three explanatory variables.

model1 <-lm (ozone~temp*wind*rad)
summary(model1)

Coefficients:
Estimate Std.Error t value Pr(>t)
(Intercept) 5.683e+02 2.073e+02 2.741 0.00725 **
temp          -1.076e+01 4.303e+00 -2.501 0.01401 *
wind          -3.237e+01 1.173e+01 -2.760 0.00687 **
rad           -3.117e-01 5.585e-01 -0.558 0.57799
temp:wind      2.377e-01 1.367e-01 1.739 0.08519   
temp:rad       8.402e-03 7.512e-03 1.119 0.26602
wind:rad       2.054e-02 4.892e-02 0.420 0.47552
temp:wind:rad -4.324e-04 6.595e-04 -0.656 0.51358

The three-way interaction is clearly not significant. This is how you remove it, to begin the process of model simplification:

model2 <- update(model1,~. - temp:wind:rad)
summary(model2)

Depending on the results, you can continue simplifying your model:

model3 <- update(model2,~. - temp:rad)
summary(model3)
...

Alternatively you can use the automatic model simplification function step, to see how well it does:

model_step <- step(model1)
gd047
Thanks for your answer. I just realized that I did not state my question clearlly enough: I am looking for a command or package that does this automatically, using the p-value as a criterion. The command "step" uses the AIC. I could also do it by "hand" as you suggets or programm some routine that does this automatically, but an already implemented version would come in very handy.
DainisZ
+6  A: 

Show your boss the following :

set.seed(100)
x1 <- runif(100,0,1)
x2 <- as.factor(sample(letters[1:3],100,replace=T))

y <- x1+x1*(x2=="a")+2*(x2=="b")+rnorm(100)
summary(lm(y~x1*x2))

Which gives :

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.1525     0.3066  -0.498  0.61995    
x1            1.8693     0.6045   3.092  0.00261 ** 
x2b           2.5149     0.4334   5.802 8.77e-08 ***
x2c           0.3089     0.4475   0.690  0.49180    
x1:x2b       -1.1239     0.8022  -1.401  0.16451    
x1:x2c       -1.0497     0.7873  -1.333  0.18566    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Now, based on the p-values you would exclude which one? x2 is most significant and most non-significant at the same time.


Edit : To clarify : This exaxmple is not the best, as indicated in the comments. The procedure in Stata and SPSS is AFAIK also not based on the p-values of the T-test on the coefficients, but on the F-test after removal of one of the variables.

I have a function that does exactly that. This is a selection on "the p-value", but not of the T-test on the coefficients or on the anova results. Well, feel free to use it if it looks useful to you.

#####################################
# Automated model selection
# Author      : Joris Meys
# version     : 0.2
# date        : 12/01/09
#####################################
#CHANGE LOG
# 0.2   : check for empty scopevar vector
#####################################

# Function has.interaction checks whether x is part of a term in terms
# terms is a vector with names of terms from a model
has.interaction <- function(x,terms){
    out <- sapply(terms,function(i){
        sum(1-(strsplit(x,":")[[1]] %in% strsplit(i,":")[[1]]))==0
    })
    return(sum(out)>0)
}

# Function Model.select
# model is the lm object of the full model
# keep is a list of model terms to keep in the model at all times
# sig gives the significance for removal of a variable. Can be 0.1 too (see SPSS)
# verbose=T gives the F-tests, dropped var and resulting model after 
model.select <- function(model,keep,sig=0.05,verbose=F){
      counter=1
      # check input
      if(!is(model,"lm")) stop(paste(deparse(substitute(model)),"is not an lm object\n"))
      # calculate scope for drop1 function
      terms <- attr(model$terms,"term.labels")
      if(missing(keep)){ # set scopevars to all terms
          scopevars <- terms
      } else{            # select the scopevars if keep is used
          index <- match(keep,terms)
          # check if all is specified correctly
          if(sum(is.na(index))>0){
              novar <- keep[is.na(index)]
              warning(paste(
                  c(novar,"cannot be found in the model",
                  "\nThese terms are ignored in the model selection."),
                  collapse=" "))
              index <- as.vector(na.omit(index))
          }
          scopevars <- terms[-index]
      }

      # Backward model selection : 

      while(T){
          # extract the test statistics from drop.
          test <- drop1(model, scope=scopevars,test="F")

          if(verbose){
              cat("-------------STEP ",counter,"-------------\n",
              "The drop statistics : \n")
              print(test)
          }

          pval <- test[,dim(test)[2]]

          names(pval) <- rownames(test)
          pval <- sort(pval,decreasing=T)

          if(sum(is.na(pval))>0) stop(paste("Model",
              deparse(substitute(model)),"is invalid. Check if all coefficients are estimated."))

          # check if all significant
          if(pval[1]<sig) break # stops the loop if all remaining vars are sign.

          # select var to drop
          i=1
          while(T){
              dropvar <- names(pval)[i]
              check.terms <- terms[-match(dropvar,terms)]
              x <- has.interaction(dropvar,check.terms)
              if(x){i=i+1;next} else {break}              
          } # end while(T) drop var

          if(pval[i]<sig) break # stops the loop if var to remove is significant

          if(verbose){
             cat("\n--------\nTerm dropped in step",counter,":",dropvar,"\n--------\n\n")              
          }

          #update terms, scopevars and model
          scopevars <- scopevars[-match(dropvar,scopevars)]
          terms <- terms[-match(dropvar,terms)]

          formul <- as.formula(paste(".~.-",dropvar))
          model <- update(model,formul)

          if(length(scopevars)==0) {
              warning("All variables are thrown out of the model.\n",
              "No model could be specified.")
              return()
          }
          counter=counter+1
      } # end while(T) main loop
      return(model)
}
Joris Meys
Nice example - I could have used this a couple of months ago!
Matt Parker
I think it's obvious the decision is not based exclusively on the p-values. You start by removing the least significant highest-order interactions and then the potential quadratic terms, before considering any explanatory variables for removal.
gd047
off course it isn't. But remove the interaction term and you'll see the situation remains the same. So you should actually go to an anova, but then you run into the problem of which type. And that's a can of worms I'm not going to open.
Joris Meys
No chance argueing with the boss on a level that suffisticated about statistics ;-) Nevertheless I'll try and will do the stepwise selection "by hand" untill then. It's just weird that there isn't a R function for that, even when the approach using p-values comes from the time when there were too high computational costs of computiong the AIC ...
DainisZ
Perhaps the `treatment` contrasts is not the best option in your case. Giving `options(contrasts=c("contr.sum","contr.poly"))` would produce radically different results.
gd047
I know, it's far from the best example. It was just to bring a point across. In any case, the "p-value based" method is based on an F-test, not on the t-test of the coefficients. So the function I added should give the OP what he wants.
Joris Meys
Thanks for the great help. You are right, for the "p-value based" method I should use the F-Test, forgot about that. Also thanks for posting your code, that also helps alot.
DainisZ
A: 

If you are just trying to get the best predictive model, then perhaps it doesn't matter to much, but for anything else, don't both with this sort of model selection. It is wrong. Use a shrinkage methods such as ridge regression (in lm.ridge() in package MASS for example), or the lasso, or the elasticnet (a combination of ridge and lasso constraints). Of these, only the lasso and elastic net will do some form of model selection, i.e. force the coefficients of some covariates to zero.

See the Regularization and Shrinkage section of the Machine Learning task view on CRAN.

Gavin Simpson