views:

428

answers:

3

Background

Right now, I'm creating a multiple-predictor linear model and generating diagnostic plots to assess regression assumptions. (It's for a multiple regression analysis stats class that I'm loving at the moment :-)

My textbook (Cohen, Cohen, West, and Aiken 2003) recommends plotting each predictor against the residuals to make sure that:

  1. The residuals don't systematically covary with the predictor
  2. The residuals are homoscedastic with respect to each predictor in the model

On point (2), my textbook has this to say:

Some statistical packages allow the analyst to plot lowess fit lines at the mean of the residuals (0-line), 1 standard deviation above the mean, and 1 standard deviation below the mean of the residuals....In the present case {their example}, the two lines {mean + 1sd and mean - 1sd} remain roughly parallel to the lowess {0} line, consistent with the interpretation that the variance of the residuals does not change as a function of X. (p. 131)

How can I modify loess lines?

I know how to generate a scatterplot with a "0-line,":

    # First, I'll make a simple linear model and get its diagnostic stats
    library(ggplot2)
    data(cars)
    mod <- fortify(lm(speed ~ dist, data = cars))
    attach(mod)
    str(mod)

    # Now I want to make sure the residuals are homoscedastic
    qplot (x = dist, y = .resid, data = mod) + 
    geom_smooth(se = FALSE) # "se = FALSE" Removes the standard error bands

But does anyone know how I can use ggplot2 and qplot to generate plots where the 0-line, "mean + 1sd" AND "mean - 1sd" lines would be superimposed? Is that a weird/complex question to be asking?

+1  A: 

Could you calculate the +/- standard deviation values from the data and add a fitted curve of them to the plot?

celenius
Great suggestion, though I don't think it would quite work. Cohen et al (2003) add this: "The other two lines {mean + 1sd and mean - 1sd} are created using the lowess procedure to estimate values 1 standard deviation above and 1 standard deviation below the lowess line."Since the sd calculations (it seems) are effectively point-wise computations in the lowess algorithm, I'm not sure your suggestion would accurately produce what we want. If what you mean is to calculate the sd of the dataset, then add/subtract that from the lowess estimates, that would create exact copies of the curve :-(
briandk
Ah I see... Thanks for explaining. One other good source of ggplot2 advice is [email protected], so if you don't resolve your question here, that is an option (also the ggplot2 tag exists)
celenius
+1  A: 

Have a look at my question "modify lm or loess function.."

I am not sure I followed your question very well, but maybe a:

+ stat_smooth(method=yourfunction)

will work, provided that you define your function as described here.

dalloliogm
@dalloliogm - Thanks for the suggestion! It turns out (see below) that I don't even need to specify a custom `yourfunction`, because the default confidence bands for `geom_smooth()` are bounded by precisely the lines I was trying to plot all along.
briandk
+2  A: 

Apology

Folks, I want to apologize for my ignorance. Hadley is absolutely right, and the answer was right in front of me all along. As I suspected, my question was born of statistical, rather than programmatic ignorance.

We get the 68% Confidence Interval for Free

geom_smooth() defaults to loess smoothing, and it superimposes the +1sd and -1sd lines as part of the deal. That's what Hadley meant when he said "Isn't that just a 68% confidence interval?" I just completely forgot that's what the 68% interval is, and kept searching for something that I already knew how to do. It didn't help that I'd actually turned the confidence intervals off in my code by specifying geom_smooth(se = FALSE).

What my Sample Code Should Have Looked Like

# First, I'll make a simple linear model and get its diagnostic stats.
library(ggplot2)
data(cars)
mod <- fortify(lm(speed ~ dist, data = cars))
attach(mod)
str(mod)

# Now I want to make sure the residuals are homoscedastic.
# By default, geom_smooth is loess and includes the 68% standard error bands.
qplot (x = dist, y = .resid, data = mod) + 
geom_abline(slope = 0, intercept = 0) +
geom_smooth() 

What I've Learned

Hadley implemented a really beautiful and simple way to get what I'd wanted all along. But because I was focused on loess lines, I lost sight of the fact that the 68% confidence interval was bounded by the very lines I needed. Sorry for the trouble, everyone.

briandk
fine, I thought there it was something strange in the question :-). Thank you for make it clearer, anyway
dalloliogm