I have a data set with some null values in one field. When I try to run a linear regression, it treats the integers in the field as category indicators, not numbers.

E.g., for a field that contains no null values...

summary(lm(rank ~ num_ays, data=a))

Returns:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.607597   0.019927 532.317  < 2e-16 ***
num_ays      0.021955   0.007771   2.825  0.00473 ** 

But when I run the same model on a field with null values, I get:

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  1.225e+01  1.070e+00   11.446  < 2e-16 ***
num_azs0    -1.780e+00  1.071e+00   -1.663  0.09637 .  
num_azs1    -1.103e+00  1.071e+00   -1.030  0.30322    
num_azs10   -9.297e-01  1.080e+00   -0.861  0.38940    
num_azs100   1.750e+00  5.764e+00    0.304  0.76141    
num_azs101  -6.250e+00  4.145e+00   -1.508  0.13161    

What's the best and/or most efficient way to handle this, and what are the tradeoffs?

+1  A: 

You can ignore null values like so:

a[!is.null(a$num_ays),]
Shane
Thanks, Shane. I tried to apply that using: summary(lm(rank ~ num_ays, data=a[!is.null(a$num_ays)])). It gave me the same output, though.
Dan
@Shane `is.null` returns `TRUE` if the object is `NULL` and `FALSE` otherwise. So your construct returns either all rows of `a` or a 0-row `data.frame`. I'm pretty sure you were thinking of `is.na` ;)
Marek
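To illustrate Marek's point, here is a minimal sketch on made-up data. The data frame below is a hypothetical stand-in for the asker's `a`, and it assumes the missing entries are stored as `NA`:

```r
# Hypothetical toy data standing in for the asker's data frame `a`
a <- data.frame(rank = c(10, 11, 12, 13),
                num_ays = c(1, NA, 3, 4))

# is.null() tests the object as a whole, so it gives a single FALSE here
null_flag <- is.null(a$num_ays)

# is.na() is vectorised and flags each missing element
na_flags <- is.na(a$num_ays)

# Keep only the rows where num_ays is present
a_complete <- a[!na_flags, ]
```

With `!is.null(...)` the row index is a single `TRUE`, which is recycled and keeps every row, so the filter has no effect.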
+1  A: 

And to build on Shane's answer: you can use that in the data= argument of lm():

summary(lm(rank ~ num_ays, data=a[!is.null(a$num_ays),]))
Dirk Eddelbuettel
Thanks, Dirk. I tried that but it's still treating the numbers in the column as category labels... same result as before. Am I missing something else as well?
Dan
You are being tripped up by factors, which is a separate issue from the missing values. Search for "[r] factor" (i.e., the term `factor` within posts tagged `[r]` for R). You will need to read the data differently and/or convert the column.
Dirk Eddelbuettel
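A rough sketch of both routes Dirk mentions. It assumes the file marks missing entries with the literal text "NULL", which the thread does not confirm, and the inline CSV and the `raw` vector are made up for illustration:

```r
# Route 1: convert a column that was already read as a factor.
# Hypothetical values: numbers stored as text plus a literal "NULL" marker
raw <- factor(c("0", "1", "10", "NULL", "100"))

# as.numeric() directly on a factor returns the internal level codes,
# so go through as.character() first; "NULL" becomes NA
# (suppressWarnings hides the expected coercion warning)
values <- suppressWarnings(as.numeric(as.character(raw)))

# Route 2: read the data differently, telling read.csv() which string
# means "missing" so the column is parsed as numeric from the start
d <- read.csv(text = "rank,num_azs\n10,0\n11,NULL\n12,5",
              na.strings = "NULL")
```

Once the column is numeric, `lm()` should fit a single slope for it rather than one coefficient per distinct value.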
Isn't it better to use the `subset` argument of `lm`?
Marek
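A small sketch of what the `subset` suggestion could look like, on made-up data and assuming the missing values are `NA`:

```r
# Hypothetical toy data; num_ays has one missing value
a <- data.frame(rank = c(10.2, 10.9, 12.1, 12.8, 14.0),
                num_ays = c(1, 2, NA, 4, 5))

# Drop the incomplete rows through lm()'s own subset argument
fit_subset <- lm(rank ~ num_ays, data = a, subset = !is.na(num_ays))

# With R's default na.action (na.omit), lm() drops those rows anyway
fit_default <- lm(rank ~ num_ays, data = a)
```

Under the default settings the two fits should agree, so explicit filtering mostly makes the intent visible. It does not change how a factor column is treated.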
It was factors. Thanks for the help!
Dan