I have a few hundred thousand measurements where the dependent variable is a probability, and I would like to use logistic regression. However, the covariates I have are all categorical and, worse, are all nested. By this I mean that if a certain measurement has "city = Phoenix" then it is certain to have "state = Arizona" and "country = U.S." I have four such nested factors. The most granular has some 20k levels, but I think I could do without that one if need be. I also have a few non-nested categorical covariates (only four or so, with maybe three levels each).

What I am most interested in is prediction: given a new observation in some city, I would like to know the relevant probability (the dependent variable). I am less interested in the related inferential machinery (standard deviations, etc.), at least for now, so I am hoping I can afford to be sloppy. I would still love to have that information, as long as it does not require methods that are more computationally expensive.

Does anyone have any advice on how to attack this? I have looked into mixed effects models, but am not sure they are what I am looking for.

A: 

I think this is more of a model design question than a question about R specifically; as such, I'd like to address the context of the question first and then the appropriate R packages.

If your dependent variable is a probability, i.e., $y\in[0,1]$, a standard logistic regression is not appropriate for the data, particularly given that you are interested in predicting probabilities out of sample. Logistic regression models the contribution of the independent variables to the probability that a binary dependent variable takes the value one rather than zero. Since your variable is continuous and bounded, you need a different specification.
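
To make the distinction concrete, here is the standard binary logistic model written out. The symbols $x_i$ (covariate vector) and $\beta$ (coefficients) are introduced here for illustration only and do not come from the question:

$$\Pr(y_i = 1 \mid x_i) = \frac{1}{1 + e^{-x_i^\top \beta}}, \qquad y_i \in \{0, 1\}$$

The observed outcome on the left is a zero or a one, whereas in the question each observed $y_i$ is itself a value in $[0,1]$.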

I think your latter intuition about mixed effects is a good one. Since your observations are nested, i.e., US <-> AZ <-> Phoenix, a multi-level model, or in this case a hierarchical linear model, may be the best specification for your data. The best R packages for this type of modeling are multilevel and nlme, and there is an excellent introduction to both multi-level modeling in R and nlme available here. You may be particularly interested in its discussion of data manipulation for multi-level modeling, which begins on page 26.
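
As a rough illustration of why a multi-level structure helps with prediction for nested factors, here is a minimal sketch of partial pooling in plain Python. It is not what nlme or multilevel actually do (those estimate variance components from the data); it simply shrinks each level's mean toward its parent's estimate using a fixed pseudo-count `k`, which is an assumption made for this example. The function names `fit_shrunken_means` and `predict` are hypothetical.

```python
from collections import defaultdict

def fit_shrunken_means(rows, k=5.0):
    """Estimate a mean for every level of a country > state > city hierarchy.

    rows: iterable of (country, state, city, y) with y in [0, 1].
    k:    assumed pseudo-count controlling shrinkage toward the parent level.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for country, state, city, y in rows:
        # Accumulate totals for the global level and each nested level.
        for key in ((), (country,), (country, state), (country, state, city)):
            sums[key] += y
            counts[key] += 1

    # The global level is the plain mean of all observations.
    estimates = {(): sums[()] / counts[()]}
    # Visit coarse levels first so each parent estimate already exists.
    for key in sorted(sums, key=len):
        if not key:
            continue
        parent = estimates[key[:-1]]
        # Weighted average of the level's own data and its parent's estimate.
        estimates[key] = (sums[key] + k * parent) / (counts[key] + k)
    return estimates

def predict(estimates, country, state, city):
    """Use the finest level seen in training, falling back to its ancestors."""
    key = (country, state, city)
    while key not in estimates:
        key = key[:-1]
    return estimates[key]

rows = [
    ("US", "AZ", "Phoenix", 0.2),
    ("US", "AZ", "Phoenix", 0.4),
    ("US", "AZ", "Tucson", 0.6),
    ("US", "CA", "LA", 0.8),
]
estimates = fit_shrunken_means(rows, k=1.0)
print(predict(estimates, "US", "AZ", "Phoenix"))    # city seen in training
print(predict(estimates, "US", "AZ", "Flagstaff"))  # unseen city: state-level estimate
```

Cities with few observations are pulled strongly toward their state, and a city never seen in training inherits its state's estimate, which is the behaviour a hierarchical model gives you in a more principled way.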

DrewConway
A: 

I would suggest looking into penalised regressions like the elastic net. The elastic net is used in text mining, where each column represents the presence or absence of a single word and there may be hundreds of thousands of variables, a problem analogous to yours. A good place to start with R would be the glmnet package and its accompanying JSS paper: http://www.jstatsoft.org/v33/i01/.
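
To show the analogy with text mining, here is a small sketch in plain Python (not glmnet itself) of how nested categorical factors become sparse presence/absence columns, together with the elastic-net penalty in the parameterization used by the glmnet paper, $\lambda\left[\alpha\lVert\beta\rVert_1 + \tfrac{1-\alpha}{2}\lVert\beta\rVert_2^2\right]$. The helper names `encode`, `elastic_net_penalty` and `linear_predictor`, and the coefficient values, are made up for illustration.

```python
def encode(obs):
    """Map {'factor': 'level'} to the set of indicator columns that are 1."""
    return {f"{factor}={level}" for factor, level in obs.items()}

def elastic_net_penalty(beta, lam, alpha):
    """Return lam * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2)."""
    l1 = sum(abs(b) for b in beta.values())
    l2_sq = sum(b * b for b in beta.values())
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_sq)

def linear_predictor(active, beta, intercept=0.0):
    """Sum the coefficients of the active columns; absent columns count as 0."""
    return intercept + sum(beta.get(col, 0.0) for col in active)

# One observation activates one column per factor, like words in a document.
obs = {"country": "U.S.", "state": "Arizona", "city": "Phoenix"}
active = encode(obs)

# Hypothetical coefficients: the penalty drives most of them to exactly zero,
# so only the non-zero ones need to be stored.
beta = {"state=Arizona": 0.5, "city=Phoenix": -0.25}

print(sorted(active))
print(linear_predictor(active, beta))
print(elastic_net_penalty(beta, lam=0.1, alpha=0.5))
```

With 20k city levels, each observation still activates only a handful of columns, so a sparse representation keeps the problem manageable.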

hadley