I have a regression model in which the dependent variable is continuous, but ninety percent of the independent variables are categorical (both ordered and unordered) and around thirty percent of the records have missing values. To make matters worse, they are missing randomly without any pattern; that is, more than forty-five percent of the data have at least one missing value. There is no a priori theory to guide the specification of the model, so one of the key tasks is dimension reduction before running the regression. While I am aware of several methods of dimension reduction for continuous variables, I am not aware of a similar statistical literature for categorical data (except, perhaps, as part of correspondence analysis, which is basically a variation of principal component analysis on a frequency table). Let me also add that the dataset is of moderate size: 500,000 observations with 200 variables. I have two questions.

  1. Is there a good statistical reference for dimension reduction for categorical data along with robust imputation? (I think the first issue is imputation and then dimension reduction.)
  2. This is linked to the implementation of the above problem. I have used R extensively in the past; I rely heavily on the transcan and impute functions for continuous variables and use a variation of a tree method to impute categorical values. I have a working knowledge of Python, so if there is something good out there for this purpose I will use it. Any implementation pointers in Python or R will be of great help. Thank you.
+4  A: 

Regarding imputation of categorical data, I would suggest checking the mice package. Also take a look at this presentation, which explains how it imputes multivariate categorical data. Another package for Multiple Imputation of Incomplete Multivariate Data is Amelia. Amelia includes some limited capacity to deal with ordinal and nominal variables.
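Since you mention Python, here is a minimal sketch of the general idea behind multiple imputation of a categorical variable. It is not the algorithm that mice or Amelia implement: it is a much simpler hot-deck-style stand-in that fills each missing value by sampling from the observed values among records sharing the same level of one predictor, and repeats that m times. The function names (`impute_conditional`, `multiple_imputations`), the field names, and the convention that missing cells are encoded as `None` are all assumptions made for illustration.

```python
import random
from collections import defaultdict

MISSING = None  # assumption: missing cells are encoded as None


def impute_conditional(records, target, predictor, rng):
    """Fill missing `target` values by sampling from the observed `target`
    values of records that share the same `predictor` level.

    Falls back to the pooled observed values when no donor shares the level.
    Assumes at least one record has an observed `target`.
    """
    by_level = defaultdict(list)
    pooled = []
    for rec in records:
        if rec[target] is not MISSING:
            by_level[rec[predictor]].append(rec[target])
            pooled.append(rec[target])

    completed = []
    for rec in records:
        rec = dict(rec)  # copy so the input records are left untouched
        if rec[target] is MISSING:
            donors = by_level.get(rec[predictor]) or pooled
            rec[target] = rng.choice(donors)
        completed.append(rec)
    return completed


def multiple_imputations(records, target, predictor, m=5, seed=0):
    """Return m independently completed copies of the data."""
    rng = random.Random(seed)
    return [impute_conditional(records, target, predictor, rng)
            for _ in range(m)]


# Hypothetical toy data: income_band is missing for three records.
records = [
    {"region": "north", "income_band": "low"},
    {"region": "north", "income_band": "low"},
    {"region": "north", "income_band": None},
    {"region": "south", "income_band": "high"},
    {"region": "south", "income_band": None},
    {"region": "east", "income_band": None},
]
completed_sets = multiple_imputations(records, "income_band", "region", m=5)
```

Each of the m completed datasets would then be analysed separately and the estimates combined; the spread between them reflects the uncertainty due to the missing values. With 200 variables you would want a proper chained-equations implementation rather than a single hand-picked predictor as used here.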

As for dimensionality reduction for categorical data (i.e. a way to arrange variables into homogeneous clusters), I would suggest Multiple Correspondence Analysis (MCA), which will give you the latent variables that maximize the homogeneity of the clusters. As in Principal Component Analysis (PCA) and Factor Analysis, the MCA solution can also be rotated to increase the simplicity of the components. The idea behind a rotation is to find subsets of variables that coincide more clearly with the rotated components. This implies that maximizing component simplicity can help in factor interpretation and in clustering variables. In R, MCA methods are included in the packages ade4, MASS, FactoMineR and ca (at least). As for FactoMineR, you can use it through a graphical interface if you add it as an extra menu to the ones already provided by the Rcmdr package, by installing RcmdrPlugin.FactoMineR.
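To make the MCA idea concrete in Python, here is a minimal sketch that treats MCA as correspondence analysis of the indicator (one-hot) matrix, computed with NumPy's SVD. It assumes complete data (i.e. imputation has already been done), it does no rotation, and it is not the code used by any of the R packages above. The names `indicator_matrix`, `mca` and `survey` are hypothetical.

```python
import numpy as np


def indicator_matrix(rows):
    """One-hot encode a list of equal-length tuples of category labels."""
    n_vars = len(rows[0])
    columns = []  # one (variable index, level) pair per indicator column
    for j in range(n_vars):
        for level in sorted({row[j] for row in rows}):
            columns.append((j, level))
    Z = np.array([[1.0 if row[j] == level else 0.0 for (j, level) in columns]
                  for row in rows])
    return Z, columns


def mca(rows, n_components=2):
    """Correspondence analysis of the indicator matrix.

    Returns row coordinates, column (level) coordinates, the principal
    inertias, and the (variable index, level) label of each column.
    """
    Z, columns = indicator_matrix(rows)
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    k = n_components
    row_coords = (U[:, :k] * s[:k]) / np.sqrt(r)[:, None]
    col_coords = (Vt.T[:, :k] * s[:k]) / np.sqrt(c)[:, None]
    return row_coords, col_coords, s[:k] ** 2, columns


# Hypothetical toy data: three categorical variables, six observations.
survey = [
    ("yes", "low", "a"),
    ("yes", "mid", "a"),
    ("no", "high", "b"),
    ("no", "high", "b"),
    ("yes", "low", "b"),
    ("no", "mid", "a"),
]
row_coords, col_coords, inertias, columns = mca(survey, n_components=2)
print(inertias)
```

The columns of `row_coords` are continuous scores, so the first few of them could be used as predictors in the regression in place of the original categorical variables. For 500,000 rows you would not build the dense indicator matrix this way; the sketch is only meant to show what is being computed.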

gd047
Thank you. This is really helpful.