I have a regression model in which the dependent variable is continuous, but ninety percent of the independent variables are categorical (both ordered and unordered) and around thirty percent of the records have missing values (to make matters worse, the values are missing at random with no apparent pattern; that is, more than forty-five percent of the data have at least one missing value). There is no a priori theory to guide the specification of the model, so one of the key tasks is dimension reduction before running the regression. While I am aware of several methods for dimension reduction for continuous variables, I am not aware of a similar statistical literature for categorical data (except, perhaps, as part of correspondence analysis, which is basically a variation of principal component analysis on a frequency table). Let me also add that the dataset is of moderate size: 500,000 observations with 200 variables. I have two questions.
- Is there a good statistical reference for dimension reduction for categorical data together with robust imputation? (I think imputation has to come first, followed by dimension reduction.)
- The second question concerns implementation of the above. I have used R extensively and tend to rely heavily on the transcan and impute functions for continuous variables, and on a variation of a tree method to impute categorical values. I have a working knowledge of Python, so if there is something good available for this purpose I will use it. Any implementation pointers in Python or R would be a great help. Thank you.
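
To make the "tree method" for categorical imputation concrete, here is a minimal sketch of the idea in plain Python. It is not my actual code: it assumes the simplest possible case, a depth-one tree with a multiway split on a single categorical predictor, so each leaf is one predictor level and a missing value is filled with the most common observed value in that leaf. The function name `impute_by_group_mode` and the toy columns `region` and `housing` are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def impute_by_group_mode(rows, target, predictor):
    """Fill missing `target` values with the most common observed target
    value among rows that share the same `predictor` value.

    This mimics a depth-one classification tree with one leaf per
    predictor level. Missing values are represented by None.
    """
    # Overall mode, used when a row's leaf has no observed target values
    # or when the predictor itself is missing.
    overall = Counter(r[target] for r in rows if r[target] is not None)
    if not overall:
        raise ValueError("target column has no observed values")
    fallback = overall.most_common(1)[0][0]

    # Count observed target values within each predictor level (leaf).
    by_group = defaultdict(Counter)
    for r in rows:
        if r[target] is not None and r[predictor] is not None:
            by_group[r[predictor]][r[target]] += 1

    # Build new rows so the input data is left unmodified.
    imputed = []
    for r in rows:
        new = dict(r)
        if new[target] is None:
            leaf = by_group.get(new[predictor])
            new[target] = leaf.most_common(1)[0][0] if leaf else fallback
        imputed.append(new)
    return imputed


# Toy data (made up for illustration only).
rows = [
    {"region": "north", "housing": "rent"},
    {"region": "north", "housing": "rent"},
    {"region": "north", "housing": None},
    {"region": "south", "housing": "own"},
    {"region": "south", "housing": "own"},
    {"region": "south", "housing": "own"},
    {"region": "south", "housing": None},
    {"region": None, "housing": None},
]

filled = impute_by_group_mode(rows, "housing", "region")
```

A real tree would choose its splits from many predictors and grow deeper than one level; the sketch only shows the leaf-mode step that does the actual imputation, and what I am looking for is a robust, scalable version of this for 200 mostly categorical variables.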