I have a 104-attribute dataset called data. I want to reduce the number of attributes to 20 using the prcomp function in R.

I did this:

pr = prcomp(data)

But pr is just an object of class prcomp. How do I reduce the number of attributes in the original dataset to 20?

+1  A: 

First of all, prcomp does a principal component analysis, and a principal component analysis produces as many components as there are variables. What you're looking for is a factor analysis:

ff <- factanal(data,20)

see ?factanal
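If you go the factor-analysis route, factanal can also return the factor scores directly, and those scores are then your reduced dataset. A minimal sketch on the built-in mtcars data (your own data frame and factors = 20 would take its place; the 3 factors here are just chosen to fit this small example):

```r
# factanal needs more variables than factors; mtcars has 11 numeric
# columns, so ask for 3 factors and regression-based factor scores
fa <- factanal(mtcars, factors = 3, scores = "regression")

reduced <- fa$scores   # one row per observation, one column per factor
dim(reduced)           # 32 observations, 3 factors
```

Note that without a scores argument, factanal does not compute scores at all, so fa$scores would be NULL.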

If you want to keep only the first 20 principal components as a new dataset, you can easily select them from the output of the predict() function, or even calculate them yourself:

x <- prcomp(USArrests, scale = TRUE)

tt <- predict(x) # the standard way

# below the matrix way
tt2 <- scale(USArrests,x$center,x$scale) %*% x$rotation

# with only 3 components instead of 4
tt3 <- predict(x)[,1:3]
tt4 <- scale(USArrests,x$center,x$scale) %*% x$rotation[,1:3]
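To convince yourself that the two routes agree, you can compare them numerically (continuing the USArrests example above):

```r
x <- prcomp(USArrests, scale = TRUE)

tt  <- predict(x)                                        # the standard way
tt2 <- scale(USArrests, x$center, x$scale) %*% x$rotation # the matrix way

# identical up to floating-point error
all.equal(tt, tt2)
```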

But be aware that a factor analysis reducing your dataset to 20 factors is NOT the same as keeping the first 20 principal components of a PCA.

Joris Meys
In the `prcomp` function call, if I set the `tol` variable such that only 20 principal components are selected, and if I set the `retx` as TRUE, and access the `x` member of the return object, would that work too? I have done that and got 20 attributes instead of 104.
louzer
@louzer: There is no automated way to set the tolerance so that you get a specific number of principal components; tol cuts out PCs whose standard deviation falls below a given fraction of the first component's standard deviation. Apart from that, it doesn't change the PCs, contrary to `factanal`. So you don't get 20 different components: you get the first 20 of the original 104 and ignore the last 84. That is something completely different. Please read up first on the differences between principal component analysis and factor analysis. If you use retx=TRUE, then you can forget about tol and just do `pr$x[,1:20]`, similar to tt3 in my example.
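In other words, with the default retx = TRUE you can take the first score columns directly from the returned object, and tol only truncates the same scores. A sketch on USArrests, keeping 2 of its 4 components (for the 104-attribute data the index would be 1:20; the tol value here is picked so that exactly 2 components survive on this particular dataset):

```r
pr <- prcomp(USArrests, scale = TRUE)  # retx = TRUE is the default
reduced <- pr$x[, 1:2]                 # first 2 score columns

# tol drops PCs with sd <= tol * sd(PC1); 0.5 keeps 2 PCs for USArrests
pr_tol <- prcomp(USArrests, scale = TRUE, tol = 0.5)

# the surviving scores are the same either way
all.equal(reduced, pr_tol$x)
```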
Joris Meys
I see. Thanks. I will try factor analysis too. I just want 20 dimensions instead of 104, so as to do an approximate k-nearest neighbor search in a computationally feasible manner. I use that technique on every member of my dataset to find the true negatives that are most similar to the true positives, because my raw dataset has 100 times more true negatives than true positives. I want the number of instances in the positive dataset and the negative dataset to be the same, so as to make the SVM training possible on a desktop.
louzer
@louzer: You could try stats.stackexchange.com for a more in-depth discussion of the statistical reasoning behind your approach. I'm not completely following what you want to do, but I'm positive there are people there who can give you good advice and different approaches to the problem.
Joris Meys
@Joris: I will check it out.
louzer
@louzer If that is all you want, it is better to just select a random subset of objects than to mess with the attributes.
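That random-subset suggestion amounts to one line of sampling: draw as many negatives as there are positives before training. A hedged sketch on toy data, assuming a data frame df with a logical column positive (both names are made up for illustration):

```r
set.seed(42)
# toy imbalanced data: 10 positives, 1000 negatives
df <- data.frame(x = rnorm(1010),
                 positive = rep(c(TRUE, FALSE), times = c(10, 1000)))

pos <- df[df$positive, ]
neg <- df[!df$positive, ]

# undersample the negatives to match the number of positives
neg_sub  <- neg[sample(nrow(neg), nrow(pos)), ]
balanced <- rbind(pos, neg_sub)

table(balanced$positive)  # 10 FALSE, 10 TRUE
```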
mbq