Hi,

I'm trying to reduce the dimensionality of a dataset. PCA is a good method, but it gives me a new, transformed dataset. My goal is instead to determine, from a number of events (e.g. 60) and a number of trials (e.g. 6), which of the original events are most relevant.

For example:

  • The 1st, 3rd, 21st, 45th, ... events (N in total) are good enough to approximate the behavior of the dataset.
    That would allow me to discard the other 60-N events and work with only N.

For now, I'm calculating the covariance matrix and keeping the events whose correlation is smaller than some threshold.
Is there an established metric or mathematical function for this?
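
One way to read the thresholding described above, as a rough sketch only: treat each event as a column over the trials, and keep an event when its absolute Pearson correlation with every event already kept is below the threshold. The greedy order, the function names `pearson` and `select_events`, the threshold value, and the toy data are all assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return cov / (sx * sy)

def select_events(data, threshold=0.9):
    """data: list of trials, each a list of event values.
    Returns the indices of the events kept by a greedy pass."""
    n_events = len(data[0])
    # One column per event, holding that event's value in every trial.
    columns = [[trial[j] for trial in data] for j in range(n_events)]
    kept = []
    for j in range(n_events):
        # Keep event j only if it is weakly correlated with all kept events.
        if all(abs(pearson(columns[j], columns[k])) < threshold for k in kept):
            kept.append(j)
    return kept

# Toy data (made up): 6 trials x 4 events; event 1 is exactly 2 * event 0.
data = [
    [1.0, 2.0, 5.0, 0.3],
    [2.0, 4.0, 1.0, 0.1],
    [3.0, 6.0, 4.0, 0.9],
    [4.0, 8.0, 2.0, 0.2],
    [5.0, 10.0, 6.0, 0.7],
    [6.0, 12.0, 3.0, 0.8],
]
kept = select_events(data, threshold=0.9)
print(kept)
```

Here event 1 is dropped because it duplicates event 0. With only 6 trials the correlation estimates are noisy, so the result can depend strongly on the threshold and on the order in which events are visited.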

Thanks.

A: 

What you are describing is not dimensionality reduction but sampling. If your data is labeled (I couldn't tell from your question), you most probably want stratified sampling: a random sampling that ensures each label is drawn with a probability approximately equal to its proportion in the original data set. See this Wikipedia article on sampling techniques. It provides a list of good reading material on this matter.
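
A minimal sketch of stratified sampling, assuming each event carries a label. The function name `stratified_sample`, the `fraction` and `seed` parameters, and the example labels are hypothetical and not taken from the question.

```python
import random
from collections import defaultdict

def stratified_sample(labels, fraction, seed=0):
    """Return sorted indices sampled so that each label keeps roughly
    the same share it has in the full data."""
    rng = random.Random(seed)
    # Group the item indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    chosen = []
    for indices in by_label.values():
        # Take the same fraction from every group, at least one item each.
        k = max(1, round(fraction * len(indices)))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)

# Made-up example: 60 events, 40 labeled "a" and 20 labeled "b".
labels = ["a"] * 40 + ["b"] * 20
picked = stratified_sample(labels, fraction=0.25)
print(len(picked), picked)
```

With a fraction of 0.25 this draws 10 events from the "a" group and 5 from the "b" group, so the 2:1 ratio of the original data is preserved in the sample.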

bgbg