About curse of dimensionality

Basically, the distance measure itself is still correct. It just becomes much less meaningful when you have noisy "real world" data.

The effect we are talking about here is that a large distance between two points in one dimension quickly gets overshadowed by the accumulated small distances in all the other dimensions. That's why, in the end, all points end up at roughly the same distance from each other. There is a good illustration for this:

Say we want to classify data based on their value in each dimension. We split each dimension (which has a range of 0..1) once: values in [0, 0.5) are positive, values in [0.5, 1] are negative. With this rule, the region where every coordinate is positive covers 0.5^3 = 12.5% of the space in 3 dimensions. In 5 dimensions, it is only 3.1%. In 10 dimensions, it is less than 0.1%.

So in each dimension we still allow half of the overall value range, which is quite a lot. But all of it ends up in 0.1% of the total space: the differences between these data points can be huge in each dimension, but they are negligible over the whole space.

You can go further and cut off only 10% of the range in each dimension, so you allow values in [0, 0.9). You still end up with less than 35% of the whole space covered in 10 dimensions. In 50 dimensions, it is about 0.5%. So you see, wide ranges of data in each dimension are crammed into a very small portion of your search space.
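The percentages above are just the allowed range per dimension raised to the power of the number of dimensions. Here is a minimal sketch to reproduce them. The function name `covered_fraction` is made up for this illustration, and it assumes every dimension spans 0..1 and gets the same cut:

```python
def covered_fraction(allowed_range: float, dims: int) -> float:
    """Fraction of the unit hypercube where every coordinate
    lies in [0, allowed_range)."""
    return allowed_range ** dims


# The (range, dimensions) pairs used in the text above.
for allowed_range, dims in [(0.5, 3), (0.5, 5), (0.5, 10), (0.9, 10), (0.9, 50)]:
    fraction = covered_fraction(allowed_range, dims)
    print(f"range {allowed_range}, {dims:2d} dims: {fraction:.3%}")
```

Since the fraction shrinks exponentially with the number of dimensions, even a very generous range per dimension eventually covers almost nothing.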

That's why you need dimensionality reduction, where you basically disregard differences along the less informative axes.
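If you want to see the "all points end up at about the same distance" effect for yourself, here is a rough sketch. It assumes uniformly random points in the unit hypercube, which is a simplification and not real data. The helper `relative_contrast` is a made-up name: it compares the spread between the farthest and nearest point to the nearest distance, as seen from one random query point.

```python
import math
import random


def relative_contrast(dims: int, n_points: int = 200, seed: int = 0) -> float:
    """(farthest - nearest) / nearest distance from one random query
    point to n_points uniformly random points in the unit hypercube."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    query = [rng.random() for _ in range(dims)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dims)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


for dims in (2, 10, 100, 1000):
    print(f"{dims:4d} dims: relative contrast = {relative_contrast(dims):.2f}")
```

In low dimensions the nearest point is many times closer than the farthest one. As the number of dimensions grows the ratio drops towards zero, so "nearest" and "farthest" stop being a useful distinction.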
