views: 116
answers: 4
So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using Euclidean distance; currently k = 2, if that makes it easier).

My first thought was to use a kd-tree for this, but as it turns out they become rather inefficient as the number of dimensions grows. In my sample implementation it's only slightly faster than exhaustive search.

My next idea is to use PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: is there some clever algorithm or data structure that solves this exactly in reasonable time?

+1  A: 

You could conceivably use Morton Codes, but with 75 dimensions they're going to be huge. And if all you have is 16,000 data points, exhaustive search shouldn't take too long.

Anon
That's 16,000 comparisons for each point, so 16,000^2 comparisons overall. It's not taking forever, but at least hours, possibly days. I'll read up on Morton codes, though.
Benno
@Benno: Ah, I assumed that you were looking at nearest neighbors for arriving points. If you're looking to order by locality, then Morton Codes are probably the way to go, large or not. You should Google for searching Z-order curves, as there's some weirdness to it. And you might want to look at the Wikipedia article on GeoHash, which is essentially a text translation of a Morton Code. (and finally, note that Morton codes as described are for 2 dimensions, but there's no real limit to the number of dimensions you can interleave)
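A minimal sketch of the interleaving step described above, assuming the coordinates have already been quantised to unsigned integers. With 75 dimensions a single 64-bit word only holds one bit per coordinate, so a real implementation would need a multi-word code, but the idea is the same:

```cpp
#include <cstdint>
#include <vector>

// Interleave the low `bitsPerDim` bits of each coordinate into one Morton code.
// Requires bitsPerDim * coords.size() <= 64.
uint64_t mortonCode(const std::vector<uint32_t>& coords, int bitsPerDim) {
    const int d = static_cast<int>(coords.size());
    uint64_t code = 0;
    for (int bit = 0; bit < bitsPerDim; ++bit)
        for (int dim = 0; dim < d; ++dim)
            code |= static_cast<uint64_t>((coords[dim] >> bit) & 1u) << (bit * d + dim);
    return code;
}
```

Sorting the points by this code yields the Z-order (locality-preserving) ordering mentioned above.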
Anon
A: 

No reason to believe this is NP-complete. You're not really optimizing anything, and I'd have a hard time figuring out how to convert this to another NP-complete problem (I have Garey and Johnson on my shelf and can't find anything similar). Really, I'd just pursue more efficient methods of searching and sorting. If you have n observations, you have to calculate n x n distances right up front. Then for every observation, you need to pick out the top k nearest neighbors. That's n squared for the distance calculation and n log n for each sort, but you have to do the sort n times (a different sort for every observation). Messy, but still polynomial time to get your answers.
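A minimal sketch of that brute-force scheme: squared Euclidean distances to every other point, then a partial sort per observation to pull out the k nearest.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// points[i] is one 75-dimensional observation; returns, for each point,
// the indices of its k nearest neighbours (excluding the point itself).
std::vector<std::vector<int>> bruteForceKnn(
        const std::vector<std::vector<double>>& points, int k) {
    const int n = static_cast<int>(points.size());
    std::vector<std::vector<int>> result(n);
    for (int i = 0; i < n; ++i) {
        // n distances per query point, n^2 overall.
        std::vector<std::pair<double, int>> dist;
        dist.reserve(n - 1);
        for (int j = 0; j < n; ++j) {
            if (j == i) continue;
            double d2 = 0.0;
            for (std::size_t c = 0; c < points[i].size(); ++c) {
                const double diff = points[i][c] - points[j][c];
                d2 += diff * diff;
            }
            dist.emplace_back(d2, j);
        }
        // Only the k smallest are needed, so a partial sort beats a full sort.
        std::partial_sort(dist.begin(), dist.begin() + k, dist.end());
        for (int m = 0; m < k; ++m) result[i].push_back(dist[m].second);
    }
    return result;
}
```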

Grembo
+1  A: 

A BK-tree isn't such a bad thought. Take a look at Nick's Blog on Levenshtein Automata. While his focus is strings, it should give you a springboard for other approaches. The other thing I can think of is R-trees; however, I don't know whether they've been generalized for large dimensions. I can't say more than that, since I have neither used them directly nor implemented them myself.

wheaties
+2  A: 

The Wikipedia article for kd-trees has a link to the ANN library:

ANN is a library written in C++, which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions.

Based on our own experience, ANN performs quite efficiently for point sets ranging in size from thousands to hundreds of thousands, and in dimensions as high as 20. (For applications in significantly higher dimensions, the results are rather spotty, but you might try it anyway.)

As far as algorithms/data structures are concerned:

The library implements a number of different data structures, based on kd-trees and box-decomposition trees, and employs a couple of different search strategies.

I'd try it on the raw data first, and if that doesn't produce satisfactory results I'd use it on the data set after applying PCA/ICA (since it's quite unlikely you're going to end up with few enough dimensions for a kd-tree to handle).
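For reference, a rough sketch of what a query with the library looks like; the names follow ANN's documented C++ interface (annAllocPts, ANNkd_tree, annkSearch), but check the manual of the version you install:

```cpp
#include <ANN/ANN.h>

// Build a kd-tree over n points of dimension dim and query each point's
// k nearest neighbours. Since every query point is itself in the set,
// ask for k + 1 neighbours and skip the first hit (the point itself).
void nearestNeighbours(double** data, int n, int dim, int k) {
    ANNpointArray pts = annAllocPts(n, dim);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < dim; ++j)
            pts[i][j] = data[i][j];

    ANNkd_tree tree(pts, n, dim);

    ANNidxArray  idx  = new ANNidx[k + 1];
    ANNdistArray dist = new ANNdist[k + 1];
    for (int i = 0; i < n; ++i) {
        // eps = 0.0 requests exact neighbours; a positive eps trades accuracy for speed.
        tree.annkSearch(pts[i], k + 1, idx, dist, 0.0);
        // idx[1] .. idx[k] are the k nearest neighbours of point i.
    }

    delete[] idx;
    delete[] dist;
    annDeallocPts(pts);
    annClose();  // free ANN's internal bookkeeping
}
```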

Eugen Constantin Dinca
+1 for ANN, a pretty hot topic in information retrieval lately. Search "FLANN" for one implementation by Lowe, and "ANN" by David Mount. Also search for the related "locality sensitive hashing".
Steve
That's what I ended up using, and (along with a rewrite of the program in C++) it's pretty fast (< 1 second).
Benno
@Benno: Glad it worked for you. Out of curiosity, did you apply it to the original data (75 dimensions) or to what resulted after doing PCA/ICA?
Eugen Constantin Dinca
I did apply PCA, but only to change the basis vectors, not to reduce the dimensionality. It has its limits, though: for another dataset with 240,000 points it took about 15 minutes.
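A minimal sketch of that change-of-basis step, assuming the Eigen library (the thread doesn't say which linear-algebra package was used): center the data, take the eigenvectors of the covariance matrix, and rotate onto them without discarding any components.

```cpp
#include <Eigen/Dense>

// Rotate an n x d data matrix into its principal-component basis,
// keeping all d dimensions (a change of basis, not a reduction).
Eigen::MatrixXd pcaRotate(const Eigen::MatrixXd& X) {
    Eigen::MatrixXd centered = X.rowwise() - X.colwise().mean();
    Eigen::MatrixXd cov = (centered.adjoint() * centered) / double(X.rows() - 1);
    Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> es(cov);  // covariance is symmetric
    return centered * es.eigenvectors();                     // project onto all components
}
```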
Benno
@Benno: Tnx for the update.
Eugen Constantin Dinca