3D clustering Algorithm

Problem Statement: I have the following problem:

There are more than a billion points in 3D space. The goal is to find the top N points which has largest number of neighbors within given distance R. Another condition is that the distance between any two points of those top N points must be greater than R. The distribution of those points are not uniform. It is very common that certain regions of the space contain a lot of points.

Goal: To find an algorithm that can scale well to many processors and has a small memory requirement.

Thoughts: Normal spatial decomposition is not sufficient for this kind of problem due to the non-uniform distribution. irregular spatial decomposition that evenly divide the number of points may help us the problem. I will really appreciate that if someone can shed some lights on how to solve this problem.

I don't have a definite answer for you, but I have a suggestion for an approach that might yield a solution.

I think it's worth investigating locality-sensitive hashing. I think dividing the points evenly and then applying this kind of LSH to each set should be readily parallelisable. If you design your hashing algorithm such that the bucket size is defined in terms of R, it seems likely that for a given set of points divided into buckets, the points satisfying your criteria are likely to exist in the fullest buckets.

Having performed this locally, perhaps you can apply some kind of map-reduce-style strategy to combine spatial buckets from different parallel runs of the LSH algorithm in a step-wise manner, making use of the fact that you can begin to exclude parts of your problem space by discounting entire buckets. Obviously you'll have to be careful about edge cases that span different buckets, but I suspect that at each stage of merging, you could apply different bucket sizes/offsets such that you remove this effect (e.g. perform merging spatially equivalent buckets, as well as adjacent buckets). I believe this method could be used to keep memory requirements small (i.e. you shouldn't need to store much more than the points themselves at any given moment, and you are always operating on small(ish) subsets).

If you're looking for some kind of heuristic then I think this result will immediately yield something resembling a "good" solution - i.e. it will give you a small number of probable points which you can check satisfy your criteria. If you are looking for an exact answer, then you are going to have to apply some other methods to trim the search space as you begin to merge parallel buckets.

Another thought I had was that this could relate to finding the metric k-center. It's definitely not the exact same problem, but perhaps some of the methods used in solving that are applicable in this case. The problem is that this assumes you have a metric space in which computing the distance metric is possible - in your case, however, the presence of a billion points makes it undesirable and difficult to perform any kind of global traversal (e.g. sorting of the distances between points). As I said, just a thought, and perhaps a source of further inspiration.

This is indeed very similar to maximum coverage problem. The object function is different. The object here is to minimize: Sum((Ci-Ct/K)^2), i = 1,..K, where K is the number of the partition, Ci is the number of points in Set i and Ct is the total number of points.

Teng Lin 2010-08-15 04:17:19

Ci is not exactly the variable we want to optimize. But it should be close enough. Ideally, Ci should also include the number of points in its nearest neighbor cells on the surface. Since the cell size is R, if distance calculation only needs to extend its nearest neighbor cell.

Teng Lin 2010-08-15 04:21:11

One idea I have in mind now is that to create LxMxN cells ( length for each cell is R). The number of points can be easily recorded for each cell. And then a cluster algorithm can be used to find the dense clusters. Since there are way too many points, it is infeasible to perform the clustering algorithm for individual point. However, we can reduce the resolution of the counts in the LxMxN cell by dividing the counts by an arbitrary number. For instance, Ct/(LMN). And then greedy algorithm can be utilized to make the partition. Not sure if that is on the right track.

Teng Lin 2010-08-15 04:23:55

I'm not quite sure what those first two comments were in reference to. However, the last approach you described is basically exactly what I was suggesting - some kind of spatial division (e.g. LSH) that allows you to count points in buckets to locate plausible solutions that you can then check (also, if your bucket size is in terms of R, then you can actually start to just do normal distance metrics on points in the current bucket and adjacent buckets).

Gian 2010-08-15 14:33:22

The first two comments are related to the partition of the data after placing them into the grid.

Teng Lin 2010-08-16 03:57:31

ansaurus

tags:

views:

answers:

3D clustering Algorithm

related questions