ansaurus

Question

Filtering away nearby points from a list

Answer 1

+1 A:

This sounds like color quantization, where you reduce the number of colors in an image. One way would be to plot the colors in space, and combine clusters into the center (or a weighted average) of a cluster.

The exact name of the algorithm that triggered this memory fails me, but I'll edit the answer if it pops up, but in the meantime, you should look at color quantization and see if some of the algorithms are useful.

Lasse V. Karlsen 2009-01-06 13:02:17

It definitely looks a lot like color quantization (now, that I have checked it up). Thanks! But it seems like most real-world color quantifications work with a known number of clusters (the palette). In my case I need to find that out myself (roughly, it's not a problem with an exact answer).

PEZ 2009-01-06 21:52:55

Answer 2

+1 A:

Start with the "Convex Hull" problem. You're also looking for some "convex hull"-like clusters.

Note that "clusters" is vague. You have an average mass across your field. Some points have above average mass, and some below average. How far above average means you've found a cluster? How far apart do nodes have to be to be part of a cluster or a separate cluster?

What's the difference between two mountain peaks and a ridge?

You have to compute a "topography" - joining all points with equal density into regions. This requires that you pick a spot and work your want out from a point radially, locating positions where the densities are equal. You can connect those points into regions.

If you picked your initial point wisely, the regions should nest. Picking your starting point is easy because you start at local highs.

S.Lott 2009-01-06 13:03:48

+1. I think that for the purpose and especially since we're talking about small bitmaps I'll venture down this path first.

PEZ 2009-01-06 21:40:04

Answer 3

+1 A:

Since you are already talking about mass, why not a gravity based solution. A simple particle system would not need to be super accurate, and you would not have to run it for too long before you could make a much better guess at the number of clusters.

If you have a better idea about cluster numbers, k-means nearest neighbour becomes feasible.

jamesh 2009-01-06 13:21:28

Thanks! And +1. When I think about the problem I tend to see a gravity "net". I'll take your clues and see where it leads me.

PEZ 2009-01-06 21:16:04

Answer 4

+3 A:

Shane MacLaughlin 2009-01-06 13:42:05

Thanks. The list that this question talks about contains points of interpolated mass. The points with the highest mass in each cluster can therefore be regarded as the center of mass for each cluster. Now I'll look at that triangulation link and see if it's what I'm looking for.

PEZ 2009-01-06 20:48:25

Answer 5

+4 A:

It sounds to me like you're looking for the K-means algorithm.

Nick Johnson 2009-01-06 13:56:27

I think that for K-means the number of clusters needs to be known.

PEZ 2009-01-06 20:01:34

You're right, it does. The asker doesn't have a robust definition of what a 'cluster' is, though, which makes this more or less impossible.

Nick Johnson 2009-01-07 10:41:02

The question is more about filtering the list than clustering. But I see what you mean. If you check that other question up that's linked in this question, does it make the definition a bit clearer?

PEZ 2009-01-07 15:44:46

Answer 6

+5 A:

Just so you know, you are asking for a solution to an ill-posed problem: no definitive solution exists. That's fine...it just makes it more fun. Your problem is ill-posed mostly because you don't know how many clusters you want. Clustering is one of the key areas of machine learning and there a quite a few approaches that have been developed over the years.

As Arachnid pointed out, the k-means algorithm tends to be a good one and it's pretty easy to implement. The results depend critically on the initial guess made and on the number of desired clusters. To overcome the initial guess problem, it's common to run the algorithm many times with random initializations and pick the best result. You'll need to define what "best" means. One measure would be the mean squared distance of each point to its cluster center. If you want to automatically guess how many clusters there are, you should run the algorithm with a whole range of numbers of clusters. For any good "best" measure, more clusters will always look better than fewer, so you'll need a way to penalize having too many clusters. The MDL discussion on wikipedia is a good starting point.

K-means clustering is basically the simplest mixture model. Sometimes it's helpful to upgrade to a mixture of Gaussians learned by expectation maximization (described in the link just given). This can be more robust than k-means. It takes a little more effort to understand it, but when you do, it's not much harder than k-means to implement.

There are plenty of other clustering techniques such as agglomerative clustering and spectral clustering. Agglomerative clustering is pretty easy to implement, but choosing when to stop building the clusters can be tricky. If you do agglomerative clustering, you'll probably want to look at kd trees for faster nearest neighbor searches. smacl's answer describes one slightly different way of doing agglomerative clustering using a Voronoi diagram.

There are models that can automatically choose the number of clusters for you such as ones based on Latent Dirichlet Allocation, but they are a lot harder to understand an implement correctly.

You might also want to look at the mean-shift algorithm to see if it's closer to what you really want.

Mr Fooz 2009-01-06 15:46:16

Interesting, stuff. The voronoi diagram is also referred to as a Dirichlet tessellation, I wasn't aware of LDAs. I think if you take the mass as scalar, the problem is solvable, and basically equates to gravimetric modelling as done when surveying geoids for GPS.

Shane MacLaughlin 2009-01-06 16:17:11

Thanks! Lots of interesting stuff here. Following your reasoning and some of your links I understand that I should probably not be too quick in dividing my problem. Found a PDF talking about mean-shift: http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf

PEZ 2009-01-06 21:20:00

@smacl: note that Dirichlet is a rather overloaded term (the guy was a pretty important mathematician). In LDA, it refers to a Dirichlet distribution, which serves as a prior over the clusters. Before now, I hadn't heard of his other work such as that on space tessellation.

Mr Fooz 2009-01-06 22:50:57

Very well put. +1.

Nick Johnson 2009-01-07 10:42:21

ansaurus

tags:

views:

answers:

Filtering away nearby points from a list

related questions