Group detection in data sets

+3 A:

I think you are looking for something along the lines of a k-means clustering algorithm.

You should be able to find adequate implementations in most general purpose languages.
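If you would rather see the idea than hunt for a library, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The function name `kmeans`, the `iterations` and `seed` parameters, and the choice to initialise from randomly picked data points are all assumptions made for this sketch; real implementations usually add smarter initialisation and multiple restarts.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Cluster a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels
```

Calling `kmeans(points, 2)` on a list of `(x, y)` tuples returns the two centroids and one cluster index per point. Note that you still have to pass `k` yourself.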

consultutah 2010-01-12 21:01:45

+3 A:

There are many choices, but if you are interested in the probability that a new data point belongs to a particular mixture component, I would use a probabilistic approach such as Gaussian mixture modeling, estimated either by maximum likelihood or by Bayesian methods.

Maximum likelihood estimation of mixture models is implemented in Matlab.

Your requirement that the number of components is unknown makes your model more complex. The dominant probabilistic approach is to place a Dirichlet Process prior on the mixture distribution and estimate by some Bayesian method. For instance, see this paper on infinite Gaussian mixture models. The DP mixture model will give you inference over the number of components and over the component each element belongs to, which is exactly what you want. Alternatively, you could perform model selection on the number of components, but this is generally less elegant.
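To make the model-selection alternative concrete, here is a rough sketch for one-dimensional data: fit a Gaussian mixture by EM (maximum likelihood) for several candidate component counts and keep the one with the lowest BIC. This is not a Dirichlet Process mixture, only the simpler alternative. All function names, the quantile-based initialisation, the variance floor, and the use of BIC as the selection criterion are assumptions made for this sketch.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(xs, k, iterations=200):
    """Fit a k-component 1-D Gaussian mixture by EM; returns (weights, means, variances)."""
    n = len(xs)
    ordered = sorted(xs)
    grand_mean = sum(xs) / n
    grand_var = sum((x - grand_mean) ** 2 for x in xs) / n
    floor = 1e-6 * grand_var + 1e-12  # assumption: stops a component collapsing onto one point
    # Assumption: deterministic start with means at evenly spaced quantiles.
    means = [ordered[int(n * (j + 0.5) / k)] for j in range(k)]
    variances = [max(grand_var / k ** 2, floor)] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            row = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(row) or 1e-300
            resp.append([r / total for r in row])
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, floor)
    return weights, means, variances

def log_likelihood(xs, model):
    weights, means, variances = model
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)) or 1e-300)
        for x in xs
    )

def bic(xs, model):
    k = len(model[0])
    n_params = 3 * k - 1  # k means, k variances, k - 1 free weights
    return n_params * math.log(len(xs)) - 2 * log_likelihood(xs, model)

def select_model(xs, candidates=(1, 2, 3, 4)):
    """Return the fitted model with the lowest BIC among the candidate component counts."""
    return min((fit_gmm_1d(xs, k) for k in candidates), key=lambda m: bic(xs, m))

def membership(x, model):
    """Posterior probability that a (possibly new) point x belongs to each component."""
    weights, means, variances = model
    row = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(row) or 1e-300
    return [r / total for r in row]
```

`select_model(xs)` gives you the chosen mixture, and `membership(new_x, model)` gives the per-component probabilities for a new data point, which is the quantity mentioned at the top of this answer.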

There are many implementations of DP mixture models, but they may not be as convenient. For instance, here's a Matlab implementation.

Your graph suggests you are an R user. In that case, if you are looking for prepackaged solutions, the answer to your question lies in this Task View for cluster analysis.

Tristan 2010-01-12 22:14:13

+1 A:

You need one of the clustering algorithms. All of them can be divided into 2 groups:

1. you specify the number of groups (clusters): 2 clusters in your example
2. the algorithm tries to guess the correct number of clusters by itself

If you want an algorithm of the 1st type, then K-Means is what you really need.

If you want an algorithm of the 2nd type, then you probably need one of the hierarchical clustering algorithms. I have never implemented any of them, but I see an easy way to improve K-means so that it becomes unnecessary to specify the number of clusters.
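For what it's worth, a bare-bones version of hierarchical (agglomerative) clustering is short enough to sketch. The version below uses single linkage and stops merging once the two closest clusters are farther apart than a threshold, so you give it a distance instead of a cluster count. The function name `agglomerative` and the `max_distance` parameter are assumptions made for this sketch, and the cubic-time loop is only suitable for small data sets.

```python
import math

def agglomerative(points, max_distance):
    """Single-linkage agglomerative clustering; returns a list of clusters (lists of points)."""
    clusters = [[p] for p in points]  # start with every point in its own cluster

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > max_distance:
            break  # every remaining pair is too far apart to merge
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters
```

The number of clusters is then simply `len(agglomerative(points, max_distance))`; the price is that you now have to choose a sensible distance threshold for your data.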

Roman 2010-01-12 22:39:11
