views:

332

answers:

4

Hi, I have a problem where I have a set of numbers eg.

5, 7, 7, 8, 8, 8, 7, 20, 23, 23, 24, 24, 24, 25

In the above set, there is two "clusters" of numbers, I want to write a program to find the centers of these clusters. Could you call them attractors as in Fractal theory?

So the program would, I guess, find that the set can be divided into two:

A - 5, 7, 7, 8, 8, 8, 7

B - 20, 23, 23, 24, 24, 24, 25

Set A can then have it average calculated, set B can have its average calculated then I have the two centers of the attractors.

Maybe this is a simple problem for a good math/statistics person? Can anyone point me in the right direction? I may have between 1 and 5 "attractors/clusters".

A: 

Like that?

public class Cluster {
    public static void main(String[] args) {
        int maxDist = 5;
        char cluster = 'A';
        int[] values = { 5 , 7 , 7 , 8 , 8 , 8 , 7 , 20 , 23 , 23 , 24 , 24 , 24 , 25 };
        int prev = values[0];
        System.out.print( cluster + " - " + prev + " ");
        for ( int i = 1 ; i < values.length ; i++ ) {
            if ( Math.abs( prev - values[i] ) >= maxDist ) {
                System.out.print( "\n" + ++cluster + " - " );
            }
            System.out.print( values[i] + " " );
            prev = values[i];
        }
    }
}

EDIT: This approach will work if the clusters are not too close, like in your example of values. The k-means requires a known k (number of clusters) wich was not mentioned in your question. After separating the clusters you easilly find the "centers" as mean-values.

stacker
+3  A: 

For instance, k-means clustering in R yields the following:

R> x <- c(5, 7, 7, 8, 8, 8, 7, 20, 23, 23, 24, 24, 24, 25)
R> kmeans(as.matrix(x), centers=2)
K-means clustering with 2 clusters of sizes 7, 7

Cluster means:
    [,1]
1 23.286
2  7.143

Clustering vector:
 [1] 2 2 2 2 2 2 2 1 1 1 1 1 1 1

Within cluster sum of squares by cluster:
[1] 15.429  6.857

Available components:
[1] "cluster"  "centers"  "withinss" "size"   
rcs
+2  A: 

plot the probability density (think histogram) with some smoothing factor then find the peaks (centres of clusters) and troughs (division between clusters)

wroscoe
+1  A: 

There are a HUGE number of good approaches to this problem, and the method you ultimately should use will depend on the type of data you're dealing with (i.e., how it's distributed, dimensionality of the data points, possibly overlapping clusters, robustness to outliers, etc.).

As was said, the first thing to try would be k-means clustering. You might also want to have a look at a simple variant called k-medoids (a.k.a. Partitioning Around Medoids (PAM)) which is more robust to outliers than k-means.

One thing to note about both k-means and k-medoids is the existence of the parameter k (the number of clusters). If you won't know the number of clusters a priori, there are a variety of techniques to select k automatically (cross-validation, silhouette score, etc.); see Cluster Analysis and Finite Mixture Models for a more comprehensive list of cluster analysis implementations in R.

My personal favorite clustering technique would be a Gaussian Mixture Model (GMM). I commonly use a good implementation of GMMs via an R package called MCLUST which identifies the number of clusters automatically using the Bayesian Information Criterion.

Once you select a method to identify cluster membership (i.e., which data points are grouped together into sets), you can then average them or do with the data as you will.

awesomo
+1 for GMM: a generalization of k-means.
Steve