ansaurus

Question

How to find center of clusters of numbers? statistics problem?

Answer 1

A:

Like that?

public class Cluster {
    public static void main(String[] args) {
        int maxDist = 5;
        char cluster = 'A';
        int[] values = { 5 , 7 , 7 , 8 , 8 , 8 , 7 , 20 , 23 , 23 , 24 , 24 , 24 , 25 };
        int prev = values[0];
        System.out.print( cluster + " - " + prev + " ");
        for ( int i = 1 ; i < values.length ; i++ ) {
            if ( Math.abs( prev - values[i] ) >= maxDist ) {
                System.out.print( "\n" + ++cluster + " - " );
            }
            System.out.print( values[i] + " " );
            prev = values[i];
        }
    }
}

EDIT: This approach will work if the clusters are not too close, like in your example of values. The k-means requires a known k (number of clusters) wich was not mentioned in your question. After separating the clusters you easilly find the "centers" as mean-values.

stacker 2010-01-08 11:56:32

Answer 2

+3 A:

For instance, k-means clustering in R yields the following:

R> x <- c(5, 7, 7, 8, 8, 8, 7, 20, 23, 23, 24, 24, 24, 25)
R> kmeans(as.matrix(x), centers=2)
K-means clustering with 2 clusters of sizes 7, 7

Cluster means:
    [,1]
1 23.286
2  7.143

Clustering vector:
 [1] 2 2 2 2 2 2 2 1 1 1 1 1 1 1

Within cluster sum of squares by cluster:
[1] 15.429  6.857

Available components:
[1] "cluster"  "centers"  "withinss" "size"

rcs 2010-01-08 12:38:15

Answer 3

+2 A:

plot the probability density (think histogram) with some smoothing factor then find the peaks (centres of clusters) and troughs (division between clusters)

wroscoe 2010-01-09 04:16:52

Answer 4

+1 A:

There are a HUGE number of good approaches to this problem, and the method you ultimately should use will depend on the type of data you're dealing with (i.e., how it's distributed, dimensionality of the data points, possibly overlapping clusters, robustness to outliers, etc.).

As was said, the first thing to try would be k-means clustering. You might also want to have a look at a simple variant called k-medoids (a.k.a. Partitioning Around Medoids (PAM)) which is more robust to outliers than k-means.

One thing to note about both k-means and k-medoids is the existence of the parameter k (the number of clusters). If you won't know the number of clusters a priori, there are a variety of techniques to select k automatically (cross-validation, silhouette score, etc.); see Cluster Analysis and Finite Mixture Models for a more comprehensive list of cluster analysis implementations in R.

My personal favorite clustering technique would be a Gaussian Mixture Model (GMM). I commonly use a good implementation of GMMs via an R package called MCLUST which identifies the number of clusters automatically using the Bayesian Information Criterion.

Once you select a method to identify cluster membership (i.e., which data points are grouped together into sets), you can then average them or do with the data as you will.

awesomo 2010-01-16 06:34:38

+1 for GMM: a generalization of k-means.

Steve 2010-01-18 06:17:21

ansaurus

tags:

views:

answers:

How to find center of clusters of numbers? statistics problem?

related questions