After performing a cluster analysis on my dataset (a data frame named data.matrix), I added a new column, named cluster, at the end (column 27), containing the name of the cluster each instance belongs to.

What I want now is a representative instance from each cluster. I tried to find the instance with the smallest Euclidean distance from the cluster's centroid (repeating the procedure for each of my clusters).

This is what I did. Can you think of other, perhaps more elegant, ways? (Assume numeric columns with no nulls.)

clusters <- levels(data.matrix$cluster)
cluster_col <- 27

for (j in seq_along(clusters)) {
    # get the subset for cluster j
    data <- data.matrix[data.matrix$cluster == clusters[j], ]

    # remove the cluster column
    data <- data[, -cluster_col]

    # calculate the centroid (the vector of column means for this cluster)
    cent <- colMeans(data)

    # copy data to data.matrix_cl, attaching a distance column at the end
    data.matrix_cl <- cbind(data, dist = apply(data, 1, function(x) sqrt(sum((x - cent)^2))))

    # get instances with min distance
    candidates <- data.matrix_cl[data.matrix_cl$dist == min(data.matrix_cl$dist), ]

    # print their rownames
    print(paste("Candidates for cluster", j))
    print(rownames(candidates))
}
A: 

Is the technique you're interested in 'k-means clustering'? If so, here's how the centroids are calculated at each iteration:

  1. choose a value for k (an integer specifying the number of clusters to divide your data set into);

  2. randomly select k rows from your data set; those are the centroids for the 1st iteration;

  3. calculate the distance from each data point to each centroid;

  4. each data point has a 'closest centroid', which determines its 'group';

  5. calculate the mean of each group; those are the new centroids;

  6. go back to step 3 (the stopping criterion is usually based on comparing the centroid values across successive iterations, e.g., quit if the values change by no more than 0.01%).

Those steps in code:

# toy data set: 12 rows, 5 columns
mx = matrix(runif(60, 10, 99), nrow=12, ncol=5, byrow=F)
cndx = sample(nrow(mx), 2)
# the two centroids at iteration 1
cn1 = mx[cndx[1],]
cn2 = mx[cndx[2],]
# Euclidean distance from a data point to each centroid
fnx1 = function(a){sqrt(sum((cn1 - a)^2))}
fnx2 = function(a){sqrt(sum((cn2 - a)^2))}
# calculate the distance of each row to each centroid
dx1 = apply(mx, 1, fnx1)
dx2 = apply(mx, 1, fnx2)
dx = rbind(dx1, dx2)
# index of the closest centroid for each data point (its 'group')
ndx = apply(dx, 2, which.min)
group1 = mx[ndx==1,]
group2 = mx[ndx==2,]
# calculate the new centroids for the next iteration
new_cnt1 = apply(group1, 2, mean)
new_cnt2 = apply(group2, 2, mean)
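
The code above performs a single iteration; a minimal sketch of the full loop with the step-6 stopping rule (assuming the same mx, k = 2, and an illustrative 0.01% relative-change threshold; empty groups are not handled) might be:

k <- 2
# initial centroids: k randomly chosen rows (step 2)
cent <- mx[sample(nrow(mx), k), ]
repeat {
  # distance of every row to every centroid (step 3)
  dmat <- apply(cent, 1, function(cn) apply(mx, 1, function(a) sqrt(sum((a - cn)^2))))
  # closest centroid per row determines its group (step 4)
  grp <- apply(dmat, 1, which.min)
  # new centroids are the group means (step 5)
  new_cent <- t(sapply(1:k, function(g) colMeans(mx[grp == g, , drop = FALSE])))
  # quit when centroids change by no more than 0.01% (step 6)
  if (max(abs(new_cent - cent) / abs(cent)) < 1e-4) break
  cent <- new_cent
}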
doug
No, I'm not asking how to perform k-means clustering. My cluster analysis has already been completed, and each instance has been assigned the cluster it belongs to. Now that the clusters have been defined, I am asking which instance (from each cluster) is closest to its cluster's centroid (and could be considered an ideal representative of the instances in that cluster).
gd047
Well, that's what I gave you; see in particular the last two lines of my code. Each of those lines gives you an "ideal representative of the instances in the cluster." I don't think there is another meaning for "ideal representative" other than the value of the cluster centroid that would have been calculated had you continued for another iteration.
doug
No, what I want is the ID of a real instance, not a mean value: the instance that is nearest to the final centroid.
gd047
You really don't see how to get that from my code above? (i) Start with dx1 above (the distances from each data point to a single centroid); (ii) ndx = order(dx1); (iii) mx[ndx, ] (this gives you the data points ranked by distance from that centroid, from closest to farthest); (iv) to get the closest data point: mx[ndx, ][1, ].
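Putting those four steps into code (a minimal sketch, assuming the corrected dx1 and mx from the answer above):

ndx <- order(dx1)    # (ii) row indices sorted by distance to centroid 1
ranked <- mx[ndx, ]  # (iii) data points ranked from closest to farthest
ranked[1, ]          # (iv) the single closest data point
# equivalently, in one step:
mx[which.min(dx1), ]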
doug
I think the first line of the toy data should read: mx = matrix(runif(60, 10, 99), nrow=12, ncol=5, byrow=F)
JD Long
+5  A: 

First, I'm not sure your distance formula is right. I think it should be sqrt(sum((x-cent)^2)) or sum(abs(x-cent)); I assumed the first. Second, just printing the solution is not a good idea, so I first compute, then print. Third, I recommend using plyr, but I give both solutions (with and without plyr).

# Simulated data:
n <- 100
data.matrix <- cbind(
  data.frame(matrix(runif(26*n), n, 26)),
  cluster=sample(letters[1:6], n, replace=TRUE)
)
cluster_col <- which(names(data.matrix)=="cluster")

# With plyr:
require(plyr)
candidates <- dlply(data.matrix, "cluster", function(data) {
  dists <- colSums(laply(data[, -cluster_col], function(x) (x-mean(x))^2))
  rownames(data)[dists==min(dists)]
})

l_ply(names(candidates), function(c_name, c_list=candidates[[c_name]]) {
    print(paste("Candidates for cluster", c_name))
    print(c_list)
})

# without plyr
candidates <- tapply(
  1:nrow(data.matrix),
  data.matrix$cluster,
  function(id, data=data.matrix[id, ]) {
    dists <- rowSums(sapply(data[, -cluster_col], function(x) (x-mean(x))^2))
    rownames(data)[dists==min(dists)]
  }
)

invisible(lapply(names(candidates), function(c_name, c_list=candidates[[c_name]]) {
    print(paste("Candidates for cluster", c_name))
    print(c_list)
}))
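
If you want exactly one representative per cluster, ties can be broken with which.min (which takes the first minimum); here is a small variation on the tapply version (a sketch; representatives is just an illustrative name):

representatives <- tapply(
  1:nrow(data.matrix),
  data.matrix$cluster,
  function(id) {
    data <- as.matrix(data.matrix[id, -cluster_col])
    # squared Euclidean distances to the cluster centroid
    dists <- rowSums(sweep(data, 2, colMeans(data))^2)
    rownames(data)[which.min(dists)]
  }
)
representatives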
Marek
You are right about the distance formula, thank you. I will correct it. Thank you for your answers too.
gd047
I really like that the example here can be run easily. The structure of the answer makes learning easier. And the with/without-plyr comparison really shows how plyr makes writing concise code easier.
JD Long