views:

373

answers:

3

I'm writing an agglomerative clustering algorithm in java and having trouble with a remove operation. It seems to always fail when the number of clusters reaches half the initial number.

In the sample code below, clusters is a Collection<Collection<Integer>>.

      while(clusters.size() > K){
           // determine smallest distance between clusters
           Collection<Integer> minclust1 = null;
           Collection<Integer> minclust2 = null;
           double mindist = Double.POSITIVE_INFINITY;

           for(Collection<Integer> cluster1 : clusters){
                for(Collection<Integer> cluster2 : clusters){
                     if( cluster1 != cluster2 && getDistance(cluster1, cluster2) < mindist){
                          minclust1 = cluster1;
                          minclust2 = cluster2;
                          mindist = getDistance(cluster1, cluster2);
                     }
                }
           }

           // merge the two clusters
           minclust1.addAll(minclust2);
           clusters.remove(minclust2);
      }

After a few runs through the loop, clusters.remove(minclust2) eventually returns false, but I don't understand why.

I tested this code by first creating 10 clusters, each with one integer from 1 to 10. Distances are random numbers between 0 and 1. Here's the output after adding a few println statements. After the number of clusters, I print out the actual clusters, the merge operation, and the result of clusters.remove(minclust2).

Clustering: 10 clusters
[[3], [1], [10], [5], [9], [7], [2], [4], [6], [8]]
[5] <- [6]
true
Clustering: 9 clusters
[[3], [1], [10], [5, 6], [9], [7], [2], [4], [8]]
[7] <- [8]
true
Clustering: 8 clusters
[[3], [1], [10], [5, 6], [9], [7, 8], [2], [4]]
[10] <- [9]
true
Clustering: 7 clusters
[[3], [1], [10, 9], [5, 6], [7, 8], [2], [4]]
[5, 6] <- [4]
true
Clustering: 6 clusters
[[3], [1], [10, 9], [5, 6, 4], [7, 8], [2]]
[3] <- [2]
true
Clustering: 5 clusters
[[3, 2], [1], [10, 9], [5, 6, 4], [7, 8]]
[10, 9] <- [5, 6, 4]
false
Clustering: 5 clusters
[[3, 2], [1], [10, 9, 5, 6, 4], [5, 6, 4], [7, 8]]
[10, 9, 5, 6, 4] <- [5, 6, 4]
false
Clustering: 5 clusters
[[3, 2], [1], [10, 9, 5, 6, 4, 5, 6, 4], [5, 6, 4], [7, 8]]
[10, 9, 5, 6, 4, 5, 6, 4] <- [5, 6, 4]
false

The the [10, 9, 5, 6, 4, 5, 6, 4, ...] set just grows infinitely from there.

Edit: to clarify, I'm using a HashSet<Integer> for each cluster in clusters (a HashSet<HashSet<Integer>>).

A: 

The obvious problem there is that clusters.remove is probably using equals to find the element to remove. Unfortunately equals on collections generally compares whether the elements are the same, rather than if it is the same collection (I believe C# makes a better choice in this regard).

AN easy fix is to create clusters as Collections.newSetFromMap(new IdentityHashMap<Collection<Integer>, Boolean>()) (I think).

Tom Hawtin - tackline
Good point about equals, but even if it were using equals, why would the remove fail?
weiyin
+1  A: 

In the test shown, the remove fails the first time you try to remove a Collection containing more than one Integer. Is this always the case?

What is the concrete type of the Collection used?

Skip Head
Yes, you are on the right track. It does always fail the first time I try to remove a collection with more than one integer. I use a HashSet<Integer> for each cluster in clusters (HashSet<HashSet<Integer>>).
weiyin
That's odd. If you are using HashSet, why are you getting multiple values in the set.
Tom Hawtin - tackline
As I said above, a HashSet should not be able to contain duplicate objects. I think there is a deeper problem here.
Skip Head
And apparently retaining order.
Tom Hawtin - tackline
+5  A: 

Ah. When you alter a value that is already in a Set (or a Map key), then it is not necessarily in the right position and hash codes will be cached. You need to remove it, alter it and then re-insert it.

Tom Hawtin - tackline
Yes, you got it! The solution is to create a new cluster, add all the elements from minclust1 and minclust2, remove minclust1 and minclust2 from clusters, and add the new cluster. It looks like it's a bad idea to alter objects in a HashSet.
weiyin
Excellent. Immutability rocks. Technically you could alter an element is a HashSet so long as you don't upset its equals and hashCode, but those should depend upon either all or none of the data. If you had HashSet<java.awt.Component> you could alter the components without fear.
Tom Hawtin - tackline