Hi,

I'm trying to program a k-means algorithm in Java. I have calculated a number of arrays, each containing a number of coefficients, and I need a k-means algorithm to group all this data. Do you know of any implementation of this algorithm?

Thanks

+2  A: 

There's a very nice Python implementation of K-means clustering in "Programming Collective Intelligence". I highly recommend it.

I realize that you'll have to translate it to Java, but that doesn't look too difficult.

duffymo
Thanks. I've been looking for a practical companion to my (old) machine learning textbook for some time now.
hythlodayr
+1  A: 

I haven't studied the code myself, but there's a multithreaded K-means implementation given in this JavaWorld article that looks pretty instructive.

jtb
+1 - nice find. "PCI" is still recommended, because it's got a lot of great stuff besides K-means.
duffymo
+2  A: 

Classification, clustering, and grouping are well-developed areas of IR. There is a very good open-source Java library called WEKA, which includes several clustering algorithms. Although there is a learning curve, it might be useful when you encounter harder problems.

minoriole
A: 
ldog
A: 

OpenCV is one of the most horribly written libraries I've ever had to use. On the other hand, Matlab does it very neatly.

If you have to code it yourself, the algorithm is incredibly simple for how efficient it is.

  1. Pick the number of clusters (k)
  2. Make k points (they're going to be the centroids)
  3. Randomize the locations of these points
  4. Calculate the Euclidean distance from each data point to every centroid
  5. Assign 'membership' of each data point to the nearest centroid
  6. Establish the new centroids by averaging the locations of all points belonging to a given cluster
  7. Go to step 4 and repeat until convergence is achieved or the changes become negligible.
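
Since the question asks about Java, here is a rough sketch of the steps above. It is only an illustration under a few assumptions of mine: the class and method names (`KMeansSketch`, `cluster`) are made up, the starting centroids are k distinct data points picked at random (one common way to "randomize" them), an empty cluster simply keeps its old centroid, and "convergence" means no point changed cluster.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical names; a minimal sketch of steps 1-7, not a tuned implementation.
final class KMeansSketch {

    // Squared Euclidean distance; it ranks centroids the same way the true distance does.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Returns, for each point, the index (0..k-1) of its cluster. Assumes 1 <= k <= points.length.
    static int[] cluster(double[][] points, int k, int maxIterations, Random rng) {
        int n = points.length;
        int dim = points[0].length;

        // Steps 2-3: shuffle the point indices (Fisher-Yates) and copy the first k
        // points as starting centroids. This choice is an assumption of the sketch.
        int[] order = new int[n];
        for (int i = 0; i < n; i++) order[i] = i;
        for (int i = n - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = points[order[c]].clone();

        int[] labels = new int[n];
        Arrays.fill(labels, -1); // -1 means "not assigned yet"

        for (int iter = 0; iter < maxIterations; iter++) {
            // Steps 4-5: assign every point to its nearest centroid.
            boolean changed = false;
            for (int p = 0; p < n; p++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(points[p], centroids[c])
                            < squaredDistance(points[p], centroids[best])) {
                        best = c;
                    }
                }
                if (labels[p] != best) {
                    labels[p] = best;
                    changed = true;
                }
            }

            // Step 7: stop once no point has changed cluster.
            if (!changed) break;

            // Step 6: move each centroid to the mean of its members.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < n; p++) {
                counts[labels[p]]++;
                for (int d = 0; d < dim; d++) sums[labels[p]][d] += points[p][d];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // empty cluster: keep the old centroid
                for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
            }
        }
        return labels;
    }
}
```

You would call it as `KMeansSketch.cluster(data, k, 100, new Random())`, where each row of `data` is one of your coefficient arrays. The result depends on the random start, so it is usually worth running it a few times and comparing.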
Marcin
Using OpenCV for KMeans might be overkill, but I don't see how OpenCV is "horribly" written. It may not be as easy to use as Matlab (Matlab is proprietary, slow, and meant to be an easy way to test out algorithms using the large number of algorithms it already ships with), but it is certainly much faster than Matlab, simply by virtue of being coded in C.
ldog
+1  A: 

KMeans really is an easy algorithm. Is there any good reason not to hand-code it yourself? I did it in Qt and then ported the code to plain old STL without too many problems.

I'm becoming a fan of Joel's idea of no external dependencies. So please feel free to tell me what's good about depending on a large piece of software you don't control, especially when others on this question have already mentioned that it's not a good piece of software.

Talk is cheap; real men show their code to the world: http://github.com/elcuco/data%5Fmining%5Fdemo

I should clean up the code a little to make it more generic, and the current version is not ported to STL, but it's a start!

elcuco
Hi elcuco, I have coded it myself, but wanted to cross-check the initialization part. I wanted to see how other implementations assign the initial clusters. I also think it's not a good idea to use code you don't have control over. I'll keep digging, thank you all!
dedalo