tags:

views:

167

answers:

5
+1  Q: 

Design Problem

Recently I was faced with this interview question (K-Means Clustering solution). The design I came up with did not meet the expectations of the interviewer (to put simply I didnt get the job because I lost to another candidate on this design problem). I am wondering how many different / efficient / simply solutions can the SO community come up with (by doing this I am hoping to hone my skills):

To implement a simple algorithm to cluster people according to their weight and height. The data set includes a list of people with their weights and heights like so:

Person   Weight   Height
         (kg)     (inches)
Person 1 70        70

Person 2 75        80

Person 3 120       85

You can plot the data as a 2 dimensional data. Weight being one dimension and height being the other dimension. Weight can range from a minimum of 50kg to 150kg. Height can range from a minimum of 38inches to 90inches

Algorithm:

The algorithm (called K-means clustering) will cluster data into K groups goes as such:

  1. Start with K clusters. Each cluster is defined by its center point which will start of as random weight and random height. Pick random numbers from within the corresponding ranges defined above.

  2. For each person Calculate distance to center of each cluster using formula distance = sqrt(pow((wperson−wcenter), 2) + (pow(hperson−hcenter),2)) where wperson = weight of person, hperson = height of person wcenter = weight of cluster center point, hcenter = height of cluster center point

  3. Assign the person to the cluster with the shortest distance to center point of cluster

  4. After end of step 2, you will end up with K clusters each assigned with a set of people

  5. For each cluster, set the weight and height of the center point to the average of the people in the cluster wcenter = (sum of weight of each person in cluster)/(number of people in cluster) hcenter = (sum of height of each person in cluster)/number of people in cluster)

  6. Repeat steps 2 to 5 for 1000 iterations, then print out following information for each cluster.

    weight and height of center of cluster. list of people in cluster.

I am not looking for a implementation/solution but for a high level design. can you list the interfaces / classes etc. I dont want to give my solution now, but will post it later in the day?

A: 

I'm not sure what your question actually is, the steps you point out effectively define the algorithm you're talking about.

A better idea may be to include exactly what you did then people can give you some hints / tips as to where you might have gone wrong or what they would have done differently.

Robin Day
+1  A: 

Well, I would first tackle all the constants/magic numbers that reduce the reusability of the algorithm:

  • instead of a fixed number of iterations, use a stopping criterion (e.g., if clusters don't change too much, terminate)

  • don't restrict yourself to 2-dim data, use vectors

  • let the user define the number of clusters to be found

Then, you could hide some specifics behind interfaces, e.g. the distance might be calculated differently (for example, it might at some point have to cope with values other than double).

On the other hand, if you really have this simple problem, some of these generalizations might well be overkill - but that's what I would discuss with someone telling me to implement this algorithm.

__roland__
I dont u/s what u mean by dont restrict yourself to 2-dim data, use vectors?Agree with your logic to terminate when they converge and not go thru all the 1000 iterations.
newbee
I just meant that you might run into a similar problem where you have to cluster users w.r.t. height, age, weight (= three dimensions), height, age, weight, net income etc. So why not generalize the algorithm to cope with n-dimensional vectors, i.e. arbitrarily many properties of users/entities?
__roland__
A: 

That sounds like a really good way to do it. K-means will usually converge quickly (though not necessarily to the global optimum), so my one suggestion would be to run the algorithm until no more changes occur, rather than a fixed number of 1000 iterations. You could then repeat the entire process a few times with different random starting points.

One weakness of k-means is that it does require you to specify (i.e. guess) an appropriate value for k up-front. I think you would get points for asking the interviewer what an appropriate value for k would be, or, if there is no way to know, describing some goodness-of-fit measure and then calculating that measure for different values of k to find a "just low enough" value.

j_random_hacker
The question I was posed was not to discuss the merits or demerits of k-means but simply to come up with a design.
newbee
When proposing a design, you need to consider its merits and demerits. Don't you think?
j_random_hacker
+1  A: 

You can create the following classes:

  • Person to store data about persons and centers. Properties: id, weight and height. Method: calculateDistance
  • Cluster to store one center and a list of persons: Properties: center and list of Person. Method: calculateCenter.
  • KCluster to hold your algorithm and store a list of clusters: Property: list of Cluster. Methods: generateClusters.
fbinder
This is very similar to what I came up with, except that I had a few more abstract classes / interfaces to make the solution extensible (that was one feedback I received to my intial design). I have to tell you that the design was termed sloppy!
newbee
maybe you had too many!?
DrG
It was explicity asked that you create an extensible design?
fbinder
+2  A: 
Chap