Grouping an ordered dataset into minimal number of clusters

views:

answers:

+2 Q:

Grouping an ordered dataset into minimal number of clusters

I have an ordered list of weighted items, weight of each is less-or-equal than N. I need to convert it into a list of clusters. Each cluster should span several consecutive items, and total weight of a cluster has to be less-or-equal than N.

Is there an algorithm which does it while minimizing the total number of clusters and keeping their weights as even as possible?

E.g. list [(a,5),(b,1),(c,2),(d,5)], N=6 should be converted into [([a],5),([b,c],3),([d],5)]

+1 A:

Your problem is under-specified.

The issue is that you are trying to optimize two different properties of the resulting data, and these properties may be in opposition to one another. For a given set of data, it may be that the most even distribution has many clusters, and that the smallest number of clusters has a very uneven distribution.

For example, consider: [(a,1),(b,1),(c,1),(d,1),(e,1)], N=2

The most even distribution is [([a],1),([b],1),([c],1),([d],1),([e],1)]

But the smallest number of clusters is [([a,b],2),([c,d],2),([e],1)]

How is an algorithm supposed to know which of these (or which clustering in between them) you want? You need find some way to quantify the tradeoff that you are willing to accept between number of clusters and evenness of distribution.

You can create an example with an arbitrarily large discrepancy between the two possibilities by creating any set with 2k + 1 elements, and assigning them all the value N/2. This will lead to the smallest number of clusters being k+1 clusters (k of 2 elements and 1 of 1) with a weight difference of N/2 between the largest and smallest clusters. And then the most even distribution for this set will be 2k + 1 clusters of 1 element each, with no weight difference.

Edit: Also, "evenness" itself is not a well-defined idea. Are you looking to minimize the largest absolute difference in weights between clusters, or the mean difference in weights, or the median difference in weights, or the standard deviation in weights?

Tyler McHenry 2010-07-28 16:15:31

Nopey, these targets are not in contradiction. Optimal solution in your case can clearly be seen, it's [a,b],[c,d], offering both minimal number of clusters and even distribution.

yk4ever 2010-07-28 16:20:38

Fixed the example.

Tyler McHenry 2010-07-28 16:27:36

That's better, but anyway - I think my original question states pretty clearly that minimization is a higher priority.

yk4ever 2010-07-28 20:59:30

Okay, but if you want an answer other than "just use a dynamic program for bin packing", which will end up minimizing clusters and ignoring evenness, you will still have to put the idea of evenness and the relationship between it and minimizing clusters into terms a computer can understand. Your original question is like asking a computer to "produce a picture using as few colors as possible that is as pretty as possible" -- the computer doesn't know what "prettiness" is or how you value prettiness versus number of colors unless you put it in quantifiable terms.

Tyler McHenry 2010-07-28 21:07:51

You are of course correct. There can be some "evenness" metric invented, obvious candidates are "mean square error" or "maximal minimum". But I didn't want to be too specific, since I wanted to hear about all relevant algorithms, and, if there's any, try to adapt it to my goals.

yk4ever 2010-07-29 06:58:33

+2 A:

Since the dataset is ordered, one possible approach is to assign a "badness" score to each possible cluster and use a dynamic program reminiscent of Knuth's word wrapping ( http://en.wikipedia.org/wiki/Word_wrap ) to minimize the sum of the badness scores. The badness function will let you explore tradeoffs between minimizing the number of clusters (larger constant term) and balancing them (larger penalty for deviating from the average number of items).

2010-07-28 18:19:25

Great, I thought about dynamic programming too. Alas, it seems to be too much of a work for my task, so I think I'm gonna bruteforce it or just approximate.

yk4ever 2010-07-28 20:57:18

ansaurus

tags:

views:

answers:

Grouping an ordered dataset into minimal number of clusters

related questions