I have a couple of numerical datasets that I need to create a concept hierarchy for. For now, I have been doing this manually by observing the data (and a corresponding line chart). Based on my intuition, I created some acceptable hierarchies.

This seems like a task that can be automated. Does anyone know if there is an algorithm to generate a concept hierarchy for numerical data?


To give an example, I have the following dataset:

Bangladesh     521
Brazil         8295
Burma          446
China          3259
Congo          2952
Egypt          2162
Ethiopia       333
France         46037
Germany        44729
India          1017
Indonesia      2239
Iran           4600
Italy          38996
Japan          38457
Mexico         10200
Nigeria        1401
Pakistan       1022
Philippines    1845
Russia         11807
South Africa   5685
Thailand       4116
Turkey         10479
UK             43734
US             47440
Vietnam        1042

(line chart of the dataset)

for which I created the following hierarchy:

  • LOWEST ( < 1000)
  • LOW (1000 - 2500)
  • MEDIUM (2501 - 7500)
  • HIGH (7501 - 30000)
  • HIGHEST ( > 30000)
+5  A: 

Maybe you need a clustering algorithm?

Quoting from the link:

Cluster analysis or clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields.
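
For one-dimensional data like yours this is easy to try. A rough sketch using scikit-learn's KMeans (the library choice and n_clusters=5, picked to mirror your five manual bins, are just assumptions, not a definitive recipe):

    # Rough sketch: 1-D k-means with scikit-learn; n_clusters=5 is arbitrary,
    # chosen to match the five manual bins in the question.
    import numpy as np
    from sklearn.cluster import KMeans

    values = np.array([521, 8295, 446, 3259, 2952, 2162, 333, 46037, 44729,
                       1017, 2239, 4600, 38996, 38457, 10200, 1401, 1022,
                       1845, 11807, 5685, 4116, 10479, 43734, 47440, 1042])

    # KMeans expects a 2-D array, so reshape the single column of values.
    labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(
        values.reshape(-1, 1))

    # Report each cluster as a value range; the ranges become the hierarchy.
    for k in range(5):
        members = values[labels == k]
        print(f"cluster {k}: {members.min()} - {members.max()}")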

Eli Bendersky
Thanks, that does seem to be what I need. I'm reading into it now.
Christophe Herreman
The problem with clustering this dataset (well, any dataset that isn't actually points in some space) is going to be choosing a proper distance metric for whatever algorithm you go with. I would guess a simple Euclidean distance is going to cause issues, given that you're looking for small ranges (1000-2500) where the values are closely spaced and much larger ranges (7501-30000) where they're not. Maybe something like Euclidean distance over the log space? It should be easy to give it a go at least.
Dusty
+3  A: 

I think you're looking for something akin to data discretization, which is fairly common in AI for converting continuous data (or discrete data with so many distinct values as to be unwieldy) into discrete classes.

I know Weka uses Fayyad & Irani's MDL method as well as Kononenko's MDL method; I'll see if I can dig up some references.
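
As a quick illustration of turning the numbers into classes (not the MDL methods, just a simple unsupervised equal-frequency binning sketch with pandas; the bin count and labels are arbitrary):

    # Unsupervised discretization sketch: equal-frequency binning with
    # pandas.qcut; this is only an illustration, not Fayyad & Irani's method.
    import pandas as pd

    data = pd.Series({"Bangladesh": 521, "Brazil": 8295, "Burma": 446,
                      "France": 46037, "Germany": 44729, "India": 1017,
                      "Mexico": 10200, "Russia": 11807, "UK": 43734,
                      "US": 47440})  # abbreviated from the question's data

    classes = pd.qcut(data, q=5,
                      labels=["LOWEST", "LOW", "MEDIUM", "HIGH", "HIGHEST"])
    print(classes)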

Dusty
Thanks for the info.
Christophe Herreman
+1 for the discretization idea, although the MDL-/entropy-based methods you mentioned are both supervised, which is not the case here.
Amro
Yeah, that's a good call. The last time I needed to do any discretization was to train a naive Bayes classifier (supervised, obviously).
Dusty
+4  A: 

Jenks Natural Breaks is a very efficient single-dimension clustering scheme: http://www.spatialanalysisonline.com/OUTPUT/html/Univariateclassificationschemes.html#_Ref116892931

As comments have noted, this is very similar to k-means. However, I've found it even easier to implement, particularly the variation found in Borden Dent's Cartography: http://www.amazon.com/Cartography-Thematic-Borden-D-Dent/dp/0697384950

John the Statistician
Interesting. Do you know if there is an implementation available?
Christophe Herreman
It's built into ArcGIS, if you have access to that.
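Outside ArcGIS there is also a small third-party Python package, jenkspy, that implements natural breaks; I haven't checked it closely, so treat this as a sketch (the class-count keyword has changed between versions, so check your installed version's signature):

    # Sketch using the third-party jenkspy package (pip install jenkspy).
    import jenkspy

    values = [521, 8295, 446, 3259, 2952, 2162, 333, 46037, 44729, 1017,
              2239, 4600, 38996, 38457, 10200, 1401, 1022, 1845, 11807,
              5685, 4116, 10479, 43734, 47440, 1042]

    breaks = jenkspy.jenks_breaks(values, n_classes=5)
    print(breaks)  # break values delimiting the five classes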
John the Statistician
I don't unfortunately but thanks for the tip!
Christophe Herreman
The description of Jenks natural breaks reminds me a lot of k-means, given that your data has only one dimension. The end of the article at http://en.wikipedia.org/wiki/K-means_clustering gives pointers to implementations of k-means.
mcdowella
A: 

I was wondering.

Apparently what you are looking for are clean breaks. So before launching yourself into complicated algorithms, you may perhaps envision a differential approach. Take the sorted values:

[1, 1.2, 4, 5, 10]

and look at the relative increase between consecutive values:

[20%, 233%, 25%, 100%]

Now depending on the number of breaks we are looking for, it's a matter of selecting them:

2 categories: [1, 1.2] + [4, 5, 10]
3 categories: [1, 1.2] + [4, 5] + [10]

I don't know about you, but it does feel natural to me, and you can even use a threshold approach, saying that a variation of less than x% is not worth considering as a cut.

For example, here 4 categories does not seem to make much sense.
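
Here is a minimal sketch of this idea in Python (the 50% threshold is an arbitrary choice, and the function name is just illustrative):

    # Gap-based grouping: sort the values, then start a new group whenever
    # the relative jump to the next value exceeds a threshold.
    def gap_breaks(values, threshold=0.5):
        values = sorted(values)
        groups = [[values[0]]]
        for prev, cur in zip(values, values[1:]):
            if (cur - prev) / prev > threshold:  # big jump -> new group
                groups.append([cur])
            else:
                groups[-1].append(cur)
        return groups

    print(gap_breaks([1, 1.2, 4, 5, 10]))  # [[1, 1.2], [4, 5], [10]]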

Matthieu M.
A: 

This is only a 1-dimensional problem, so there may be a dynamic programming solution. Assume that it makes sense to take the points in sorted order and then make n-1 cuts to generate n clusters. Assume that you can write down a penalty function f() for each cluster, such as the variance within the cluster or the distance between the min and max in the cluster. You can then minimise the sum of f() evaluated over all clusters.

Work one point at a time, from left to right. At each point, for each cluster count from 1 up to the target number minus one, work out the best way to split the points seen so far into that many clusters, and store the cost of that answer and the location of its rightmost split. You can work this out for point P and cluster count c as follows: consider all possible cuts to the left of P. For each cut, add f() evaluated on the group of points to the right of the cut to the (stored) cost of the best solution for cluster count c-1 at the point just to the left of the cut. Once you have worked your way to the far right, do the same trick once more to get the best answer for the full cluster count, and use the stored locations of rightmost splits to recover all the splits that give that best answer.

This might actually be more expensive than a k-means variant, but it has the advantage of guaranteeing a globally best answer (for your chosen f(), under these assumptions).
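
As a rough illustration (not a tuned implementation), here is the dynamic program in Python with f() taken to be the within-cluster sum of squared deviations; any other penalty with the same shape would slot in the same way:

    # Dynamic programming over sorted 1-D points: best[c][i] is the minimal
    # total penalty for splitting the first i points into c contiguous clusters.
    def cost(points):
        """Penalty f(): sum of squared deviations from the cluster mean."""
        mean = sum(points) / len(points)
        return sum((p - mean) ** 2 for p in points)

    def best_clustering(values, k):
        xs = sorted(values)
        n = len(xs)
        INF = float("inf")
        best = [[INF] * (n + 1) for _ in range(k + 1)]
        cut = [[0] * (n + 1) for _ in range(k + 1)]  # rightmost split per state
        best[0][0] = 0.0
        for c in range(1, k + 1):
            for i in range(c, n + 1):
                for j in range(c - 1, i):  # the last cluster is xs[j:i]
                    candidate = best[c - 1][j] + cost(xs[j:i])
                    if candidate < best[c][i]:
                        best[c][i] = candidate
                        cut[c][i] = j
        # Walk the stored rightmost splits backwards to recover the clusters.
        clusters, i = [], n
        for c in range(k, 0, -1):
            j = cut[c][i]
            clusters.append(xs[j:i])
            i = j
        return list(reversed(clusters))

    print(best_clustering([1, 1.2, 4, 5, 10], 3))  # [[1, 1.2], [4, 5], [10]]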

mcdowella