I have a couple of numerical datasets that I need to create a concept hierarchy for. For now, I have been doing this manually by observing the data (and a corresponding line chart). Based on my intuition, I created some acceptable hierarchies.

This seems like a task that can be automated. Does anyone know if there is an algorithm to generate a concept hierarchy for numerical data?


To give an example, I have the following dataset:

Bangladesh     521
Brazil         8295
Burma          446
China          3259
Congo          2952
Egypt          2162
Ethiopia       333
France         46037
Germany        44729
India          1017
Indonesia      2239
Iran           4600
Italy          38996
Japan          38457
Mexico         10200
Nigeria        1401
Pakistan       1022
Philippines    1845
Russia         11807
South Africa   5685
Thailand       4116
Turkey         10479
UK             43734
US             47440
Vietnam        1042

(line chart of the dataset)

for which I created the following hierarchy:

  • LOWEST ( < 1000)
  • LOW (1000 - 2500)
  • MEDIUM (2501 - 7500)
  • HIGH (7501 - 30000)
  • HIGHEST ( > 30000)
+5  A: 

Maybe you need a clustering algorithm?

Quoting from the link:

Cluster analysis or clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields.
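
For one-dimensional data like yours this is easy to try. A rough sketch using scikit-learn's KMeans (the library choice and n_clusters=5, picked to mirror your five manual bins, are just assumptions, not a definitive recipe):

    # Rough sketch: 1-D k-means with scikit-learn; n_clusters=5 is arbitrary,
    # chosen to match the five manual bins in the question.
    import numpy as np
    from sklearn.cluster import KMeans

    values = np.array([521, 8295, 446, 3259, 2952, 2162, 333, 46037, 44729,
                       1017, 2239, 4600, 38996, 38457, 10200, 1401, 1022,
                       1845, 11807, 5685, 4116, 10479, 43734, 47440, 1042])

    # KMeans expects a 2-D array, so reshape the single column of values.
    labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(
        values.reshape(-1, 1))

    # Report each cluster as a value range; the ranges become the hierarchy.
    for k in range(5):
        members = values[labels == k]
        print(f"cluster {k}: {members.min()} - {members.max()}")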

Eli Bendersky
Thanks, that does seem to be what I need. I'm reading into it now.
Christophe Herreman
The problem with clustering this dataset (well, any dataset that isn't actually points in some space) is going to be choosing a proper distance metric for whatever algorithm you go with. I would guess a simple Euclidean distance is going to cause issues, given that you're looking for small ranges (1000-2500) where the values are closely spaced and much larger ranges (7501-30000) where they're not. Maybe something like Euclidean distance over the log space? It should be easy to give it a go at least.
Dusty
+3  A: 

I think you're looking for something akin to data discretization, which is fairly common in AI for converting continuous data (or discrete data with so many distinct values as to be unwieldy) into discrete classes.

I know Weka uses Fayyad & Irani's MDL method as well as Kononenko's MDL method; I'll see if I can dig up some references.
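
As a quick illustration of turning the numbers into classes (not the MDL methods, just a simple unsupervised equal-frequency binning sketch with pandas; the bin count and labels are arbitrary):

    # Unsupervised discretization sketch: equal-frequency binning with
    # pandas.qcut; this is only an illustration, not Fayyad & Irani's method.
    import pandas as pd

    data = pd.Series({"Bangladesh": 521, "Brazil": 8295, "Burma": 446,
                      "France": 46037, "Germany": 44729, "India": 1017,
                      "Mexico": 10200, "Russia": 11807, "UK": 43734,
                      "US": 47440})  # abbreviated from the question's data

    classes = pd.qcut(data, q=5,
                      labels=["LOWEST", "LOW", "MEDIUM", "HIGH", "HIGHEST"])
    print(classes)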

Dusty
Thanks for the info.
Christophe Herreman
+1 for the discretization idea, although the MDL-/entropy-based methods you mentioned are both supervised, which is not the case here.
Amro
Yeah, that's a good call. The last time I needed to do any discretization was to train a naive Bayes classifier (supervised, obviously).
Dusty
+4  A: 

Jenks Natural Breaks is a very efficient single-dimension clustering scheme: http://www.spatialanalysisonline.com/OUTPUT/html/Univariateclassificationschemes.html#_Ref116892931

As comments have noted, this is very similar to k-means. However, I've found it even easier to implement, particularly the variation found in Borden Dent's Cartography: http://www.amazon.com/Cartography-Thematic-Borden-D-Dent/dp/0697384950

John the Statistician
Interesting. Do you know if there is an implementation available?
Christophe Herreman
It's built into ArcGIS, if you have access to that.
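Outside ArcGIS there is also a small third-party Python package, jenkspy, that implements natural breaks; I haven't checked it closely, so treat this as a sketch (the class-count keyword has changed between versions, so check your installed version's signature):

    # Sketch using the third-party jenkspy package (pip install jenkspy).
    import jenkspy

    values = [521, 8295, 446, 3259, 2952, 2162, 333, 46037, 44729, 1017,
              2239, 4600, 38996, 38457, 10200, 1401, 1022, 1845, 11807,
              5685, 4116, 10479, 43734, 47440, 1042]

    breaks = jenkspy.jenks_breaks(values, n_classes=5)
    print(breaks)  # break values delimiting the five classes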
John the Statistician
I don't unfortunately but thanks for the tip!
Christophe Herreman
The description of Jenks natural breaks reminds me a lot of k-means, given that your data has only one dimension. The end of the article at http://en.wikipedia.org/wiki/K-means_clustering gives pointers to implementations of k-means.
mcdowella
A: 

I was wondering.

Apparently what you are looking for are clean breaks. So before launching yourself into complicated algorithms, you may perhaps envision a differential approach. Take the sorted values:

[1, 1.2, 4, 5, 10]

and look at the relative increase between consecutive values:

[20%, 233%, 25%, 100%]

Now depending on the number of breaks we are looking for, it's a matter of selecting them:

2 categories: [1, 1.2] + [4, 5, 10]
3 categories: [1, 1.2] + [4, 5] + [10]

I don't know about you, but it does feel natural to me, and you can even use a threshold approach, saying that a variation of less than x% is not worth considering as a cut.

For example, here 4 categories does not seem to make much sense.
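
Here is a minimal sketch of this idea in Python (the 50% threshold is an arbitrary choice, and the function name is just illustrative):

    # Gap-based grouping: sort the values, then start a new group whenever
    # the relative jump to the next value exceeds a threshold.
    def gap_breaks(values, threshold=0.5):
        values = sorted(values)
        groups = [[values[0]]]
        for prev, cur in zip(values, values[1:]):
            if (cur - prev) / prev > threshold:  # big jump -> new group
                groups.append([cur])
            else:
                groups[-1].append(cur)
        return groups

    print(gap_breaks([1, 1.2, 4, 5, 10]))  # [[1, 1.2], [4, 5], [10]]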

Matthieu M.
A: 

This is only a 1-dimensional problem, so there may be a dynamic programming solution. Assume that it makes sense to take the points in sorted order and then make n-1 cuts to generate n clusters. Assume that you can write down a penalty function f() for each cluster, such as the variance within the cluster or the distance between the min and max in the cluster. You can then minimise the sum of f() evaluated over all clusters.

Work one point at a time, from left to right. At each point, for each cluster count from 1 up to the target number minus one, work out the best way to split the points seen so far into that many clusters, and store the cost of that answer and the location of its rightmost split. You can work this out for point P and cluster count c as follows: consider all possible cuts to the left of P. For each cut, add f() evaluated on the group of points to the right of the cut to the (stored) cost of the best solution for cluster count c-1 at the point just to the left of the cut. Once you have worked your way to the far right, do the same trick once more to get the best answer for the full cluster count, and use the stored locations of rightmost splits to recover all the splits that give that best answer.

This might actually be more expensive than a k-means variant, but it has the advantage of guaranteeing a globally best answer (for your chosen f(), under these assumptions).
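
As a rough illustration (not a tuned implementation), here is the dynamic program in Python with f() taken to be the within-cluster sum of squared deviations; any other penalty with the same shape would slot in the same way:

    # Dynamic programming over sorted 1-D points: best[c][i] is the minimal
    # total penalty for splitting the first i points into c contiguous clusters.
    def cost(points):
        """Penalty f(): sum of squared deviations from the cluster mean."""
        mean = sum(points) / len(points)
        return sum((p - mean) ** 2 for p in points)

    def best_clustering(values, k):
        xs = sorted(values)
        n = len(xs)
        INF = float("inf")
        best = [[INF] * (n + 1) for _ in range(k + 1)]
        cut = [[0] * (n + 1) for _ in range(k + 1)]  # rightmost split per state
        best[0][0] = 0.0
        for c in range(1, k + 1):
            for i in range(c, n + 1):
                for j in range(c - 1, i):  # the last cluster is xs[j:i]
                    candidate = best[c - 1][j] + cost(xs[j:i])
                    if candidate < best[c][i]:
                        best[c][i] = candidate
                        cut[c][i] = j
        # Walk the stored rightmost splits backwards to recover the clusters.
        clusters, i = [], n
        for c in range(k, 0, -1):
            j = cut[c][i]
            clusters.append(xs[j:i])
            i = j
        return list(reversed(clusters))

    print(best_clustering([1, 1.2, 4, 5, 10], 3))  # [[1, 1.2], [4, 5], [10]]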

mcdowella