I am planning to use Orange for k-means clustering. I have gone through the tutorials, but I still have a couple of questions which I would like to ask:

I am dealing with clustering on high-dimensional vectors.

1) Is there a cosine distance implemented?

2) I do not want to put zeros in the empty values. I tried leaving the empty fields blank (no zeros) and am getting the error: SystemError: 'orange.TabDelimExampleGenerator': the number of attribute types does not match the number of attributes. How do I indicate an empty value?

3) Is there a way to incorporate an "ID" into the example table? I want to label my data by an ID (NOT a classification) for easier reference; I do not want the ID column to be an official part of my data.

4) Is there a way to output the k-means clustering differently? I would much prefer something in this format: cluster1: [ , , ...] cluster2: [ , ... ] rather than just [1, 2, 3, 1, 2, ...]

Thanks!

A: 

Four questions in one question is extremely awkward -- why not make each question its own question? It's not as if it would cost you ;-). Anyway, wrt "How do I indicate an empty value?", see the docs on the value attribute of instances of Orange.Value:

If value is continuous or unknown, no descriptor is needed. For the latter, the result is a string '?', '~' or '.' for don't know, don't care and other, respectively.
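For instance, in the tab-delimited file itself you can simply write ? in a cell that has no value, rather than leaving it empty or inventing a zero. Programmatically, something along these lines should work too (a minimal sketch; the data set name "mydata" and feature name "f1" are made up):

import orange

data = orange.ExampleTable("mydata")  # loads mydata.tab; the name is made up
data[0]["f1"] = "?"                   # mark feature f1 of the first instance as "don't know"
print data[0]                         # the unknown value is displayed as ?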

I'm not sure if by empty you mean "don't know" or "don't care", but anyway you can indicate either. Take care about distances, however -- from this other page in the docs:

Unknown values are treated correctly only by Euclidean and Relief distance. For other measure of distance, a distance between unknown and known or between two unknown values is always 0.5.

The distances listed on that page are Hamming, Maximal, Manhattan, Euclidean and Relief (Relief is like Manhattan but with correct treatment of unknown values) -- no Cosine distance is provided: you'll have to code it yourself.
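If you do roll your own, the arithmetic is only a few lines of plain Python. The sketch below takes two equal-length sequences of numbers (i.e. the continuous feature values already pulled out of your instances) and ignores the unknown-value issue discussed above; hooking it into Orange's own k-means is a separate exercise:

import math

def cosine_distance(u, v):
  # 1 - cosine similarity: 0 for vectors pointing the same way, up to 2 for opposite ones
  dot = sum(a * b for a, b in zip(u, v))
  norm_u = math.sqrt(sum(a * a for a in u))
  norm_v = math.sqrt(sum(b * b for b in v))
  if norm_u == 0 or norm_v == 0:
    return 1.0  # arbitrary convention when one vector is all zeros
  return 1.0 - dot / (norm_u * norm_v)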

For (4), with just a little Python code you can obviously format results in any way you want. The .clusters attribute of a KMeans object is a list, exactly as long as the number of data instances: if what you want is a list of lists of data instances, for example:

import orange

def loldikm(data, **k):
  # run k-means, then regroup the data instances by the cluster each was assigned to
  km = orange.KMeans(data, **k)
  results = [[] for _ in km.centroids]   # one (initially empty) list per cluster
  for i, d in zip(km.clusters, data):    # km.clusters[j] is the cluster index of data[j]
    results[i].append(d)
  return results
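For illustration, here is one way to print that list of lists in the cluster1: [...] style you asked for (a sketch: "mydata" is a made-up data set name, and any keyword arguments to loldikm are simply forwarded to KMeans):

data = orange.ExampleTable("mydata")
clusters = loldikm(data)  # pass k-means options through **k if desired
for n, members in enumerate(clusters):
  print "cluster%d:" % (n + 1), members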
Alex Martelli