ansaurus

Question

Answer 1

+2 A:

I'm pretty sure that this is because you are using the cosine metric when you are calling fclusterdata. Try using euclidean and see if the error goes away.

The cosine metric can go negative if the dot product of two vectors in your set is greater than 1. Since you are using very large numbers and normalizing them, I'm pretty sure that the dot products are greater than 1 a lot of the time in your data set. If you want to use the cosine metric, then you'll need to normalize your data such that the dot product of two vectors is never greater than 1. See the formula on this page to see what the cosine metric is defined as in Scipy.

Edit:

Well, from looking at the source code I think that the formula listed on that page isn't actually the formula that Scipy uses (which is good because the source code looks like it is using the normal and correct cosine distance formula). However, by the time it creates to the linkage, there are clearly some negative values in the linkage for whatever reason. Try finding the distance between your vectors with scipy.spatial.distance.pdist() with method='cosine' and check for negative values. If there aren't any, then it has to do with how the linkage is formed using the distance values.

Justin Peel 2010-04-07 05:18:54

Great answer. Concerning "normalize your data", what are my options in normalizing my data such that I can still use the cosine distance native in scipy? I have tried calculating without any form of normailization, (using just the native tfidf values). Needless to say, the problem still persists because of the inaccuracies of the floating point number added at such great lengths. What would you recommend me doing?

disappearedng 2010-04-07 08:47:29

First, you should check to see where the problem is. Is it after the distances are calculated? If the cosine method is done correctly (which I think that it is now in spite of the documentation saying otherwise), then no normalization is needed. By the way, try using 'old_cosine' as your metric and see if you still get the error.

Justin Peel 2010-04-07 14:05:47

Answer 2

A:

I'm not able to improve the answer of Justin, but another point of note is your data handling.

You say you do something like int( float("0.0003") * 10000 ) to read the data. But if you do that you'd get not 3 but 2.9999999999999996. That's because the floating point inaccuracies just get multiplied.

A better, or at least more accurate. way would be by doing the multiplication in the string. That is, using string manipulation to get from 0.0003 to 3.0 and so forth.

Perhaps there even is an Python data type extension somewhere which can read in this kind of data without loss of precision on which you can perform the multiplication before conversion. I'm not at home in SciPy/numerics so I don't know.

EDIT

Justin commented that there is a decimal type build within python. And that can interpret strings, multiply with integers and convert to float (I tested that). That being the case I would recommend updating your logic like:

factor = 1
if inflate:
  factor = 10000
scores = map(lambda x: float(decimal.Decimal(x) * factor), l[1:])

That would at reduce your rounding problems a bit.

extraneon 2010-04-07 06:16:05

Yes, there is such a module. It is called decimal. http://docs.python.org/library/decimal.html

Justin Peel 2010-04-07 14:06:50

ansaurus

tags:

views:

answers:

Scipy Negative Distance? What?

related questions