tags:

views: 170

answers: 2

I have an input file in which all values are floating point numbers to 4 decimal places.

i.e. 13359    0.0000    0.0000    0.0001    0.0001    0.0002    0.0003    0.0007    ... 

(the first field is the id). My class uses the loadVectorsFromFile method, which multiplies each value by 10000 and then int()s these numbers. On top of that, I also loop through each vector to ensure that there are no negative values inside. However, when I perform _hclustering, I keep seeing the error "Linkage Z contains negative values".

I seriously think this is a bug because:

  1. I checked my values,
  2. the values are nowhere near small enough or big enough to approach the limits of floating point numbers, and
  3. the formula that I used to derive the values in the file uses absolute values (my input is DEFINITELY right).

Can someone enlighten me as to why I am seeing this weird error? What is going on that is causing this negative distance error?

=====

def loadVectorsFromFile(self, limit, loc, assertAllPositive=True, inflate=True):
    """Inflate to prevent "negative" distances; values have 4 decimal places, so *10000."""
    vectors = {}
    self.winfo("Each vector is limited to length %d" % limit)
    with open(loc) as inf:
        for line in filter(None, inf.read().split('\n')):
            l = line.split('\t')
            if limit:
                scores = map(float, l[1:limit + 1])
            else:
                scores = map(float, l[1:])

            if inflate:
                # shift the 4 decimal places into the integer part; int might save space
                vectors[l[0]] = map(lambda x: int(x * 10000), scores)
            else:
                vectors[l[0]] = scores

    if assertAllPositive:
        # assert that no vector contains a negative value
        for dirID, l in vectors.iteritems():
            if any(x < 0 for x in l):
                self.werror("Vector %s has negative values!" % dirID)
    return vectors

def main(self, inputDir, outputDir, limit=0,
        inFname="data.vectors.all", mappingFname='all.id.features.group.intermediate'):
    """
    Loads vectors from a file and starts clustering
    INPUT
        vectors is { featureID: tfidfVector (list), }
    """
    IDFeatureDic = loadIdFeatureGroupDicFromIntermediate(pjoin(self.configDir, mappingFname))
    if not os.path.exists(outputDir):
        os.makedirs(outputDir)

    vectors = self.loadVectorsFromFile(limit, pjoin(inputDir, inFname))
    for threshold in map(lambda x: float(x) / 30, range(20, 30)):
        clusters = self._hclustering(threshold, vectors)
        if not clusters:
            continue
        outputLoc = pjoin(outputDir, "threshold.%s.result" % str(threshold))
        with open(outputLoc, 'w') as outf:
            for clusterNo, cluster in clusters.iteritems():
                outf.write('%s\n' % str(clusterNo))
                for featureID in cluster:
                    feature, group = IDFeatureDic[featureID]
                    outline = "%s\t%s\n" % (feature, group)
                    outf.write(outline.encode('utf-8'))
                outf.write("\n")

def _hclustering(self, threshold, vectors):
    """Function which you should call to vary the threshold.
    vectors:    { featureID: [ tfidf score, tfidf score, .. ] }
    """
    clusters = defaultdict(list)
    if len(vectors) > 1:
        try:
            results = hierarchy.fclusterdata(vectors.values(), threshold, metric='cosine')
        except ValueError, e:
            self.werror("_hclustering: %s" % str(e))
            return False

        # group featureIDs by the flat cluster index assigned to each vector
        for i, featureID in enumerate(vectors.keys()):
            clusters[results[i]].append(featureID)
        return clusters
    return False

+2  A: 

I'm pretty sure that this is because you are using the cosine metric when you are calling fclusterdata. Try using euclidean and see if the error goes away.
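
For instance, with the call from _hclustering above, only the metric argument changes (a minimal sketch; euclidean distances can never be negative, so the linkage should stay valid):

results = hierarchy.fclusterdata(vectors.values(), threshold, metric='euclidean')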

The cosine metric can go negative if the dot product of two vectors in your set is greater than 1. Since you are using very large numbers and not normalizing them, I'm pretty sure that the dot products are greater than 1 a lot of the time in your data set. If you want to use the cosine metric, then you'll need to normalize your data such that the dot product of two vectors is never greater than 1. See the formula on this page for how the cosine metric is defined in Scipy.
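
If you do want to keep the cosine metric, one option is to scale every vector to unit length up front: cosine distance is 1 - u.v/(|u||v|), and the dot product of two unit vectors never exceeds 1. A minimal sketch, assuming the vectors dict from the question (unit_normalize is an illustrative helper, not part of the original code):

import math

def unit_normalize(vec):
    # scale a vector to unit length; leave all-zero vectors untouched
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

# normalize every vector before handing them to fclusterdata
vectors = dict((k, unit_normalize(v)) for k, v in vectors.iteritems())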

Edit:

Well, from looking at the source code, I think that the formula listed on that page isn't actually the formula that Scipy uses (which is good, because the source code looks like it is using the normal and correct cosine distance formula). However, by the time it creates the linkage, there are clearly some negative values in the linkage for whatever reason. Try finding the distances between your vectors with scipy.spatial.distance.pdist() with metric='cosine' and check for negative values. If there aren't any, then it has to do with how the linkage is formed from the distance values.
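
A quick way to run that check (a sketch assuming the vectors dict from the question; pdist returns a condensed numpy array of pairwise distances):

from scipy.spatial import distance

# condensed matrix of pairwise cosine distances between all vectors
dists = distance.pdist(vectors.values(), metric='cosine')
print "min distance:      ", dists.min()
print "negative distances:", (dists < 0).sum()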

Justin Peel
Great answer. Concerning "normalize your data", what are my options for normalizing my data such that I can still use the cosine distance native to scipy? I have tried calculating without any form of normalization (using just the native tfidf values). Needless to say, the problem still persists because of the floating point inaccuracies that add up over vectors of such great length. What would you recommend I do?
disappearedng
First, you should check to see where the problem is. Is it after the distances are calculated? If the cosine method is done correctly (which I think it is now, in spite of the documentation saying otherwise), then no normalization is needed. By the way, try using 'old_cosine' as your metric and see if you still get the error.
Justin Peel
A: 

I'm not able to improve on Justin's answer, but another point of note is your data handling.

You say you do something like int( float("0.0003") * 10000 ) to read the data. But if you do that you get not 3 but 2.9999999999999996, because the floating point inaccuracy just gets multiplied, and int() then truncates that down to 2.
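
You can see this directly in a Python session:

>>> float("0.0003") * 10000
2.9999999999999996
>>> int(float("0.0003") * 10000)
2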

A better, or at least more accurate, way would be to do the multiplication on the string. That is, using string manipulation to get from 0.0003 to 3 and so forth, as in the sketch below.
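
A minimal sketch of that idea (shift4 is a hypothetical helper name; it assumes non-negative inputs with at most 4 decimal places, as in the question):

def shift4(s):
    # shift the decimal point 4 places right in the string itself,
    # so no float rounding can happen before the conversion to int
    whole, _, frac = s.partition('.')
    frac = (frac + '0000')[:4]    # pad/truncate to exactly 4 decimals
    return int(whole + frac)      # "0.0003" -> 3, "1.5" -> 15000

print shift4("0.0003")    # 3
print shift4("1.5")       # 15000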

Perhaps there even is a Python data type extension somewhere which can read in this kind of data without loss of precision, on which you could perform the multiplication before conversion. I'm not well versed in SciPy/numerics, so I don't know.

EDIT

Justin commented that there is a decimal type built into Python. It can parse strings, multiply by integers, and convert to float (I tested that). That being the case, I would recommend updating your logic like:

import decimal

factor = 1
if inflate:
    factor = 10000
scores = map(lambda x: float(decimal.Decimal(x) * factor), l[1:])

That would at least reduce your rounding problems a bit.

extraneon
Yes, there is such a module. It is called decimal. http://docs.python.org/library/decimal.html
Justin Peel