Dear everyone, I have successfully debugged my own memory leak problem. However, I have noticed a very strange occurrence.

    for fid, fv in freqDic.iteritems():
        outf.write(fid+"\t")                #ID
        for i, term in enumerate(domain):   #Vector
            tfidf = self.tf(term, fv) * self.idf( term, docFreqDic)
            if i == len(domain) - 1:
                outf.write("%f\n" % tfidf)
            else:
                outf.write("%f\t" % tfidf)
        outf.flush()
        print "Memory increased by", int(self.memory_mon.usage()) - startMemory

    outf.close()

def tf(self, term, freqVector):
    total = freqVector[TOTAL]
    if total == 0:
        return 0
    if term not in freqVector:      ## Without these two lines, memory usage keeps growing,
        return 0                    ## because freqVector[term] inserts the missing key
    return float(freqVector[term]) / total


def idf(self, term, docFrequencyPerTerm):
    if term not in docFrequencyPerTerm:
        return 0        
    return math.log( float(docFrequencyPerTerm[TOTAL])/docFrequencyPerTerm[term])

Let me describe my problem:

  1. I am doing tf-idf calculations.
  2. I traced the source of the memory leak to a defaultdict.
  3. I am using memory_mon from http://stackoverflow.com/questions/276052/how-to-get-current-cpu-and-ram-usage-in-python
  4. The memory leak occurs when the two lines in self.tf, "if term not in freqVector: return 0", are left out. (I verified this myself with memory_mon and saw a sharp rise in memory that kept on growing.)

The cause of my problem was this: since fv is a defaultdict, looking up any key that is not in fv creates an entry for it. Over a very large domain, this causes the memory leak.
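To make that behaviour concrete, here is a minimal sketch (not the original code; the keys are made up for illustration, and it is written to run on Python 3). It shows that subscripting a missing key stores a new entry, while a membership test or .get() does not:

```python
from collections import defaultdict

fv = defaultdict(int)
fv["apple"] = 3

# Membership tests and .get() never call the default factory,
# so they leave the dict untouched.
has_pear = "pear" in fv
pear_count = fv.get("pear", 0)
size_after_safe_reads = len(fv)

# Subscripting a missing key calls the default factory (int -> 0)
# and stores the result under that key.
banana_count = fv["banana"]
size_after_subscript = len(fv)

print(size_after_safe_reads, size_after_subscript)  # prints: 1 2
```

With a domain of many thousands of terms, every document vector would pick up one such entry per missing term.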

I decided to use dict instead of defaultdict, and the memory problem did go away.
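An alternative to switching types, sketched here under assumptions, is to read with .get() so that the lookup never inserts anything, whether the vector is a dict or a defaultdict. The value of TOTAL below is a hypothetical placeholder (the real one is not shown in my code above), and tf is written as a plain function rather than a method to keep the example self-contained:

```python
from collections import defaultdict

TOTAL = "__total__"  # hypothetical sentinel key; the real value is not shown

def tf(term, freq_vector):
    """Term frequency that never inserts keys, even into a defaultdict."""
    total = freq_vector.get(TOTAL, 0)
    if total == 0:
        return 0.0
    return float(freq_vector.get(term, 0)) / total

fv = defaultdict(int, {TOTAL: 4, "apple": 3, "pear": 1})
# A domain that is much larger than the vector itself.
domain = ["apple", "pear"] + ["term%d" % i for i in range(1000)]

size_before = len(fv)
scores = [tf(term, fv) for term in domain]
size_after = len(fv)  # unchanged: the 1000 missing terms were not inserted
```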

My only puzzle is: since fv is created in "for fid, fv in freqDic.iteritems():", shouldn't fv be destroyed at the end of every loop iteration? I tried putting gc.collect() at the end of the for loop, but gc was not able to collect anything (it returns 0). Yes, the hypothesis is right, but memory should stay fairly constant with every iteration if for loops do destroy all temporary variables.

This is what it looks like with those two lines in self.tf:

Memory increased by 12
Memory increased by 948
Memory increased by 28
Memory increased by 36
Memory increased by 36
Memory increased by 32
Memory increased by 28
Memory increased by 32
Memory increased by 32
Memory increased by 32
Memory increased by 40
Memory increased by 32
Memory increased by 32
Memory increased by 28

and without the two lines:

Memory increased by 1652
Memory increased by 3576
Memory increased by 4220
Memory increased by 5760
Memory increased by 7296
Memory increased by 8840
Memory increased by 10456
Memory increased by 12824
Memory increased by 13460
Memory increased by 15000
Memory increased by 17448
Memory increased by 18084
Memory increased by 19628
Memory increased by 22080
Memory increased by 22708
Memory increased by 24248
Memory increased by 26704
Memory increased by 27332
Memory increased by 28864
Memory increased by 30404
Memory increased by 32856
Memory increased by 33552
Memory increased by 35024
Memory increased by 36564
Memory increased by 39016
Memory increased by 39924
Memory increased by 42104
Memory increased by 42724
Memory increased by 44268
Memory increased by 46720
Memory increased by 47352
Memory increased by 48952
Memory increased by 50428
Memory increased by 51964
Memory increased by 53508
Memory increased by 55960
Memory increased by 56584
Memory increased by 58404
Memory increased by 59668
Memory increased by 61208
Memory increased by 62744
Memory increased by 64400

I look forward to your answer

EDIT: It appears that my terminology might have been wrong (or appear to be wrong).

  1. The memory leak I was referring to was NOT the one generated by freqVector[term] (looking up a nonexistent key in a defaultdict).
  2. The memory leak I was actually talking about comes from "for fid, fv in freqDic.iteritems()"! I know fv increased in size because of 1), but it should still be destroyed at the end of the loop! Memory shouldn't keep on expanding. Is this not a memory leak?
+2  A: 

Iterating over freqDic does not create new values; it hands you references to the values already held by the dict. This means the entries you add to fv end up in the object that freqDic still holds, even after the loop.

Another solution would be to clear freqDic after looping over it.
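A small sketch of both points, using a made-up stand-in for the dictionary in the question (the name freq_dic and its contents are assumptions for illustration, and .items() is used so it runs on Python 3):

```python
from collections import defaultdict

# Hypothetical stand-in: document ID -> term-frequency vector.
freq_dic = {"doc1": defaultdict(int, {"apple": 2})}

for fid, fv in freq_dic.items():
    same_object = fv is freq_dic[fid]   # the loop variable is just another name
    _ = fv["missing-term"]              # inserts a key into the shared object

# The loop has ended, yet the inserted key is still in the outer dict's value.
size_after_loop = len(freq_dic["doc1"])

# Clearing the outer dict drops its references. A vector becomes reclaimable
# once no other name (such as the leftover loop variable fv) refers to it.
freq_dic.clear()
size_after_clear = len(freq_dic)
```

This is also why gc.collect() reports nothing to collect: the vectors are still reachable, so they are not garbage.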

In general, Python passes everything by reference, although it sometimes appears otherwise. Strings and integers are immutable, so "changing" one does not modify the object; the name is rebound to a new object instead.
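As a rough illustration of that difference (the function names here are invented for the example), mutating a list inside a function is visible to the caller, while "incrementing" an integer only rebinds the local name:

```python
def append_item(items):
    items.append(1)      # mutates the object the caller also refers to

def increment(number):
    number += 1          # builds a new int and rebinds the local name only
    return number

shared = []
append_item(shared)      # shared is now [1]

count = 10
result = increment(count)  # count is still 10, result is 11

print(shared, count, result)
```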

ebo
Thank you. That makes sense.
disappearedng
A: 

It is not a memory leak: memory is not leaking, it is being taken up by your defaultdict, e.g.

from collections import defaultdict

d = defaultdict(int)
for i in xrange(10**7):
    a = d[i]

Do you think this is a memory leak? You are assigning values to a dict, so memory usage should increase. It is similar to this:

d = {}
for i in xrange(10**7):
    d[i] = 0

which is not a memory leak.

Anurag Uniyal
Please read my edit comment.
disappearedng
A: 

I suspect that Python's memory usage might be increasing because floating point numbers are also objects in Python, and the interpreter maintains a freelist of floats which is unbounded and immortal. Therefore, whenever a float calculation needs a new float object and none is free, Python allocates more space for the freelist and then keeps it around in case it is needed later.

See a similar discussion in the Python bug tracker here.

Tamás