I want to use multiprocessing to make my script faster... I'm still new to this. The Python docs assume you already understand threading and what-not.

So...

I have code that looks like this:

from itertools import izip
from multiprocessing import Pool

p = Pool()
# hugeseta, hugesetb, and number_crunching are defined elsewhere
for i, j in izip(hugeseta, hugesetb):
    p.apply_async(number_crunching, (i, j))

Which gives me great speed!

However, hugeseta and hugesetb are really huge. Pool keeps all of the _i_s and _j_s in memory after they've finished their job (which is basically printing output to stdout). Is there any way to `del` _i_ and _j_ after they complete?

A: 

The `del` statement deletes object references, so it can free up memory when the garbage collector runs.

from itertools import izip
from multiprocessing import Pool

p = Pool()
for i, j in izip(hugeseta, hugesetb):
    p.apply_async(number_crunching, (i, j))

del i, j  # drop the last references once the loop finishes
zdav
Where would I put `del`? I tried checking the pool for dead workers, but there are never any more workers than cores. So where are all the _i_s and _j_s being stored?
Austin
@Austin Just `del` i and j as soon as you are done with them (see the sketch after this thread).
zdav
@zdav I meant while the loop is running. If I run without Pool, memory use flattens out. If I run it with Pool, old _i_s and _j_s don't get garbage collected.
Austin
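
For reference, a sketch of the placement zdav describes, reusing the question's names (hugeseta, hugesetb, and number_crunching are assumed to be defined as in the question):

from itertools import izip
from multiprocessing import Pool

p = Pool()
for i, j in izip(hugeseta, hugesetb):
    p.apply_async(number_crunching, (i, j))
    # drop this iteration's references as soon as the task is submitted;
    # the loop rebinds i and j on the next pass anyway
    del i, j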
A: 

Not really an answer, but I used `Pool.imap()` instead:

# imap feeds the izip'd pairs to the workers and yields results in order
for i in p.imap(do, izip(Fastitr(seqsa, filetype='fastq'),
        Fastitr(seqsb, filetype='fastq'))):
    pass

Which works beautifully and garbage-collects as expected; however, it feels funny having a `for` loop with nothing but `pass` actually do something useful (see the note after this answer).

Austin
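
A standard way to consume an iterator without the bare `pass` loop is `collections.deque` with `maxlen=0`, which pulls and discards each item as soon as it arrives. A minimal, self-contained sketch (the `do` function and the inputs below are toy stand-ins, not the answer's actual code):

from collections import deque
from itertools import izip
from multiprocessing import Pool

def do(pair):
    # stand-in for the real per-pair work
    a, b = pair
    return a + b

if __name__ == '__main__':
    p = Pool()
    # maxlen=0 makes deque exhaust the iterator, discarding every
    # result immediately, so nothing accumulates in memory
    deque(p.imap(do, izip(xrange(10), xrange(10))), maxlen=0)
    p.close()
    p.join()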