Hi,

I am trying to create a multiprocessing version of the text categorization code I found here (amongst other cool things). I've appended the full code below.

I've tried a couple of things - I tried a lambda function first, but it complained about not being serializable (!?), so I attempted a stripped-down version of the original code:

  negids = movie_reviews.fileids('neg')
  posids = movie_reviews.fileids('pos')

  p = Pool(2)
  negfeats = []
  posfeats = []

  for f in negids:
      words = movie_reviews.words(fileids=[f])
      negfeats = p.map(featx, words)  # not same form as below - using for debugging

  print len(negfeats)

Unfortunately even this doesn't work - I get the following trace:

File "/usr/lib/python2.6/multiprocessing/pool.py", line 148, in map
    return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.6/multiprocessing/pool.py", line 422, in get
    raise self._value
ZeroDivisionError: float division

Any idea what I might be doing wrong? Should I be using pool.apply_async instead? (In and of itself that doesn't seem to solve the problem either - but perhaps I am barking up the wrong tree.)

import collections
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def evaluate_classifier(featx):
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    negcutoff = len(negfeats)*3/4
    poscutoff = len(posfeats)*3/4

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

    classifier = NaiveBayesClassifier.train(trainfeats)
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)

    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
    print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
    print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
    print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
    print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
    classifier.show_most_informative_features()
A: 

Are you trying to parallelize the classification, the training, or both? You can probably make the word counting and scoring parallel fairly easily, but I'm not sure about the feature extraction & training. For the classification, I'd recommend execnet. I've had good results using it for parallel/distributed part-of-speech tagging.

The basic idea with execnet is that you'd train a single classifier once, then send it to each execnet node. Next, divide the files up among the nodes and have each node classify the files it's given. The results are then sent back to the master node. I haven't tried pickling a classifier yet, so I don't know for sure if this will work, but if a pos tagger can be pickled, I'd assume a classifier can be too.
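
In case it helps, here's a rough, untested sketch of that pattern (the classify_parallel and remote_code names are just for illustration, and it assumes the classifier pickles cleanly):

import pickle
import execnet

# Runs on each node: receive the pickled classifier once, then
# classify every word list that arrives until a None sentinel.
remote_code = """
import pickle
classifier = pickle.loads(channel.receive())
while True:
    words = channel.receive()
    if words is None:
        break
    feats = dict((word, True) for word in words)
    channel.send(classifier.classify(feats))
"""

def classify_parallel(classifier, docs, nodes=2):
    gateways = [execnet.makegateway() for _ in range(nodes)]
    channels = [gw.remote_exec(remote_code) for gw in gateways]
    pickled = pickle.dumps(classifier)
    for ch in channels:
        ch.send(pickled)

    # Convert corpus word lists to plain lists so execnet can
    # serialize them, then round-robin them across the nodes and
    # collect the labels back in the same order.
    docs = [list(words) for words in docs]
    for i, words in enumerate(docs):
        channels[i % nodes].send(words)
    labels = [channels[i % nodes].receive() for i in range(len(docs))]

    for ch in channels:
        ch.send(None)
    for gw in gateways:
        gw.exit()
    return labels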

Jacob
I'd just started experimenting with the pickling - they are turning out to be rather hefty (100 MB-ish) though. I'll try and see if I can get multiprocessing to work somehow, else execnet seems like an alternative - I doubt the training can be parallelized (easily), but like you said, the other bits and bobs shouldn't be that difficult... hopefully. BTW, thanks for the stuff on streamhacker - it's a treasure trove!
flyingcrab
+1  A: 

Regarding your stripped down version, are you using a different featx function than the one used in http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/?

The exception most probably happens inside featx, and multiprocessing just re-raises it, though it does not include the original traceback, which makes it a bit unhelpful.
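
One way to get the real traceback back out of the worker (a sketch; traced_featx is just an illustrative wrapper around the featx from your script) is to catch the exception in the child and re-raise it with the formatted traceback as the message:

import traceback

def traced_featx(words):
    try:
        return featx(words)  # the featx from your script
    except Exception:
        # Pool.map() re-raises worker exceptions in the parent but
        # drops the worker's traceback, so embed it in the message.
        raise Exception(traceback.format_exc())

# then debug with: negfeats = p.map(traced_featx, words)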

Try running it without pool.map() first (i.e. negfeats = [featx(w) for w in words]) or include something in featx that you can debug.

If that still doesn't help, post the whole script you are working on in your original question (already simplified if possible) so others can run it and provide more directed answers. Note that the following code fragment actually works (adapting your stripped-down version):

from nltk.corpus import movie_reviews
from multiprocessing import Pool

def featx(words):
    return dict([(word, True) for word in words])

if __name__ == "__main__":
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    p = Pool(2)
    negfeats = []
    posfeats = []

    for f in negids:
        words = movie_reviews.words(fileids=[f]) 
        negfeats = p.map(featx, words)

    print len(negfeats)
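
One thing to check, though: p.map(featx, words) calls featx once per word of a single file. If what you're after is one feature dict per file (as in evaluate_classifier()), map over the file ids instead - a rough sketch (file_feats is just an illustrative name; it must be defined at module level so Pool can pickle it):

def file_feats(f):
    # one (features, label) pair per file, like the list
    # comprehension in evaluate_classifier()
    return (featx(movie_reviews.words(fileids=[f])), 'neg')

negfeats = p.map(file_feats, negids)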
Vin-G
That was the problem, I think - many thanks!
flyingcrab