Hi,

I am trying to create a multiprocessing version of the text categorization code I found here (amongst other cool things). I've appended the full code below.

I've tried a couple of things - I tried a lambda function first, but it complained about not being serializable (!?), so I attempted a stripped-down version of the original code:

  negids = movie_reviews.fileids('neg')
  posids = movie_reviews.fileids('pos')

  p = Pool(2)
  negfeats = []
  posfeats = []

  for f in negids:
      words = movie_reviews.words(fileids=[f])
      negfeats = p.map(featx, words)  # not same form as below - using for debugging

  print len(negfeats)

Unfortunately even this doesn't work - I get the following trace:

File "/usr/lib/python2.6/multiprocessing/pool.py", line 148, in map
    return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.6/multiprocessing/pool.py", line 422, in get
    raise self._value
ZeroDivisionError: float division

Any idea what I might be doing wrong? Should I be using pool.apply_async instead? (In and of itself that doesn't seem to solve the problem either - but perhaps I am barking up the wrong tree.)

import collections
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def evaluate_classifier(featx):
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    negcutoff = len(negfeats)*3/4
    poscutoff = len(posfeats)*3/4

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

    classifier = NaiveBayesClassifier.train(trainfeats)
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)

    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
    print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
    print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
    print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
    print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
    classifier.show_most_informative_features()
A: 

Are you trying to parallelize the classification, the training, or both? You can probably make the word counting and scoring parallel fairly easily, but I'm not sure about the feature extraction & training. For the classification, I'd recommend execnet. I've had good results using it for parallel/distributed part-of-speech tagging.

The basic idea with execnet is that you'd train a single classifier once, then send it to each execnet node. Next, divide the files up among the nodes and have each node classify the files it's given. The results are then sent back to the master node. I haven't tried pickling a classifier yet, so I don't know for sure if this will work, but if a pos tagger can be pickled, I'd assume a classifier can be too.
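
In case it helps, here's a rough, untested sketch of that pattern (the classify_parallel and remote_code names are just for illustration, and it assumes the classifier pickles cleanly):

import pickle
import execnet

# Runs on each node: receive the pickled classifier once, then
# classify every word list that arrives until a None sentinel.
remote_code = """
import pickle
classifier = pickle.loads(channel.receive())
while True:
    words = channel.receive()
    if words is None:
        break
    feats = dict((word, True) for word in words)
    channel.send(classifier.classify(feats))
"""

def classify_parallel(classifier, docs, nodes=2):
    gateways = [execnet.makegateway() for _ in range(nodes)]
    channels = [gw.remote_exec(remote_code) for gw in gateways]
    pickled = pickle.dumps(classifier)
    for ch in channels:
        ch.send(pickled)

    # Convert corpus word lists to plain lists so execnet can
    # serialize them, then round-robin them across the nodes and
    # collect the labels back in the same order.
    docs = [list(words) for words in docs]
    for i, words in enumerate(docs):
        channels[i % nodes].send(words)
    labels = [channels[i % nodes].receive() for i in range(len(docs))]

    for ch in channels:
        ch.send(None)
    for gw in gateways:
        gw.exit()
    return labels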

Jacob
I'd just started experimenting with the pickling - they are turning out to be rather hefty (100 MB-ish) though. I'll try and see if I can get multiprocessing to work somehow, else execnet seems like an alternative - I doubt the training can be parallelized (easily), but like you said, the other bits and bobs shouldn't be that difficult... hopefully. BTW, thanks for the stuff on streamhacker - it's a treasure trove!
flyingcrab
+1  A: 

Regarding your stripped down version, are you using a different featx function than the one used in http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/?

The exception most probably happens inside featx, and multiprocessing just re-raises it, though it does not include the original traceback, which makes it a bit unhelpful.
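
One way to get the real traceback back out of the worker (a sketch; traced_featx is just an illustrative wrapper around the featx from your script) is to catch the exception in the child and re-raise it with the formatted traceback as the message:

import traceback

def traced_featx(words):
    try:
        return featx(words)  # the featx from your script
    except Exception:
        # Pool.map() re-raises worker exceptions in the parent but
        # drops the worker's traceback, so embed it in the message.
        raise Exception(traceback.format_exc())

# then debug with: negfeats = p.map(traced_featx, words)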

Try running it without pool.map() first (i.e. negfeats = [featx(w) for w in words]) or include something in featx that you can debug.

If that still doesn't help, post the whole script you are working on in your original question (already simplified if possible) so others can run it and provide more directed answers. Note that the following code fragment actually works (adapting your stripped-down version):

from nltk.corpus import movie_reviews
from multiprocessing import Pool

def featx(words):
    return dict([(word, True) for word in words])

if __name__ == "__main__":
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    p = Pool(2)
    negfeats = []
    posfeats = []

    for f in negids:
        words = movie_reviews.words(fileids=[f]) 
        negfeats = p.map(featx, words)

    print len(negfeats)
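
One thing to check, though: p.map(featx, words) calls featx once per word of a single file. If what you're after is one feature dict per file (as in evaluate_classifier()), map over the file ids instead - a rough sketch (file_feats is just an illustrative name; it must be defined at module level so Pool can pickle it):

def file_feats(f):
    # one (features, label) pair per file, like the list
    # comprehension in evaluate_classifier()
    return (featx(movie_reviews.words(fileids=[f])), 'neg')

negfeats = p.map(file_feats, negids)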
Vin-G
That was the problem, I think - many thanks!
flyingcrab