My goal is to analyze a corpus (Twitter, for now) for emotional content. Just today I realized it would make more sense to search for word stems than to keep an exhaustive list of emotional words. So I've been exploring nltk.stem, only to find that there are five different stemmers. I'd like to ask the Stack Overflow linguists which of LancasterStemmer, PorterStemmer, RegexpStemmer, RSLPStemmer, or WordNetStemmer is best, preferably with some justification.

+3  A: 

RSLP is for Portuguese. I'm guessing you want English. Regexp would require you to develop your own stemming expressions, so I think that can be ignored as well. The WordNetStemmer requires that you know the part of speech for each word, so you'd have to do POS tagging first in order to use it. I've used the Porter stemming algorithm and it's pretty good, but the Lancaster algorithm is newer, so it might be better. You might want to try a combination of stemmers, where you run each stemmer on the word and keep the shortest stem. Anyway, the bottom line is that PorterStemmer is a good default choice.
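As a rough sketch of the "shortest stem" idea, the snippet below runs several stemmers on a word and keeps the shortest result. The helper names `shortest_stem`, `english_stemmers`, and `demo` are made up for illustration; only `PorterStemmer`, `LancasterStemmer`, and their `stem()` method come from NLTK. Whether the shortest stem is actually the most useful one for emotion matching is an assumption worth checking against your own data.

```python
def shortest_stem(word, stemmers):
    """Return the shortest stem produced by any of the given stemmers.

    `stemmers` is any iterable of objects with a .stem(word) method.
    On a tie, the stem from the earliest stemmer in the list wins.
    """
    candidates = [stemmer.stem(word) for stemmer in stemmers]
    return min(candidates, key=len)


def english_stemmers():
    """Build the two general-purpose English stemmers discussed above."""
    # Imported here so shortest_stem() itself has no hard NLTK dependency.
    from nltk.stem import LancasterStemmer, PorterStemmer
    return [PorterStemmer(), LancasterStemmer()]


def demo():
    """Print each stemmer's output next to the combined choice."""
    stemmers = english_stemmers()
    for word in ["happiness", "angrily", "fearful"]:
        individual = [stemmer.stem(word) for stemmer in stemmers]
        print(word, individual, shortest_stem(word, stemmers))
```

Calling `demo()` prints the Porter and Lancaster stems side by side, which is a quick way to see where the two disagree before committing to one of them.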

Jacob
A: 

I am working on a similar project and really like your answer, Jacob. But can you please tell me how I would go about tagging every word in, say, a paragraph with its POS tag? Is there an NLTK function I can call to do that, or does it have to be done some other way (and if so, how)?

Thanks

=================

Being new to Stack Overflow, I just discovered that deleting my post is not very easy. In any case, I just found a solution to this: the nltk.pos_tag(...) function.
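For anyone landing here with the same question, a minimal sketch of tagging a whole paragraph might look like the following. `tag_paragraph` and `penn_to_wordnet` are hypothetical helper names; `nltk.sent_tokenize`, `nltk.word_tokenize`, and `nltk.pos_tag` are real NLTK functions, but they need tokenizer and tagger data fetched once with `nltk.download(...)`, and the exact resource names depend on your NLTK version. The mapping to single-letter WordNet tags assumes the tagger returns Penn Treebank tags, which is its default for English.

```python
def tag_paragraph(text):
    """Return a list of (word, tag) pairs for every token in `text`."""
    # Imported here so the mapping helper below works without NLTK installed.
    import nltk

    tagged = []
    for sentence in nltk.sent_tokenize(text):
        tokens = nltk.word_tokenize(sentence)
        tagged.extend(nltk.pos_tag(tokens))
    return tagged


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter.

    The letters match nltk.corpus.wordnet.ADJ, VERB, ADV and NOUN.
    Anything unrecognized falls back to noun, which is an assumption.
    """
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also used as the default
```

The second helper is only needed if the tags are going to be passed on to a WordNet-based stemmer or lemmatizer, which expects the coarse WordNet categories rather than the full Penn tag set.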

Thanks

inspectorG4dget
+1  A: 

It may be a bit different from what you are asking, but the NodeBox Linguistics library contains an is_emotive() function, which seems to check whether a word is a recursive hyponym of certain emotional words. From commonsense.py:

    ekman = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]
    other = ["emotion", "feeling", "expression"]

Not a stemmer, but an interesting approach to check out.
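To make the idea concrete, here is a rough sketch of a recursive hyponym check. It is not NodeBox's implementation: this `is_emotive` takes a `hypernyms_of` lookup function as a parameter, and `wordnet_hypernyms` is one hypothetical way to supply it using NLTK's WordNet corpus (which has to be downloaded separately). The toy taxonomy in the comments is invented purely to show the traversal.

```python
EKMAN = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]
OTHER = ["emotion", "feeling", "expression"]


def is_emotive(word, hypernyms_of, roots=tuple(EKMAN + OTHER)):
    """True if `word` is a root word or a (recursive) hyponym of one.

    `hypernyms_of(word)` must return an iterable of more general words.
    """
    seen = set()
    stack = [word]
    while stack:
        current = stack.pop()
        if current in roots:
            return True
        if current in seen:
            continue  # guard against cycles in the lookup
        seen.add(current)
        stack.extend(hypernyms_of(current))
    return False


def wordnet_hypernyms(word):
    """Hypothetical lookup: lemma names of the noun hypernyms of `word`."""
    from nltk.corpus import wordnet  # needs the WordNet corpus data

    names = set()
    for synset in wordnet.synsets(word, pos=wordnet.NOUN):
        for parent in synset.hypernyms():
            names.update(lemma.name() for lemma in parent.lemmas())
    return names


# Toy usage with an invented taxonomy:
#   toy = {"fury": ["rage"], "rage": ["anger"], "table": ["furniture"]}
#   is_emotive("fury", lambda w: toy.get(w, []))
#   is_emotive("table", lambda w: toy.get(w, []))
```

Passing `wordnet_hypernyms` as the lookup gives a WordNet-backed version, though it follows every noun sense of a word, so ambiguous words may match more often than you would like.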

tomcat23