Can you show me a simple example using http://www.nltk.org/code to determine whether a string expresses a happy or upset mood?

A: 

Nopey.

This is a task far beyond the capabilities of NLTK or any grammatical parser that is known or can be realistically imagined. Look at the NLTK Book to see what sorts of tasks it can accomplish, which are far, far from your stated purpose.

As a cheap example:

I really enjoyed using your paper to train my dog.

Run that through NLTK's part-of-speech tagger and you can get

[('I', 'PRP'), ('really', 'RB'), ('enjoyed', 'VBD'), 
 ('using', 'VBG'), ('your', 'PRP$'), ('paper', 'NN'), 
 ('to', 'TO'), ('train', 'VB'), ('my', 'PRP$'), ('dog', 'NN')]

A parse tree would tell me that 'enjoyed' is the central (past-tense) verb of the simple sentence. To enjoy something is good. To train something is generally a good thing. Gerunds, nouns, comparatives, and such are relatively neutral. So give this a Good score of 0.90.

Except I really mean that I either hit my dog with your paper or let it excrete on the paper, which you'd probably consider a not Good thing.
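The naive scoring described above can be sketched in a few lines. This is only an illustration of why the approach misfires: the `POLARITY` lexicon and the `naive_mood_score` function are hypothetical names invented here, not part of NLTK, and the tagged tokens are copied from the tagger output shown earlier.

```python
# Tagged tokens copied from the example sentence above.
tagged = [('I', 'PRP'), ('really', 'RB'), ('enjoyed', 'VBD'),
          ('using', 'VBG'), ('your', 'PRP$'), ('paper', 'NN'),
          ('to', 'TO'), ('train', 'VB'), ('my', 'PRP$'), ('dog', 'NN')]

# Hypothetical hand-made polarity lexicon (an assumption, not an NLTK resource).
POLARITY = {'enjoyed': 1.0, 'train': 1.0, 'hate': -1.0, 'annoying': -1.0}

def naive_mood_score(tagged_tokens, lexicon=POLARITY):
    """Average the polarity of verbs and adjectives found in the lexicon.

    Returns a value in [-1, 1], or 0.0 when no opinion word is found.
    """
    scores = [lexicon[word.lower()]
              for word, tag in tagged_tokens
              if tag.startswith(('VB', 'JJ')) and word.lower() in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

print(naive_mood_score(tagged))  # 1.0
```

The sentence scores as entirely positive because the lexicon sees only 'enjoyed' and 'train'; nothing in the tags carries the implied insult.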

Hire a person for this recognition task.

Added for those who imagine that even trained classifiers are of much use:

Classify this real entry from a real customer review corpus using any classifier you like trained on any dataset you like:

This camera keeps on autofocussing in auto mode with a buzzing sound which can't be stopped. It would be really good if they have given an option to stop this autofocussing. If you want to have the date and time on the image, it's only through their software which reads the image's date and time from the image's meta-data. So if you use your card reader and copy images - you got to once again open them through their software to put the date and time. In that too, there isn't a direct way to add date and time - you got to say 'print images' to a different directory in which there is an option to specify the date and time . Even the slightest of the shakes totally distorts your image. Indoor images weren't so clear. You got to have flash 'on' to get it even though your room is well lit. The lens cap is a really annoying. the movie clips taken will always have some 'noise' in it - you can't avoid that.

The worst mood classification I obtained was "totally equivocal", yet humans can easily determine that this is anything but complimentary. This wasn't a randomly picked datum, but rather one that was selected for negative bias without "hate" or "suxz" or similar.

msw
see also http://en.wikipedia.org/wiki/Sentiment_analysis
msw
I wouldn't say this is beyond NLTK. My first thought was sentiment analysis, which you linked to. Given a sizable training corpus, you could train a classifier to give you a decent approximation of "mood".
Chris S
@Chris S: But that is not what the question asked; it asked for a simple example, of which there are none. Even classifiers fall down on real textual input such as the various corpora linked to by Wikipedia. For a domain where simple declaratives are still troublesome, coping with nuance, sarcasm, implication, and damning with faint praise is **really** hard.
msw
You may be overcomplicating the question. He's not asking about identifying nuance, sarcasm, etc. He's only interested in the vague boolean labels of "happy" or "upset", which he can easily define by manually tagging a sample of sentences. I agree, this might not be "trivial", but I wouldn't call it impossible either.
Chris S
You seem to be talking theoretically, and the difference between theory and practice is smaller in theory than it is in practice. If you doubt it, get a real corpus and try it yourself; if you reliably achieve better than random chance with novel inputs, write it up and get a PhD.
msw
@msw: I happen to be getting my PhD in this. It **is** doable, though we only get about 80% to 90% accuracy. Among the people who have tried this task (or maybe this is merely a related task -- it's hard to tell from the brevity of the question) are http://research.yahoo.com/pub/2387 and http://lingcog.iit.edu/doc/appraisal_sentiment_cikm.pdf
Ken Bloom
Okay cool, seeing as you already have a camera customer review classifier, how does it rank the entry above? Sure, I have no doubt that by using highly specific, extremely well-structured lexicons you can do somewhat better than naive classifiers at a given task. However, "80-90% accuracy" is marketing-speak which quite crudely and most unscientifically overstates the more tepid claims in your literature. Even if I stipulate your over-broad claim of "accuracy" of 80%, that still means that the method fails 1 time in 5. I'm not sure "doable" means what you want it to mean here.
msw
@Ken: good luck with your research and degree; I mean that sincerely. It is certainly an exciting field to be working in.
msw
A: 

NLTK cannot do this out of the box, but if you are looking for related research in that area, take a look at this paper on Offensive Language Detection. The same methods could be adapted to classify comments not as offensive/inoffensive but as happy/unhappy. The primary software package used in that project for text classification is called WEKA; it uses multiple classifiers, trained on previous examples, to determine whether language is offensive or not (with a tunable threshold).
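As a rough illustration of the "multiple classifiers plus a tunable threshold" idea, here is a minimal sketch. It assumes each classifier emits a score between 0 and 1 and that the scores are simply averaged; how the cited project actually combines its classifiers is not stated here, and `exceeds_threshold` is a hypothetical helper, not a WEKA or NLTK function.

```python
def exceeds_threshold(scores, threshold=0.5):
    """Average per-classifier scores in [0, 1] and compare to a threshold.

    Averaging is an assumption made for this sketch; other combination
    rules (majority vote, weighted vote) would work the same way.
    """
    if not scores:
        raise ValueError("need at least one classifier score")
    return sum(scores) / len(scores) >= threshold

votes = [0.9, 0.6, 0.3]  # hypothetical scores from three classifiers, mean 0.6
print(exceeds_threshold(votes, threshold=0.5))  # True
print(exceeds_threshold(votes, threshold=0.7))  # False
```

Raising the threshold trades missed detections for fewer false alarms, which is what makes it worth tuning per application.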

Chris
A: 

You're looking for a technique that uses a machine learning classifier to determine whether a piece of text is positive or negative. There have been various attempts at this by a number of research teams (e.g. http://research.yahoo.com/pub/2387 and http://lingcog.iit.edu/doc/appraisal_sentiment_cikm.pdf); we can get about 80% to 90% accuracy at determining whether a product review is positive or negative.

Given the brevity of your question, it's not obvious to me whether classifying a product review as positive or negative is the task you're trying to accomplish or merely a related one, but I'd suggest starting simple, with bag-of-words classification using a Bayesian classifier (which NLTK should be able to handle), and then improving your techniques from there depending on how the accuracy turns out.

Unfortunately, I've never used NLTK (nor Python for that matter) so I can't give you a code example of how to use NLTK for this.
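For readers who want a starting point anyway, the bag-of-words suggestion above might look roughly like the sketch below. It is deliberately dependency-free rather than NLTK code (NLTK ships its own `nltk.NaiveBayesClassifier`), so the mechanics stay visible. The class name, the whitespace tokenizer, and the six training sentences are all made up for illustration; a real experiment would need a large hand-tagged corpus and held-out test data.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase whitespace split; real tokenization would be more careful."""
    return text.lower().split()

class BagOfWordsNaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, labeled_texts):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of texts
        self.vocab = set()
        for text, label in labeled_texts:
            words = tokenize(text)
            self.word_counts[label].update(words)
            self.label_counts[label] += 1
            self.vocab.update(words)
        return self

    def classify(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_logp = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            logp = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # skip words never seen in training
                    logp += math.log((counts[word] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

# Hypothetical hand-tagged training sentences (invented for this sketch).
train = [
    ("i love this camera", "happy"),
    ("great pictures and great battery", "happy"),
    ("really happy with the lens", "happy"),
    ("i hate the buzzing sound", "upset"),
    ("the lens cap is annoying", "upset"),
    ("blurry pictures and terrible flash", "upset"),
]

clf = BagOfWordsNaiveBayes().fit(train)
print(clf.classify("i love the pictures"))                  # happy
print(clf.classify("annoying buzzing and terrible flash"))  # upset
```

A toy like this only handles sentences whose vocabulary overlaps the training data; it says nothing about how the method fares on sarcasm or on reviews that complain without using overtly negative words.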

Ken Bloom
The NLTK "Natural Language Processing" book includes an example of classifying text as to whether it is positive or not. The OP's question and application might be too subtle for the algorithms discussed and demonstrated, but it would be a start.
winwaed
