I am looking to apply scores (positive, negative or neutral) to short phrases of text. Short of parsing out emoticons and making assumptions based on their usage, I'm unsure of what else to try. Can anyone provide examples, research papers, articles, etc. that take a more lexical approach to this problem?

I am thinking that things like adverb usage, punctuation misuse/repetition, and spelling/grammar errors could all be decent indicators of the author's mood in an almost binary sense (good or bad).

A: 

That sounds like a really interesting idea - I'd be interested to see what comes from it.

I'd say that punctuation is one indicator you could use...

  • ? - A question
  • !?!? (or some variant) - Disbelief
  • ! with words like stupid, idiotic, etc. - Anger
  • ... - Hesitation, sarcasm
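
The punctuation cues listed above could be turned into simple binary features along these lines. This is only a rough sketch: the `ANGER_WORDS` set, the feature names, and the regular expression are illustrative assumptions, not a tested lexicon.

```python
import re

# Hypothetical word list for illustration; a real one would be much larger.
ANGER_WORDS = {"stupid", "idiotic"}

def punctuation_features(text):
    """Map a short phrase to binary cues based on its punctuation."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {
        # Any question mark at all.
        "question": "?" in text,
        # "!" and "?" directly next to each other, e.g. "!?!?" or "?!".
        "disbelief": bool(re.search(r"!\?|\?!", text)),
        # An exclamation mark together with an angry word.
        "anger": "!" in text and bool(tokens & ANGER_WORDS),
        # Three dots in a row.
        "hesitation": "..." in text,
    }
```

For example, `punctuation_features("That is stupid!")` sets only the `anger` cue. Cues like these are weak on their own (an ellipsis can mean hesitation or sarcasm), so they are better treated as inputs to a classifier than as final labels.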

You could also try to pick up on common acronyms like...

  • LOL - Laughing (positive)
  • WTF, OMG - Disbelief, Shock
  • IMO - Thinking, explaining
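
A lookup table is probably the simplest way to pick up acronyms like these. The sketch below assumes a hand-built dictionary (`ACRONYM_CUES`, a hypothetical name) holding only the examples from the list above.

```python
import re

# Illustrative mapping taken from the list above; extend as needed.
ACRONYM_CUES = {
    "lol": "positive",
    "wtf": "shock",
    "omg": "shock",
    "imo": "explaining",
}

def acronym_cues(text):
    """Return the cue for each known acronym, in order of appearance."""
    # Tokenise on letters only so "LOL," and "lol" both match,
    # while a longer word such as "lollipop" does not.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [ACRONYM_CUES[t] for t in tokens if t in ACRONYM_CUES]
```

Calling `acronym_cues("LOL, imo that was great")` returns `["positive", "explaining"]`.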

This is clearly a pretty complex thing you're looking to do, but it sounds very interesting.

Hugoware
+1  A: 

Well, latent semantic analysis (have a paper too) seems like the nearest well-established field of inquiry to what you're talking about. It's less 'value-oriented' and more focused on larger documents, but still may have some relevance to your problem.

chaos
+1  A: 

This sounds like a pretty clear binary classification task: simplify the problem to positive versus negative, and then label as neutral any decision that is highly entropic, i.e. where the probability mass on the winning class doesn't reach a threshold of certainty.
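
As a minimal sketch of that idea, assuming the classifier outputs a probability that the phrase is positive (the 0.7 threshold is an arbitrary illustrative value that would need tuning on real data):

```python
import math

def binary_entropy(p):
    """Entropy in bits of a two-class distribution (p, 1 - p)."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def label_with_neutral(p_positive, threshold=0.7):
    """Turn a binary classifier's probability into a three-way label.

    If neither class reaches `threshold`, the decision is too uncertain
    (too entropic), so the phrase is labelled neutral.
    """
    if not 0.0 <= p_positive <= 1.0:
        raise ValueError("p_positive must be a probability")
    if p_positive >= threshold:
        return "positive"
    if 1.0 - p_positive >= threshold:
        return "negative"
    return "neutral"
```

Thresholding on the winning class's probability and thresholding on entropy are equivalent in the two-class case, since entropy peaks at 0.5 and falls as either class gains mass.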

Your biggest hurdle will be getting training data for a stochastic machine learning method. The classification itself is easy to do with a readily available maximum entropy model such as the Toolkit for Advanced Discriminative Modeling or Mallet. The features you described would just have to be formatted as the inputs these models expect.
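
To make the modelling step concrete without depending on either toolkit, here is a small pure-Python sketch of binary logistic regression, which is the two-class case of a maximum entropy model. It is not the TADM or Mallet API; the function names, the feature-dictionary format, and the hyperparameters are all assumptions for illustration, and there is no regularisation.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(weights, bias, feats):
    """Linear score for a sparse feature dict {name: value}."""
    return bias + sum(weights.get(f, 0.0) * v for f, v in feats.items())

def train_maxent(examples, epochs=200, lr=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood.

    `examples` is a list of (feature_dict, label) pairs, where label is
    1 for positive and 0 for negative.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for feats, label in examples:
            # Gradient of the log-likelihood is (label - prediction) * value.
            error = label - sigmoid(score(weights, bias, feats))
            bias += lr * error
            for f, v in feats.items():
                weights[f] = weights.get(f, 0.0) + lr * error * v
    return weights, bias

def predict_positive(weights, bias, feats):
    """Probability that a phrase with these features is positive."""
    return sigmoid(score(weights, bias, feats))
```

Each phrase is represented as a dictionary of feature names to values (for example `{"lol": 1.0, "exclaim": 1.0}`), which is roughly the kind of sparse input that maximum entropy toolkits work with, though each has its own file format.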

In order to get training data, you can either do some kind of paid crowdsourcing like Amazon's Mechanical Turk or just label it yourself, maybe with the help of a friend. You'll need a lot of data for this. You can improve the predictive strength of your model despite a dearth of data with approaches like active learning, ensembling, or boosting, but it's important to test these against real-world data as best as you can and pick what works best in a practical application.
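
Of those approaches, active learning is the easiest to sketch. One common variant is uncertainty sampling: ask a human to label the phrases the current model is least sure about. The helper below is a hypothetical illustration that assumes you already have some function returning the probability that a phrase is positive.

```python
def most_uncertain(unlabeled, predict_positive, k=2):
    """Pick the k items whose predicted probability is closest to 0.5.

    `unlabeled` is a list of items and `predict_positive` is any callable
    mapping an item to the model's probability of the positive class.
    """
    return sorted(unlabeled, key=lambda item: abs(predict_positive(item) - 0.5))[:k]
```

After labelling the selected phrases, you would retrain and repeat, so that each round of manual labelling goes where the model is weakest.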

If you're looking for papers on this, search for the term 'sentiment analysis' in Google Scholar. The Association for Computational Linguistics has a lot of free and useful papers from conferences and journals which address the problem from a linguistic as well as an algorithmic standpoint, so I'd also browse their archives. Good luck!

Robert Elwell