Is there a research paper or book I can read that would tell me which feature selection algorithm works best for a given problem?

I am simply trying to classify Twitter messages as pos/neg (to begin with). I started out with frequency-based feature selection (having started with the NLTK book), but soon realised that for similar problems various people have chosen different algorithms.

I can try frequency-based selection, mutual information, information gain and various other algorithms, but the list seems endless, and I was wondering if there is a more efficient way than trial and error.

Any advice?

+2  A: 

I did an NLP course last term, and it became pretty clear that sentiment analysis is something that nobody really knows how to do well (yet). Doing this with unsupervised learning is of course even harder.

There's quite a lot of research going on in this area, some of it commercial and thus not open to the public. I can't point you to any research papers, but the book we used for the course was this (Google Books preview). That said, the book covers a lot of material and might not be the quickest way to find a solution to this particular problem.

The only other thing I can suggest is to try searching around, maybe on scholar.google.com, for "sentiment analysis" or "opinion mining".

Have a look at the NLTK movie_reviews corpus. The reviews are already categorized as pos/neg and might help you with training your classifier, although the language you find on Twitter is probably very different from that of movie reviews.
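
A rough sketch of how that corpus could be used with frequency-based features, loosely following the document-classification example in the NLTK book. It assumes NLTK is installed and the corpus has been fetched with nltk.download('movie_reviews'). The helper names (top_words, document_features, train_on_movie_reviews), the 2000-word cutoff and the 200-document test split are illustrative choices, not anything prescribed by NLTK.

```python
import random
from collections import Counter


def top_words(documents, n):
    """Return the n most frequent lowercased words across (words, label) pairs."""
    counts = Counter(w.lower() for words, _ in documents for w in words)
    return [w for w, _ in counts.most_common(n)]


def document_features(words, vocabulary):
    """Boolean 'contains(word)' features, one per vocabulary word."""
    present = {w.lower() for w in words}
    return {f"contains({w})": (w in present) for w in vocabulary}


def train_on_movie_reviews(n_features=2000, test_size=200, seed=0):
    """Train Naive Bayes on movie_reviews; return (classifier, held-out accuracy)."""
    # Imported lazily so the two helpers above work without NLTK installed.
    import nltk
    from nltk.corpus import movie_reviews

    documents = [
        (list(movie_reviews.words(fileid)), category)
        for category in movie_reviews.categories()
        for fileid in movie_reviews.fileids(category)
    ]
    random.Random(seed).shuffle(documents)
    test_docs, train_docs = documents[:test_size], documents[test_size:]

    # Select features on the training split only, to avoid peeking at test data.
    vocabulary = top_words(train_docs, n_features)
    train_set = [(document_features(d, vocabulary), c) for d, c in train_docs]
    test_set = [(document_features(d, vocabulary), c) for d, c in test_docs]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    return classifier, nltk.classify.accuracy(classifier, test_set)
```

Calling train_on_movie_reviews() gives a baseline on review text. The same two helpers can then be pointed at a hand-labelled set of tweets to see how much the change of domain hurts.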

As a last note, please post any successes (or failures, for that matter) here. This issue is sure to come up again at some point.

Matti
Does the book have any accompanying code, or is it theory heavy?
Rahul
It's quite theory heavy and mainly focuses on the mathematical background of the methods, not on their implementation. I found it on Google Books and you can browse it there. I'll add the link to my original post.
Matti
A: 

Unfortunately, there is no silver bullet in machine learning. This is usually referred to as the "No Free Lunch" theorem. Basically, a number of algorithms will work for a given problem, and each does better on some problems and worse on others; averaged over all problems, they perform about the same. The same feature set may cause one algorithm to perform better and another to perform worse on a given data set. On a different data set, the situation could be completely reversed.

Usually what I do is pick a few feature selection algorithms that have worked for others on similar tasks and then start with those. If the performance I get using my favorite classifiers is acceptable, scrounging for another half percentage point probably isn't worth my time. But if it's not acceptable, then it's time to re-evaluate my approach, or to look for more feature selection methods.
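
To make the "pick a few and compare" step concrete, here is a minimal sketch that ranks words two ways on a toy labelled set: by document frequency, and by mutual information with the class label. The tweets are made up, and the function names and the plain maximum-likelihood MI estimate (no smoothing) are illustrative assumptions rather than a recommended recipe.

```python
import math
from collections import Counter


def frequency_ranking(docs):
    """Rank words by the number of documents they occur in (ties alphabetical)."""
    df = Counter(w for words, _ in docs for w in set(words))
    return sorted(df, key=lambda w: (-df[w], w))


def mutual_information(docs, word):
    """Mutual information, in bits, between 'word is present' and the label."""
    n = len(docs)
    joint = Counter((word in words, label) for words, label in docs)
    present = Counter(word in words for words, _ in docs)
    labels = Counter(label for _, label in docs)
    mi = 0.0
    # Only observed (presence, label) pairs contribute; unseen pairs add zero.
    for (has_word, label), count in joint.items():
        p_joint = count / n
        p_indep = (present[has_word] / n) * (labels[label] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi


def mi_ranking(docs):
    """Rank words by mutual information with the label (ties alphabetical)."""
    vocab = {w for words, _ in docs for w in words}
    return sorted(vocab, key=lambda w: (-mutual_information(docs, w), w))


# Toy, made-up data: (tokens, label)
TWEETS = [
    ("i love this phone".split(), "pos"),
    ("i love the screen".split(), "pos"),
    ("i hate this phone".split(), "neg"),
    ("i hate the battery".split(), "neg"),
]

print(frequency_ranking(TWEETS)[:3])
print(mi_ranking(TWEETS)[:3])
```

On this toy set the frequency ranking puts "i" first even though it carries no information about the label, while the mutual information ranking puts "hate" and "love" on top. In practice you would feed the top-k words from each ranking to the same classifier and compare held-out accuracy before deciding whether to look any further.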

ealdent
+3  A: 

Have you tried the book I recommended in answer to your last question? It's freely available online and entirely about the task you are dealing with: Opinion Mining and Sentiment Analysis by Pang and Lee. Chapter 4 ("Classification and Extraction") is just what you need!

ferdystschenko
I didn't realise that it is available for free. I just saw your question and found the PDF, and I think it might be interesting. I was slightly dissuaded when I saw the $99 price tag on Amazon. Thanks for your help, I am reading it now.
Rahul
You're very welcome. Btw, now that you have more than 15 points of reputation, you can do upvotes, too, hehe ... ;-)
ferdystschenko