Hi,

I'm doing a project for a college class I'm taking.

I'm using PHP to build a simple web app that classifies tweets as "positive" (happy) or "negative" (sad) based on a set of dictionaries. The algorithms I'm considering right now are a Naive Bayes classifier or a decision tree.

However, I can't find any PHP library that helps me do some serious language processing. Python has NLTK (http://www.nltk.org). Is there anything like that for PHP?

I'm planning to use WEKA as the back end of the web app (by calling Weka from the command line within PHP), but that doesn't seem very efficient.

Do you have any idea what I should use for this project? Or should I just switch to Python?

Thanks

+2  A: 

If you're going to use a Naive Bayes classifier, you don't really need a whole ton of NL processing. All you'll need is an algorithm to stem the words in the tweets and, if you want, to remove stop words.

Stemming algorithms abound and aren't difficult to code. Removing stop words is just a matter of looking each word up in a hash map or something similar. I don't see a justification for switching your development platform to accommodate the NLTK, although it is a very nice tool.
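
To give a rough idea of how little is involved, here is a minimal sketch of both steps. It is written in Python only for brevity; the same logic ports directly to PHP arrays and string functions. The stop-word list, the suffix list, and the function names `crude_stem` and `preprocess` are all made up for illustration, and the suffix stripper is a deliberately crude stand-in for a real stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "am", "i", "to", "of", "and", "this", "so"}

# Suffixes stripped by the crude stemmer below, checked in this order.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tweet):
    """Lowercase, tokenize, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    # Set membership is the "hash map lookup" mentioned above.
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```

Calling `preprocess("I am loving this sunny day")` leaves three tokens, with "loving" cut down to "lov". A proper stemmer would handle cases like that more gracefully, but for a bag-of-words classifier even this much goes a long way.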

San Jacinto
A: 

Take a look at this suggestion.

nuqqsa
There is no indication whatsoever in either your post or the one you linked to as to why this is a fitting solution.
San Jacinto
PEAR's Text_LanguageDetect can identify 52 human languages from text samples and return confidence scores for each. Isn't this an interesting option to take into account?
nuqqsa
+1  A: 

Take a look at this article on Bayesian opinion mining on php/ir (http://phpir.com/bayesian-opinion-mining). The site is well worth bookmarking.
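
For a sense of what a Bayesian classifier of this kind involves, here is a minimal sketch of a multinomial Naive Bayes with add-one smoothing. It is not the linked article's code, and it is in Python rather than PHP purely to keep it short; the class name, method names, and whitespace tokenization are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Toy multinomial Naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training texts
        self.vocab = set()                       # every word seen in training

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # Log prior: share of training texts carrying this label.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                # Add-one (Laplace) smoothing so unseen words do not zero out a label.
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Usage is just a series of `train(text, label)` calls on labelled tweets followed by `classify(text)` on new ones. Working in log space avoids underflow when many small probabilities are multiplied together.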

Mark Baker
A: 

You can also use the uClassify API to do something similar to Naive Bayes. You train a classifier as you would with any algorithm, except that here you do it through the web interface or by sending XML documents to the API. Then, whenever you get a new tweet (or batch of tweets), you call the API to have it classify them. It's fast and you don't have to worry about tuning it. Of course, that means you lose the flexibility you get by controlling the classifier yourself, but it also means less work for you if building the classifier is not itself the goal of the class project.

ealdent