I'm using the BayesianClassifier class to classify spam. The problem is that compound words aren't being recognized.

For instance, if I add "led zeppelin" as a match, a sentence containing it isn't recognized as a match, even though it should be.

To add a match I'm using addMatch() of SimpleWordsDataSource.

To check for a match I'm using isMatch() of BayesianClassifier.

Any ideas on how to fix this?

Thanks in advance!

A: 

Ok, thanks for the insight. I'm attaching more source code.

SimpleWordsDataSource wds = new SimpleWordsDataSource();
BayesianClassifier classifier = new BayesianClassifier(wds);

wds.addMatch("queen");
wds.addMatch("led zeppelin");
wds.addMatch("the beatles");

classifier.isMatch("i listen to queen");        // recognized as a match
classifier.isMatch("i listen to led zeppelin"); // NOT recognized as a match
classifier.isMatch("i listen to the beatles");  // NOT recognized as a match
avandelay
A: 

Now I'm using the teachMatch method of BayesianClassifier and I get different results. A sentence containing "led zeppelin" is classified as a match, which is correct. But a sentence containing only "led" is also classified as a match, which is wrong.

Here's the relevant code:

BayesianClassifier classifier = new BayesianClassifier();
classifier.teachMatch("led zeppelin");
classifier.isMatch("I listen to led zeppelin"); // true
classifier.isMatch("I listen to led");          // true
avandelay
According to the little blurb on their examples page (http://classifier4j.sourceforge.net/usage.html), you need to teach non-matches too. So maybe teachNonMatch a few other examples and see if it still does that. My guess is that, since BayesianClassifier uses probabilities, there's a high enough probability that "led" matches "led zeppelin" to trigger. If you want exact matches only, is a BayesianClassifier what you need?
I82Much
I've added non-matches, but it doesn't change anything regarding this issue. I want "led zeppelin" to be considered as a whole; I don't want the individual words to be taken into account. If I'm not mistaken, a Bayesian classifier should be able to accept compound words as input.
avandelay
I've been taking a look at the source code, and what I want doesn't seem to be supported.
avandelay
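One possible workaround for treating "led zeppelin" as a whole is to join known phrases into a single token before the text reaches the classifier, applying the same step to both training and query strings. The sketch below is only an illustration: PhraseTokenizer is a hypothetical helper, not part of classifier4j, and whether the library's own tokenizer keeps the joined token intact is an assumption you would need to verify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, NOT part of classifier4j: joins known two-word
// phrases into one token so that "led zeppelin" is treated as a whole.
class PhraseTokenizer {
    private final Set<String> phrases; // lower-case two-word phrases

    PhraseTokenizer(Set<String> phrases) {
        this.phrases = phrases;
    }

    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        String cleaned = text.toLowerCase().trim();
        if (cleaned.isEmpty()) {
            return tokens;
        }
        String[] words = cleaned.split("\\s+");
        int i = 0;
        while (i < words.length) {
            // If this word and the next one form a known phrase, emit one joined token.
            if (i + 1 < words.length && phrases.contains(words[i] + " " + words[i + 1])) {
                tokens.add(words[i] + "_" + words[i + 1]);
                i += 2;
            } else {
                tokens.add(words[i]);
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        PhraseTokenizer t = new PhraseTokenizer(Set.of("led zeppelin", "the beatles"));
        System.out.println(t.tokenize("I listen to led zeppelin")); // [i, listen, to, led_zeppelin]
        System.out.println(t.tokenize("I listen to led"));          // [i, listen, to, led]
    }
}
```

With this pre-processing, "led" on its own and "led_zeppelin" are different tokens, so teaching the phrase no longer teaches the single word.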
A: 

(I wrote classifier4j)

You need to train it with more data.

Bayesian classifiers work by creating statistical models of what is considered a match and what isn't.

If you give it enough data, it will learn that "led" and "zeppelin" together are a match, but "led" by itself isn't.

Nick Lothian
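To see why more training data changes the result, here is a toy word-probability classifier. It is a simplified sketch and not classifier4j's actual code: the ToyBayes class, the 0.9 cutoff, the 0.5 score for unseen words, and the 0.01/0.99 clamping are all assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Toy classifier for illustration only; NOT classifier4j's implementation.
class ToyBayes {
    private static final double CUTOFF = 0.9; // assumed match threshold
    private final Map<String, Integer> matchCounts = new HashMap<>();
    private final Map<String, Integer> nonMatchCounts = new HashMap<>();

    void teachMatch(String text) { count(text, matchCounts); }

    void teachNonMatch(String text) { count(text, nonMatchCounts); }

    private static void count(String text, Map<String, Integer> counts) {
        for (String w : text.toLowerCase().trim().split("\\s+")) {
            counts.merge(w, 1, Integer::sum);
        }
    }

    // Share of a word's occurrences that came from matches, clamped away from 0 and 1.
    private double wordProbability(String word) {
        int m = matchCounts.getOrDefault(word, 0);
        int n = nonMatchCounts.getOrDefault(word, 0);
        if (m + n == 0) {
            return 0.5; // unseen words are neutral
        }
        double p = (double) m / (m + n);
        return Math.min(0.99, Math.max(0.01, p));
    }

    // Combine the per-word probabilities into one score between 0 and 1.
    double score(String text) {
        double pMatch = 1.0;
        double pNonMatch = 1.0;
        for (String w : text.toLowerCase().trim().split("\\s+")) {
            double p = wordProbability(w);
            pMatch *= p;
            pNonMatch *= 1.0 - p;
        }
        return pMatch / (pMatch + pNonMatch);
    }

    boolean isMatch(String text) { return score(text) >= CUTOFF; }

    public static void main(String[] args) {
        ToyBayes c = new ToyBayes();
        c.teachMatch("led zeppelin");
        // "led" has only ever been seen in a match, so it triggers on its own.
        System.out.println(c.isMatch("I listen to led"));          // true

        c.teachMatch("led zeppelin live");
        c.teachNonMatch("led light bulb");
        c.teachNonMatch("cheap led lamp");
        // "led" is now seen equally in matches and non-matches; "zeppelin" still decides.
        System.out.println(c.isMatch("I listen to led zeppelin")); // true
        System.out.println(c.isMatch("I listen to led"));          // false
    }
}
```

In this sketch, non-matches only help if they contain the ambiguous word itself: teaching non-matches that never mention "led" leaves its probability unchanged.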