Hi, I have a process that iterates over String instances. Each iteration performs a few operations on the String, and at the end the String is persisted.

Now I want to add a check to each iteration that flags the String if it might be spam. I only need to verify that the String is not "adult material" spam.

Any recommendations?

+4  A: 

This is a very hard problem that the industry is constantly trying to solve. Your best option is to use an existing solution such as Classifier4J, combined with a blacklist data source, to identify spam.

Andrew Hare
A: 

The easiest way is simply to check against a list of known spam words. The problem is that it's easy to get false positives from words that mean different things in different contexts. You either need to hand-pick the word list and include only words that have no legitimate use, or opt for a more heavyweight solution.
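
As a rough illustration of the word-list approach, here is a minimal sketch. The class name `SpamWordFilter` and its methods are hypothetical (they do not come from any library mentioned in this thread), and the word list is whatever you decide to pass in. It matches whole words only, which avoids one obvious class of false positives (a blocked word appearing inside a longer, harmless word), but it does nothing about words that are legitimate in some contexts.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of any existing library.
final class SpamWordFilter {
    // A "word" here is a run of letters, so matching is done on whole words.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    private final Set<String> blockedWords = new HashSet<>();

    SpamWordFilter(Collection<String> words) {
        // Normalise the list to lower case once, up front.
        for (String word : words) {
            blockedWords.add(word.toLowerCase(Locale.ROOT));
        }
    }

    // Returns true if any whole word in the text is on the blocked list.
    boolean containsBlockedWord(String text) {
        Matcher matcher = WORD.matcher(text.toLowerCase(Locale.ROOT));
        while (matcher.find()) {
            if (blockedWords.contains(matcher.group())) {
                return true;
            }
        }
        return false;
    }
}
```

In the processing loop you would build one filter up front and call `containsBlockedWord(s)` on each String before persisting it.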

Draemon
+1  A: 

You need to apply some Bayesian logic, which is, among other things, what Classifier4J (which Andrew mentioned) does under the covers.

Paul Graham wrote a good article about this a few years back: "A Plan for Spam", http://www.paulgraham.com/spam.html
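
To give a feel for what "Bayesian logic" means here, this is a minimal sketch of a two-class naive Bayes text classifier with add-one (Laplace) smoothing. It is an assumption-laden toy: `TinyBayesSpamFilter` is a hypothetical name, this is not Classifier4J's API, and it is not the exact formula from Graham's article. You train it on examples you have already labelled as spam or not spam, then ask it for a spam probability.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; not Classifier4J's API and not Graham's exact formula.
final class TinyBayesSpamFilter {
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int spamTokens, hamTokens, spamDocs, hamDocs;

    // Lower-case the text and split on anything that is not a letter or digit.
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    // Record the tokens of one labelled example.
    void train(String text, boolean isSpam) {
        if (isSpam) spamDocs++; else hamDocs++;
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            vocabulary.add(token);
            if (isSpam) {
                spamCounts.merge(token, 1, Integer::sum);
                spamTokens++;
            } else {
                hamCounts.merge(token, 1, Integer::sum);
                hamTokens++;
            }
        }
    }

    // Estimated probability, between 0 and 1, that the text is spam.
    double spamProbability(String text) {
        if (spamDocs == 0 || hamDocs == 0) {
            throw new IllegalStateException("train on both spam and non-spam first");
        }
        // Work in log space to avoid underflow on long texts.
        double logSpam = Math.log((double) spamDocs / (spamDocs + hamDocs));
        double logHam = Math.log((double) hamDocs / (spamDocs + hamDocs));
        int v = vocabulary.size();
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            // Add-one smoothing keeps unseen tokens from zeroing the product.
            logSpam += Math.log((spamCounts.getOrDefault(token, 0) + 1.0) / (spamTokens + v));
            logHam += Math.log((hamCounts.getOrDefault(token, 0) + 1.0) / (hamTokens + v));
        }
        // Normalise the two class scores into a probability for "spam".
        return 1.0 / (1.0 + Math.exp(logHam - logSpam));
    }
}
```

A threshold on `spamProbability` (the right value depends on your data and how costly false positives are) then decides whether a String is rejected. A real deployment needs far more training data than a toy like this suggests, which is a good reason to prefer an existing library.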

Nick Holt
+1  A: 

You could try writing your own classifier, but if you have guaranteed network access, how about just using Akismet and its Java bindings? It's pretty good at finding spam.

You'll need to take network connectivity and licensing into account.

GaryF