How effective is naive Bayesian filtering for filtering spam?

I heard that spammers easily bypass them by stuffing extra non-spam-related words. What programming techniques can you use with Bayesian filters to prevent that?

A: 

A little outdated, but it contains good links on the pros and cons.

Learning
+5  A: 

Paul Graham was the guy who really introduced the idea of Bayesian spam filtering to the web at large with his original article A Plan for Spam, back in August 2002. His follow-up a year or so later covered many of the problems that swiftly arose. These are still pretty great works on the topic.

In the second article, Graham mentions using CRM114, which works on a much wider set of patterns than just space-delimited words. CRM114 is cool, but comes without much implementation help for a spam filtering system.

There are also open-source power tools for Bayesian spam filtering, such as Death2Spam and SpamProbe.

I find nothing works quite like filtering mail through a Gmail account. Happy hunting.

danieltalsky
+2  A: 

I think for defeating the kind of spam attack you mention, the important thing is not the learning method but rather what features you train on. I use Fidelis Assis's OSBF-Lua which is a very successful filter: it keeps winning contests for spam filters. It uses Bayesian learning but I think the real reason for its success is three principles:

  • It trains not on single words but on sparse bigrams: a pair of words separated by 0 to 4 "don't care" words. The spammers have to put their message in somewhere and the sparse bigrams are very good at sussing them out. It even finds attachment spam!

  • It does extra training on message headers, because these are hard for spammers to disguise. Example: a message that originates on your network and never passes through an off-network relay host is probably not spam.

  • If the spam filter has low confidence about its classification, it requests input from a human. (In practice it adds a header field saying "please train me on this message"; the human can ignore the request.) This means that as the spammers evolve new techniques, your filter evolves to match.

This combination of techniques is extremely effective.
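
To make the sparse-bigram idea concrete, here is a minimal Python sketch of the feature extraction only. It is not OSBF-Lua's actual code: the tokenizer, the function name sparse_bigrams, and the choice to record the gap size as part of each feature are all assumptions made for illustration.

```python
import re

def sparse_bigrams(text, max_gap=4):
    """Return (first, gap, second) features: word pairs separated by
    0..max_gap skipped ("don't care") words.

    Hypothetical helper for illustration; a real filter may tokenize
    differently and may or may not keep the gap in the feature.
    """
    # Assumed tokenizer: lowercase runs of letters, digits, $ and '.
    tokens = re.findall(r"[a-z0-9$']+", text.lower())
    features = []
    for i, first in enumerate(tokens):
        for gap in range(max_gap + 1):
            j = i + gap + 1
            if j >= len(tokens):
                break  # no more words to pair with
            features.append((first, gap, tokens[j]))
    return features

# The pair from the spam payload survives whatever is stuffed around it.
plain = sparse_bigrams("cheap pills")
stuffed = sparse_bigrams("lovely weather cheap pills garden picnic")
print(("cheap", 0, "pills") in plain)
print(("cheap", 0, "pills") in stuffed)
```

The point of the sketch is that stuffing innocent words around the payload adds many new pairs but cannot remove the pairs formed by the payload itself, so the classifier still has strongly spammy features to train on.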

Disclaimer: I have worked with Fidelis on refactoring some of the software so that it can be used for other purposes such as classifying regular mail into groups or possibly one day trying to detect spam in blog comments and other places.

Norman Ramsey
+2  A: 

You're right, naive Bayesian filters are susceptible to Bayesian poisoning.
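
A rough Python sketch of why word stuffing works against a single-word filter: under the naive independence assumption every token's probability gets multiplied in, so enough innocent-looking words can drag the combined score down. The per-word probabilities below are made up for illustration, not measured from any corpus, and the function name is hypothetical.

```python
import math

def combined_spam_probability(word_probs):
    """Combine per-word spam probabilities under the naive independence
    assumption: prod(p) / (prod(p) + prod(1 - p)), computed in log space."""
    log_spam = sum(math.log(p) for p in word_probs)
    log_ham = sum(math.log(1.0 - p) for p in word_probs)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

# Illustrative, made-up numbers: three spammy words, then the same
# message with four innocent-looking words stuffed in.
spam_words = [0.95, 0.90, 0.92]
stuffed = spam_words + [0.10] * 4

print(combined_spam_probability(spam_words))  # above 0.99
print(combined_spam_probability(stuffed))     # below 0.5
```

With these assumed numbers, four stuffed words are enough to push a message that scored as near-certain spam under the 0.5 mark, which is the poisoning effect in miniature.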

blizpasta
+1  A: 

I use Popfile to not only sort away spam but also sort my email into categories and I find it hugely effective. It uses naive Bayesian filters.

Jeff Martin