Hi,

I'm working on a project where I need to create a spam database and accept submissions from users. Accepting the submissions is easy, but I was trying to figure out how to weight these submissions.

Let's say the database consists of words, and I get the following submissions:

* 137x "banana"
* 22x "apple"
* 1x "exploding mouse"

Now, there's a fairly good chance that "banana" is a spam word. "Apple" might be, but should probably go on a grey list, while "exploding mouse" is probably just a prank.
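Just to make the intent concrete, the naive version of what I'm after would be a hard cut-off on the raw counts, something like this sketch (the thresholds are picked out of thin air, purely for illustration):

```python
def classify(report_count, spam_threshold=100, grey_threshold=10):
    """Naive tiering by raw report count; thresholds are placeholders."""
    if report_count >= spam_threshold:
        return "spam"       # e.g. 137x "banana"
    if report_count >= grey_threshold:
        return "grey list"  # e.g. 22x "apple"
    return "ignore"         # e.g. 1x "exploding mouse"
```

But fixed thresholds feel arbitrary, which is why I'm asking about proper weighting.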

Anyone got any good ideas?

Cheers!

A: 

The standard method is Bayesian filtering, where you compare word frequencies in spam against word frequencies in non-spam (aka "ham"). The problem with that is that while people would be perfectly willing to forward you all their spam, they're unlikely to want to forward you their ham.
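A minimal sketch of that comparison in Python (the function name and counts are made up for illustration; it's roughly the per-word probability from Graham's "A Plan for Spam"):

```python
def spamicity(word, spam_counts, ham_counts, n_spam, n_ham):
    """Per-word spam probability: how often the word shows up in spam
    versus ham, each normalised by the size of its corpus."""
    spam_freq = spam_counts.get(word, 0) / max(n_spam, 1)
    ham_freq = ham_counts.get(word, 0) / max(n_ham, 1)
    if spam_freq + ham_freq == 0:
        return 0.5  # never seen the word: no evidence either way
    return spam_freq / (spam_freq + ham_freq)

# hypothetical counts: 137 spam reports for "banana", but "apple" also
# appears in plenty of legitimate mail, so it scores much lower
print(spamicity("apple", {"banana": 137, "apple": 22}, {"apple": 60}, 200, 300))
```

The point is that the ham side is what keeps common, harmless words from being flagged just because they appear in spam too.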

One program that does this already is called "bogofilter". There is a standard Debian package for it.

Paul Tomblin
Thanks, Paul. While I cannot get all the ham, I can collect statistics, get the total number of occurrences of anything, and compare that to the frequency in spam.

How about setting the probability to a percentage of the occurrences of the most-complained-about "word"? If "banana" has 100 reports as spam, while "apple" only has 40, I could say there's a 40% chance of "apple" being spam?

The thing is that this db will be used in different parts of the world, so "banana" would be reported as spam a lot more often than "banan" (the Scandinavian spelling). If "eple" (Scandinavian for "apple") got 40% as many hits as "banan", and "banan" got 7% as many hits as "banana", "eple" would be treated as ham almost no matter how many reports it got (since all of Scandinavia is only about the size of a medium-sized American city)...