I recently wrote a Bayesian spam filter, using Paul Graham's article "A Plan for Spam" and a C# implementation of it that I found on CodeProject as references.

I just noticed that the CodeProject implementation uses the total number of unique tokens when calculating the probability of a token being spam (e.g. if the ham corpus contains 10000 tokens in total but only 1500 unique tokens, then 1500 is used as ngood), whereas in my implementation I used the number of posts, as mentioned in Paul Graham's article. This makes me wonder which of the following should be used in calculating the probability (see the sketch after the list for where this count plugs into the formula):

  1. Post count (as mentioned in Paul Graham's article)
  2. Total unique token count (as used in the implementation on CodeProject)
  3. Total token count
  4. Total included token count (i.e. those tokens with b + g >= 5)
  5. Total unique included token count
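
For reference, here is a minimal Python sketch of Graham's token-probability formula, roughly as given in "A Plan for Spam", showing where the disputed counts (ngood/nbad) plug in. The names and structure are illustrative, not taken from either implementation above:

def token_spam_prob(b, g, nbad, ngood):
    # b, g: occurrences of the token in the spam (bad) and ham (good) corpora
    # nbad, ngood: the disputed denominators -- post counts in Graham's article,
    #              unique or total token counts in the other options listed above
    g = 2 * g                        # Graham doubles the ham count to bias against false positives
    if b + g < 5:                    # ignore tokens that have hardly been seen
        return None
    spam_ratio = min(1.0, b / nbad)
    ham_ratio = min(1.0, g / ngood)
    return max(0.01, min(0.99, spam_ratio / (spam_ratio + ham_ratio)))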
A: 

Can you alter your code to use the other methods? Then you could test with a different data set, and post the results.

Simeon Pilgrim
Actually, I don't have a big enough corpus of ham and spam, so it's hard to test this properly. I'm using #3 for now, as it seems to make sense to me (it also makes updating the corpus easier than using the post count).
Waleed Eissa
You probably don't need a large corpus to train your filter on. Check out http://entrian.com/sbwiki/TrainingIdeas for a good outline of what SpamBayes developers have found to be effective.
ScottS
A: 

You may want to look at POPFile, a time-tested Perl implementation. It does a very good job. I'm pretty sure it is open source, so you could see what formula they use.

Jeff Martin
+2  A: 

This EACL paper by Karl-Michael Schneider (PDF) says you should use the multinomial model, meaning the total token count, for calculating the probability. Please see the paper for the exact calculations.
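
(For illustration only, not taken from the paper: a minimal Python sketch of a multinomial model, where each class's token probabilities are normalised by that class's total token count. The Laplace smoothing and variable names here are assumptions.)

import math

def multinomial_spam_score(tokens, spam_counts, ham_counts,
                           spam_total, ham_total, vocab_size, prior_spam=0.5):
    # Multinomial model: the denominator is the TOTAL token count of each corpus
    # (plus vocab_size for Laplace smoothing), not the post count.
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for tok in tokens:
        log_spam += math.log((spam_counts.get(tok, 0) + 1) / (spam_total + vocab_size))
        log_ham += math.log((ham_counts.get(tok, 0) + 1) / (ham_total + vocab_size))
    # For very long messages, compare the log scores directly to avoid overflow.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))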

Yuval F
A: 

In general, most filters have moved past the algorithms outlined in Graham's paper. My suggestion would be to get the SpamBayes source and read the comments in spambayes/classifier.py (particularly) and spambayes/tokenizer.py (especially at the top). There's a lot of history there about the early experiments that were done to evaluate decisions like this.

FWIW, in the current SpamBayes code, the probability is calculated thusly (spamcount and hamcount are the number of messages in which the token has been seen (any number of times), and nham and nspam are the total number of messages):

hamratio = hamcount / nham                 # fraction of ham messages containing the token
spamratio = spamcount / nspam              # fraction of spam messages containing the token
prob = spamratio / (hamratio + spamratio)  # raw per-token spam probability
S = options["Classifier", "unknown_word_strength"]
StimesX = S * options["Classifier", "unknown_word_prob"]
n = hamcount + spamcount
prob = (StimesX + n * prob) / (S + n)      # smooth toward the unknown-word prior when n is small

unknown_word_strength is (by default) 0.45, and unknown_word_prob is (by default) 0.5.
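
To make the effect concrete, here is the same calculation as a small standalone function with those defaults plugged in (illustrative only; this is not the actual SpamBayes API):

def token_prob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    # Same formula as the snippet above, with the default option values inlined.
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    prob = spamratio / (hamratio + spamratio)
    n = hamcount + spamcount
    return (s * x + n * prob) / (s + n)

# e.g. a token seen in 3 of 1000 spams and 1 of 1000 hams:
# token_prob(3, 1, 1000, 1000) -> about 0.72, pulled toward 0.5 because n is small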

Tony Meyer
Thanks a lot for your answer, I'm going to check this. I'm currently using the total token count, as it's more practical than using the post/message count: you don't have to keep a separate counter for posts/messages. This is especially useful in my case, as I save the corpus stats in a file (i.e. the tokens and how many times each was seen) so I don't have to rescan all the posts every time the corpus needs to be updated (there could be too many posts to scan at one time).
Waleed Eissa
So, I save the stats to a file and update it incrementally; this could easily get messy if I used the post count (it could get out of sync with the posts actually scanned, for example in case of an error).
Waleed Eissa
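
(A minimal sketch of the kind of incremental stats file described in these comments; the file format and names are illustrative, not the asker's actual code.)

import collections
import os

def update_corpus_stats(stats_path, post_tokens, is_spam):
    # Stats file: one "token<TAB>spam_count<TAB>ham_count" line per token.
    counts = collections.defaultdict(lambda: [0, 0])
    if os.path.exists(stats_path):
        with open(stats_path, encoding="utf-8") as f:
            for line in f:
                token, spam_n, ham_n = line.rstrip("\n").split("\t")
                counts[token] = [int(spam_n), int(ham_n)]
    for token in post_tokens:
        counts[token][0 if is_spam else 1] += 1
    with open(stats_path, "w", encoding="utf-8") as f:
        for token, (spam_n, ham_n) in counts.items():
            f.write(f"{token}\t{spam_n}\t{ham_n}\n")
    # Total token counts (option #3) can be derived by summing the stored counts,
    # so no separate post/message counter has to be kept in sync.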