I recently wrote a Bayesian spam filter, using Paul Graham's article "A Plan for Spam" and a C# implementation of it that I found on CodeProject as references.

I just noticed that the CodeProject implementation uses the total number of unique tokens when calculating the probability of a token being spam (e.g. if the ham corpus contains 10000 tokens in total but only 1500 unique tokens, then 1500 is used as ngood), whereas in my implementation I used the number of posts, as mentioned in Paul Graham's article. This makes me wonder which of the following should be used in calculating the probability (see the sketch after the list for where this count plugs into the formula):

  1. Post count (as mentioned in Paul Graham's article)
  2. Total unique token count (as used in the implementation on CodeProject)
  3. Total token count
  4. Total included token count (i.e. those tokens with b + g >= 5)
  5. Total unique included token count
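
For reference, here is a minimal Python sketch of Graham's token-probability formula, roughly as given in "A Plan for Spam", showing where the disputed counts (ngood/nbad) plug in. The names and structure are illustrative, not taken from either implementation above:

def token_spam_prob(b, g, nbad, ngood):
    # b, g: occurrences of the token in the spam (bad) and ham (good) corpora
    # nbad, ngood: the disputed denominators -- post counts in Graham's article,
    #              unique or total token counts in the other options listed above
    g = 2 * g                        # Graham doubles the ham count to bias against false positives
    if b + g < 5:                    # ignore tokens that have hardly been seen
        return None
    spam_ratio = min(1.0, b / nbad)
    ham_ratio = min(1.0, g / ngood)
    return max(0.01, min(0.99, spam_ratio / (spam_ratio + ham_ratio)))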
A: 

Can you alter your code to use the other methods? Then you could test with a different data set, and post the results.

Simeon Pilgrim
Actually, I don't have a big enough corpus of ham and spam, so it's hard to test this properly. I'm using #3 for now, as it seems to make sense to me (it also makes updating the corpus easier than using the post count).
Waleed Eissa
You probably don't need a large corpus to train your filter on. Check out http://entrian.com/sbwiki/TrainingIdeas for a good outline of what SpamBayes developers have found to be effective.
ScottS
A: 

You may want to look at POPFile, a time-tested Perl implementation. It does a very good job. I'm pretty sure it is open source, so you could see what formula they use.

Jeff Martin
+2  A: 

This EACL paper by Karl-Michael Schneider (PDF) says you should use the multinomial model, meaning the total token count, for calculating the probability. Please see the paper for the exact calculations.
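
(For illustration only, not taken from the paper: a minimal Python sketch of a multinomial model, where each class's token probabilities are normalised by that class's total token count. The Laplace smoothing and variable names here are assumptions.)

import math

def multinomial_spam_score(tokens, spam_counts, ham_counts,
                           spam_total, ham_total, vocab_size, prior_spam=0.5):
    # Multinomial model: the denominator is the TOTAL token count of each corpus
    # (plus vocab_size for Laplace smoothing), not the post count.
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for tok in tokens:
        log_spam += math.log((spam_counts.get(tok, 0) + 1) / (spam_total + vocab_size))
        log_ham += math.log((ham_counts.get(tok, 0) + 1) / (ham_total + vocab_size))
    # For very long messages, compare the log scores directly to avoid overflow.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))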

Yuval F
A: 

In general, most filters have moved past the algorithms outlined in Graham's paper. My suggestion would be to get the SpamBayes source and read the comments in spambayes/classifier.py (particularly) and spambayes/tokenizer.py (especially at the top). There's a lot of history there about the early experiments that were done to evaluate decisions like this.

FWIW, in the current SpamBayes code, the probability is calculated thusly (spamcount and hamcount are the number of messages in which the token has been seen (any number of times), and nham and nspam are the total number of messages):

hamratio = hamcount / nham                 # fraction of ham messages containing the token
spamratio = spamcount / nspam              # fraction of spam messages containing the token
prob = spamratio / (hamratio + spamratio)  # raw per-token spam probability
S = options["Classifier", "unknown_word_strength"]
StimesX = S * options["Classifier", "unknown_word_prob"]
n = hamcount + spamcount
prob = (StimesX + n * prob) / (S + n)      # smooth toward the unknown-word prior when n is small

unknown_word_strength is (by default) 0.45, and unknown_word_prob is (by default) 0.5.
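
To make the effect concrete, here is the same calculation as a small standalone function with those defaults plugged in (illustrative only; this is not the actual SpamBayes API):

def token_prob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    # Same formula as the snippet above, with the default option values inlined.
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    prob = spamratio / (hamratio + spamratio)
    n = hamcount + spamcount
    return (s * x + n * prob) / (s + n)

# e.g. a token seen in 3 of 1000 spams and 1 of 1000 hams:
# token_prob(3, 1, 1000, 1000) -> about 0.72, pulled toward 0.5 because n is small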

Tony Meyer
Thanks a lot for your answer, I'm going to check this. I'm currently using the total token count, as it's more practical than using the post/message count: you don't have to keep a separate counter for posts/messages. This is especially useful in my case, as I save the corpus stats in a file (i.e. the tokens and how many times each was seen) so I don't have to rescan all the posts every time the corpus needs to be updated (there could be too many posts to scan at one time).
Waleed Eissa
So, I save the stats to a file and update it incrementally; this could easily get messy if I used the post count (it could get out of sync with the posts actually scanned, for example in case of an error).
Waleed Eissa
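
(A minimal sketch of the kind of incremental stats file described in these comments; the file format and names are illustrative, not the asker's actual code.)

import collections
import os

def update_corpus_stats(stats_path, post_tokens, is_spam):
    # Stats file: one "token<TAB>spam_count<TAB>ham_count" line per token.
    counts = collections.defaultdict(lambda: [0, 0])
    if os.path.exists(stats_path):
        with open(stats_path, encoding="utf-8") as f:
            for line in f:
                token, spam_n, ham_n = line.rstrip("\n").split("\t")
                counts[token] = [int(spam_n), int(ham_n)]
    for token in post_tokens:
        counts[token][0 if is_spam else 1] += 1
    with open(stats_path, "w", encoding="utf-8") as f:
        for token, (spam_n, ham_n) in counts.items():
            f.write(f"{token}\t{spam_n}\t{ham_n}\n")
    # Total token counts (option #3) can be derived by summing the stored counts,
    # so no separate post/message counter has to be kept in sync.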