I recently wrote a Bayesian spam filter, using Paul Graham's article "A Plan for Spam" and a C# implementation of it that I found on CodeProject as references.
I just noticed that the CodeProject implementation uses the total number of unique tokens when calculating the probability of a token being spam (e.g. if the ham corpus contains 10,000 tokens in total but only 1,500 unique tokens, the 1,500 is used as ngood). In my implementation I used the number of posts, as described in Paul Graham's article. This makes me wonder which of these counts should be used as ngood (and nbad) in the probability calculation (a sketch of the formula follows the list):
- Post count (as mentioned in Paul Graham's article)
- Total unique token count (as used in the CodeProject implementation)
- Total token count
- Total included token count (i.e. those tokens with b + g >= 5)
- Total unique included token count
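For concreteness, here is a minimal sketch of the per-token probability as I understand it from Graham's article, where nGood/nBad are whichever counts from the list above get plugged in (message counts, in my case). The class, method, and parameter names are just illustrative, not taken from the CodeProject code:

```csharp
using System;

static class SpamFilterMath
{
    // Per-token spam probability following Graham's formula.
    // goodOccurrences/badOccurrences: how often the token appears in ham/spam.
    // nGood/nBad: the denominators in question (message counts in my filter).
    static double? TokenSpamProbability(int goodOccurrences, int badOccurrences,
                                        int nGood, int nBad)
    {
        // Graham doubles the ham count to bias the filter against false positives.
        double g = 2.0 * goodOccurrences;
        double b = badOccurrences;

        // Tokens with b + g < 5 carry too little evidence; skip them entirely.
        if (g + b < 5)
            return null;

        double pGood = Math.Min(1.0, g / nGood);
        double pBad  = Math.Min(1.0, b / nBad);

        // Clamp so no single token can completely dominate the combined score.
        return Math.Max(0.01, Math.Min(0.99, pBad / (pGood + pBad)));
    }
}
```

Whatever ends up in nGood/nBad directly scales pGood and pBad, so the choice between post count and (unique) token count shifts every token's probability, which is why I'd like to know which denominator is actually the right one.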