Hi guys,

I am implementing a naive Bayes classifier for spam filtering, and I have a doubt about one of the calculations. Could you please clarify what to do? Here is my question.

In this method, you have to calculate

P(S|W) = P(W|S) P(S) / (P(W|S) P(S) + P(W|H) P(H))

P(S|W) -> Probability that Message is spam given word W occurs in it.

P(W|S) -> Probability that word W occurs in a spam message.

P(W|H) -> Probability that word W occurs in a Ham message.

So to calculate P(W|S), should I do

(1) (Number of times W occurs in spam) / (Total number of times W occurs in all the messages)

OR

(2) (Number of times W occurs in spam) / (Total number of words in the spam messages)

So, to calculate P(W|S), should I do (1) or (2)? (I think it is (2), but I am not sure, so please clarify.)

I am referring to http://en.wikipedia.org/wiki/Bayesian_spam_filtering for the info, by the way.

I have to complete the implementation by this weekend :(

Thanks and regards,

MicroKernel :)

------------------------------------------------------------EDIT-----------------------------------------------------------------

@sth and @leonbloy:

Hmm... Shouldn't repeated occurrences of the word W increase a message's spam score? In your approach it wouldn't, right?

Let's take a scenario and discuss...

Let's say we have 100 training messages, of which 50 are spam and 50 are ham, and say the word count of each message is 100.

And let's say word W occurs 5 times in each spam message and 1 time in each ham message.

So the total number of times W occurs in all the spam messages = 5*50 = 250.

And the total number of times W occurs in all the ham messages = 1*50 = 50.

Total occurrences of W in all of the training messages = (250+50) = 300.

So, in this scenario, how do you calculate P(W|S) and P(W|H)?

Naturally we should expect P(W|S) > P(W|H), right?

Please share your thoughts...

+4  A: 

P(W|S) = (Number of spam messages containing W) / (Number of all spam messages)
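
A minimal sketch of this estimate in Python, assuming each message has already been split into a list of lowercase tokens. The function name `doc_frequency` and the toy messages are made up for illustration:

```python
def doc_frequency(word, messages):
    """Estimate P(word | class) as the fraction of the class's messages
    that contain the word at least once."""
    if not messages:
        raise ValueError("need at least one message to estimate a probability")
    containing = sum(1 for tokens in messages if word in tokens)
    return containing / len(messages)


# Toy training data (hypothetical, for illustration only).
spam = [
    ["buy", "viagra", "viagra", "now"],
    ["cheap", "viagra"],
    ["win", "money", "now"],
    ["viagra", "offer"],
]
ham = [
    ["meeting", "at", "noon"],
    ["viagra", "joke", "from", "bob"],
    ["lunch", "tomorrow"],
    ["see", "you", "now"],
]

p_w_given_s = doc_frequency("viagra", spam)  # 3 of the 4 spam messages
p_w_given_h = doc_frequency("viagra", ham)   # 1 of the 4 ham messages
print(p_w_given_s, p_w_given_h)
```

With this definition, a message that repeats the word is counted only once.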

sth
Hi sth, thanks for the answer. I have edited my question to add some more detail; please have a look and reply again. Thank you :)
Microkernel
+1  A: 

In this Bayesian formula, W is your "feature", i.e., the thing you observe.

You must first define carefully what W is. Often there are several alternatives.

Let's say that, as a first approach, you say W is the event "message contains the word Viagra". (That is to say, W has two possible values: 0 = "message does not contain the word V...", 1 = "message contains at least one occurrence of that word".)

In that scenario, you're right: P(W|S) is the "probability that word W appears (at least once) in a spam message." And to estimate it (a better word than "calculate"), you count, as the other answer says, "(Number of spam messages containing at least one word V) / (Number of all spam messages)".
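
Once P(W|S) and P(W|H) are estimated for this binary W, Bayes' rule turns them into P(S|W). A minimal sketch, assuming the standard single-feature form of the rule; the input values 0.75 and 0.25 and the equal priors are hypothetical, and `posterior_spam` is an illustrative name:

```python
def posterior_spam(p_w_given_s, p_w_given_h, p_s=0.5):
    """P(S | W) by Bayes' rule for a single binary feature W.

    p_s is the prior probability of spam; the ham prior is 1 - p_s.
    """
    p_h = 1.0 - p_s
    numerator = p_w_given_s * p_s
    denominator = numerator + p_w_given_h * p_h
    if denominator == 0:
        raise ValueError("W never occurs in either class; posterior is undefined")
    return numerator / denominator


# Hypothetical estimates: W appears in 75% of spam and 25% of ham messages.
p_s_given_w = posterior_spam(0.75, 0.25)
print(p_s_given_w)
```

With equal priors, the posterior depends only on the ratio between the two estimates.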

An alternative approach would be to define "W = number of occurrences of the word Viagra in a message". In this case, we should estimate P(W|S) for each value of W: P(W=0|S), P(W=1|S), P(W=2|S), ... This is more complicated and needs more samples, but (hopefully) performs better.
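
A rough sketch of this count-valued feature, applied to the scenario from the question's edit (50 spam messages with W five times each, 50 ham messages with W once each, 100 words per message). The token lists are synthetic and `count_distribution` is an illustrative name:

```python
from collections import Counter


def count_distribution(word, messages):
    """Estimate P(W = k | class) for every observed k, where W is the
    number of times `word` occurs in a message of that class."""
    if not messages:
        raise ValueError("need at least one message to estimate a probability")
    per_message_counts = Counter(tokens.count(word) for tokens in messages)
    total = len(messages)
    return {k: n / total for k, n in per_message_counts.items()}


# Synthetic messages matching the scenario in the question.
spam = [["w"] * 5 + ["other"] * 95 for _ in range(50)]
ham = [["w"] * 1 + ["other"] * 99 for _ in range(50)]

p_spam = count_distribution("w", spam)
p_ham = count_distribution("w", ham)

# Probability of seeing W exactly 5 times, given each class.
print(p_spam.get(5, 0.0), p_ham.get(5, 0.0))
```

Here P(W=5|S) is 1 and P(W=5|H) is 0, so repeated occurrences do separate the classes. Any count never seen in training gets probability zero, which is why this approach needs more samples (and, in practice, usually some form of smoothing).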

leonbloy
Hi leonbloy, thanks for the answer. I have edited my question to add some more detail; please have a look and reply again. Thank you :)
Microkernel
Your added text simply suggests that my second "alternative approach" should perform better. Probably, but at some cost. Just take it if you like; it's your choice. But first decide what your W is, and only then write the formula and work out how to estimate each term. BTW, if you are thinking in practical terms (and not simply as an exercise in Bayesian classification), this is still too naive; you'd have a very long way to go before getting a decent spam classifier.
leonbloy