ansaurus

Question

Implementing Naïve Bayes algorithm in Java - Need some guidance

Answer 1

+2 A:

What you are doing is almost correct.

         + Then to find P(yes | E) and P(no | E), I multiply the PDF values of all 4 given attributes and compare which is larger; the larger one indicates the class it belongs to.

Here, you forgot to multiply by the prior P(Yes) or P(No). Remember the decision formula:

P(Yes | E) ~= P(Attr_1 | Yes) * P(Attr_2 | Yes) * P(Attr_3 | Yes) * P(Attr_4 | Yes) * P(Yes)
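The decision rule above can be sketched in Java as follows. This is a minimal illustration, not the asker's code: the class name `GaussianNbSketch`, its method names, and the two-class 0/1 encoding are assumptions, and the per-attribute means and standard deviations are assumed to have been estimated from the training rows of each class already.

```java
/** Minimal Gaussian Naive Bayes decision rule for two classes (illustrative sketch). */
final class GaussianNbSketch {

    /** Gaussian probability density of x under a normal with the given mean and standard deviation. */
    static double gaussianPdf(double x, double mean, double stdDev) {
        double z = (x - mean) / stdDev;
        return Math.exp(-0.5 * z * z) / (stdDev * Math.sqrt(2.0 * Math.PI));
    }

    /** Unnormalized posterior: P(class) times the product of P(attr_i | class). */
    static double score(double[] x, double[] means, double[] stdDevs, double prior) {
        double s = prior; // the prior is the term that was missing in the question
        for (int i = 0; i < x.length; i++) {
            s *= gaussianPdf(x[i], means[i], stdDevs[i]);
        }
        return s;
    }

    /** Returns 1 ("yes") if the yes score is strictly larger, otherwise 0 ("no"). */
    static int classify(double[] x,
                        double[] meanYes, double[] sdYes, double priorYes,
                        double[] meanNo, double[] sdNo, double priorNo) {
        double yes = score(x, meanYes, sdYes, priorYes);
        double no = score(x, meanNo, sdNo, priorNo);
        return yes > no ? 1 : 0;
    }
}
```

Here the prior is a probability, not the class label itself: P(Yes) would typically be estimated as the number of training rows labelled yes divided by the total number of training rows, and P(No) likewise.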

For Naive Bayes (and any other supervised learning/classification algorithm), you need training data and testing data. You use the training data to train the model and make predictions on the testing data. You could simply use the training data as testing data, or you could split the CSV file into two pieces, one for training and one for testing. You could also do cross-validation on the CSV file.
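As a rough sketch of the cross-validation option, the rows of the CSV file can be dealt into k folds by index. The class name `KFoldSketch`, the method `folds`, and the fixed random seed are assumptions made for illustration; reading the CSV and training the classifier are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Splits row indices into k folds for cross-validation (illustrative sketch). */
final class KFoldSketch {

    /** Shuffles the indices 0..n-1 and deals them round-robin into k folds of near-equal size. */
    static List<List<Integer>> folds(int n, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        // A fixed seed keeps the split reproducible between runs.
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> result = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            result.add(new ArrayList<>());
        }
        for (int i = 0; i < n; i++) {
            result.get(i % k).add(indices.get(i));
        }
        return result;
    }
}
```

For each fold in turn, the rows in that fold serve as the testing data and the rows in the remaining folds as the training data; the k accuracy figures are then averaged.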

Yin Zhu 2010-05-24 13:42:20

Definitely use cross-validation if possible. Never test on your training data if you can avoid it.

Shaggy Frog 2010-05-25 23:57:21

@Shaggy, testing on the training data is an option, and a must for a new dataset or a newly implemented classifier. It tells you how well the optimization is done. If a classifier does not perform well on the training data, then it cannot be used for that data set. Performance on the training data can also be used for diagnostic purposes when writing a classifier.

Yin Zhu 2010-05-26 00:12:17

Answer 2

A:

Thank you for your answer, Yin Zhu.

But if you check my first post, all my data is numeric. Multiplying by zero will give an inaccurate result; that's why I didn't want to multiply by the prior P(Yes) or P(No), because the class is numeric, 0 or 1. (Are you sure I should?)

The zero attributes are also hampering my progress. For example, I know the Laplace correction adds 1 to the numerator and 3 to the denominator, but given that I'm using a PDF, do you still apply Laplace?

I have the mean and standard deviation for class Yes and class No, right? So I assumed that plugging the value of each attribute into the PDF formula would give the wanted value, but now I'm not sure, and I'm really confused about what happens if the value is 0.0. Some of the data I have in the set looks like:

0.00123,0.4567655,0.0,0.212222,1

(The last value is the class and values 1 to 4 are the attributes. That 0.0 would cause problems when I do P(Attr_1 | Yes) * P(Attr_2 | Yes) * ....)

Can you please help me out with this point as well?

I figured out what to do with the test data/training data: I'm using a 10-fold cross-validation technique.

techventure 2010-05-25 23:45:13

Do not reply to your own question to add more information. Edit your question or add a comment. Stack Overflow is not a forum.

Shaggy Frog 2010-05-25 23:58:01

You need to `smooth` the probability values, e.g. by adding 0.01 to P(Attr_i | Yes).

Yin Zhu 2010-05-26 00:22:51
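One way to read the smoothing suggestion is sketched below, assuming the constant is added to every likelihood term P(Attr_i | class) rather than to the raw attribute values. The class name `SmoothedLogScore`, the constant name `EPSILON`, and the use of log probabilities are assumptions for illustration. Note that an attribute value of 0.0 does not by itself produce a zero density under a Gaussian; a zero term arises when the density underflows for a value very far from the mean, and a zero standard deviation makes the density undefined.

```java
/** Smoothed, log-space Naive Bayes score (illustrative sketch). */
final class SmoothedLogScore {

    /** Smoothing constant taken from the comment above; the exact value is a tuning choice. */
    static final double EPSILON = 0.01;

    /** Gaussian probability density of x under a normal with the given mean and standard deviation. */
    static double gaussianPdf(double x, double mean, double stdDev) {
        double z = (x - mean) / stdDev;
        return Math.exp(-0.5 * z * z) / (stdDev * Math.sqrt(2.0 * Math.PI));
    }

    /**
     * log P(class) plus the sum of log(P(attr_i | class) + EPSILON).
     * Summing logs avoids the underflow that a long product of small densities can cause,
     * and EPSILON keeps every term finite even when a density underflows to 0.0.
     */
    static double logScore(double[] x, double[] means, double[] stdDevs, double prior) {
        double s = Math.log(prior);
        for (int i = 0; i < x.length; i++) {
            s += Math.log(gaussianPdf(x[i], means[i], stdDevs[i]) + EPSILON);
        }
        return s;
    }
}
```

The class with the larger log score is chosen, exactly as with the plain product, because the logarithm preserves ordering.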

Point taken, Shaggy Frog. I did so in order to use the text formatting commands like Code, since they just appear as normal text in comments made here. Yin Zhu: so in the case of my example (0.00123,0.4567655,0.0,0.212222,1), do I add 0.01 to the 0 value or to all attributes?

techventure 2010-05-26 01:04:20
