views: 98
answers: 1

I have started using the NaiveBayes/Simple classifier for classification in Weka, but I have some trouble understanding the results when training on the data. The data set I'm using is weather.nominal.arff.

When I use the "Use training set" option, the classifier results are:

Correctly Classified Instances          13               92.8571 %
Incorrectly Classified Instances         1                7.1429 %

 a b   <-- classified as
 9 0 | a = yes
 1 4 | b = no

My first question: what should I understand from the incorrectly classified instance? Why did such a problem occur? Which attribute combination was classified incorrectly? Is there a way to find this out?

Second, when I try 10-fold cross-validation, why do I get a different (lower) number of correctly classified instances?

The results are:

Correctly Classified Instances           8               57.1429 %
Incorrectly Classified Instances         6               42.8571 %

 a b   <-- classified as
 7 2 | a = yes
 4 1 | b = no
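
(For reference, both runs can also be reproduced outside the Explorer with the Weka Java API. The sketch below is only an illustration: the .arff path is a placeholder for your own copy of the file, and it uses `NaiveBayes` rather than `NaiveBayesSimple`.)

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class WeatherEvaluation {
        public static void main(String[] args) throws Exception {
            // load the nominal weather data (adjust the path to wherever your copy lives)
            Instances data = DataSource.read("weather.nominal.arff");
            data.setClassIndex(data.numAttributes() - 1);   // the class attribute ("play") is last

            // train NaiveBayes on the full data set
            NaiveBayes nb = new NaiveBayes();
            nb.buildClassifier(data);

            // 1) evaluate on the training set itself (the "Use training set" option)
            Evaluation trainEval = new Evaluation(data);
            trainEval.evaluateModel(nb, data);
            System.out.println(trainEval.toSummaryString("=== Evaluation on training set ===", false));
            System.out.println(trainEval.toMatrixString());

            // 2) 10-fold cross-validation on a fresh, untrained classifier
            Evaluation cvEval = new Evaluation(data);
            cvEval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
            System.out.println(cvEval.toSummaryString("=== 10-fold cross-validation ===", false));
            System.out.println(cvEval.toMatrixString());
        }
    }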
A (+5):

You can get the individual predictions for each instance by choosing this option from:

More Options... > Output predictions > PlainText

This will give you, in addition to the evaluation metrics, the following:

=== Predictions on training set ===

 inst#     actual  predicted error prediction
     1       2:no       2:no       0.704 
     2       2:no       2:no       0.847 
     3      1:yes      1:yes       0.737 
     4      1:yes      1:yes       0.554 
     5      1:yes      1:yes       0.867 
     6       2:no      1:yes   +   0.737 
     7      1:yes      1:yes       0.913 
     8       2:no       2:no       0.588 
     9      1:yes      1:yes       0.786 
    10      1:yes      1:yes       0.845 
    11      1:yes      1:yes       0.568 
    12      1:yes      1:yes       0.667 
    13      1:yes      1:yes       0.925 
    14       2:no       2:no       0.652 

which indicates that the 6th instance was misclassified. Note that even if you train and test on the same instances, misclassifications can occur due to inconsistencies in the data (the simplest example is having two instances with the same features but different class labels).
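
The same per-instance output can also be produced programmatically; a minimal sketch, assuming a trained NaiveBayes `nb` and the loaded `Instances` object `data` with its class index set (as in the setup sketch above):

    // print actual vs. predicted class for every instance
    for (int i = 0; i < data.numInstances(); i++) {
        weka.core.Instance inst = data.instance(i);
        double actual = inst.classValue();
        double predicted = nb.classifyInstance(inst);
        double[] dist = nb.distributionForInstance(inst);
        System.out.printf("inst#%3d  actual=%-3s predicted=%-3s %s %.3f%n",
                i + 1,
                data.classAttribute().value((int) actual),
                data.classAttribute().value((int) predicted),
                predicted != actual ? "+" : " ",   // "+" marks a misclassification, like Weka's output
                dist[(int) predicted]);            // probability assigned to the predicted class
    }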

Keep in mind that the above way of testing is biased (it's somewhat cheating, since the classifier has already seen the answers to the questions). Thus we are usually interested in getting a more realistic estimate of the model's error on unseen data. Cross-validation is one such technique: it partitions the data into 10 stratified folds, tests on one fold while training on the other nine, and finally reports the average accuracy across the ten runs.
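
To make the mechanics concrete, here is a rough, hypothetical helper that does by hand what `Evaluation.crossValidateModel` (used by the Explorer's cross-validation option) does for you; it is only a sketch of the partition/train/test/average idea:

    // Hypothetical helper illustrating stratified 10-fold cross-validation by hand.
    static double tenFoldAccuracy(weka.core.Instances data) throws Exception {
        java.util.Random rand = new java.util.Random(1);
        weka.core.Instances folds = new weka.core.Instances(data); // work on a copy
        folds.randomize(rand);
        folds.stratify(10);                 // keep the class proportions similar in every fold

        double correct = 0, total = 0;
        for (int i = 0; i < 10; i++) {
            weka.core.Instances train = folds.trainCV(10, i, rand); // nine folds for training
            weka.core.Instances test  = folds.testCV(10, i);        // the held-out fold for testing
            weka.classifiers.bayes.NaiveBayes nb = new weka.classifiers.bayes.NaiveBayes();
            nb.buildClassifier(train);
            for (int j = 0; j < test.numInstances(); j++) {
                if (nb.classifyInstance(test.instance(j)) == test.instance(j).classValue()) {
                    correct++;
                }
                total++;
            }
        }
        return correct / total;             // accuracy averaged over all held-out instances
    }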

Amro
Thanks for the clear answer and the Weka tip, +1. The confusing point is "biased": what do you mean? Should I always use cross-validation for all of my different classification algorithms?
berkay
Think about it: you want to learn a Naive Bayes net that models your data, and then you want to test its prediction accuracy. If you train the model and test it on the same set of instances, you are overestimating its accuracy (it has seen those particular examples and thus performs well on them), but it will probably be less successful on new data. The key point here is **generalization**: we want to generalize beyond the instances provided at "training time" to new, unseen examples.
Amro
Amro, thanks for the clear answers. I'm posting here to ask a question about the recall and precision of the cross-validation results. I compute recall = 7/(2+7) = 0.778 and precision = 1/(1+4) = 0.2, but Weka reports precision = 0.636. Any idea about this?
berkay
@berkay: that's not the correct computation. For `class=yes` we have `precision = 7/(7+4) = 0.636363` and `recall = 7/(7+2) = 0.777777`; the same logic applies for `class=no`: http://en.wikipedia.org/wiki/Precision_and_recall#Definition_.28classification_context.29
Amro
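
Spelled out with the counts read off the cross-validation confusion matrix above:

    // precision/recall for class = yes, from the cross-validation confusion matrix:
    //    a  b   <-- classified as
    //    7  2 |  a = yes
    //    4  1 |  b = no
    int tp = 7;                                  // "yes" instances predicted as "yes"
    int fn = 2;                                  // "yes" instances predicted as "no"
    int fp = 4;                                  // "no" instances predicted as "yes"
    double precision = (double) tp / (tp + fp);  // 7 / 11 = 0.636
    double recall    = (double) tp / (tp + fn);  // 7 / 9  = 0.778
    System.out.printf("precision=%.3f  recall=%.3f%n", precision, recall);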
@Amro okay, I got it. I was building the confusion matrix in a different way, which is why my numbers came out wrong. Thanks, Amro.
berkay