Hi,

I am trying to add Laplacian smoothing support to Biopython's Naive Bayes code for my Bioinformatics project.

I have read many documents about the Naive Bayes algorithm and Laplacian smoothing, and I think I got the basic idea, but I just can't integrate it with that code (actually, I cannot see to which part I should add the +1 Laplacian count).

I am not familiar with Python and I am a newbie coder. I would appreciate it if anyone familiar with Biopython could give me some suggestions.

+2  A: 

Try using this definition of the _contents() method instead:

def _contents(items, laplace=False):
    # count occurrences of values
    counts = {}
    for item in items:
        counts[item] = counts.get(item,0) + 1.0
    # normalize
    for k in counts:
        if laplace:
            # Laplace (add-one) smoothing: (count + 1) / (N + number of distinct values)
            counts[k] += 1.0
            counts[k] /= (len(items)+len(counts))
        else:
            counts[k] /= len(items)
    return counts

Then change the call on Line 194 into:

# Estimate P(value|class,dim)
nb.p_conditional[i][j] = _contents(values, True)

Use True to enable the smoothing, and False to disable it.
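
To see what the smoothing does on a single dimension, here is a quick check with a made-up list of values (not taken from the actual dataset):

# A toy list of values for one dimension, made up just for illustration.
values = ['Red', 'Red', 'Yellow', 'Yellow', 'Yellow']

plain = _contents(values)             # {'Red': 2/5, 'Yellow': 3/5}
smoothed = _contents(values, True)    # {'Red': (2+1)/(5+2), 'Yellow': (3+1)/(5+2)}

print(plain)     # Red ~ 0.40, Yellow ~ 0.60
print(smoothed)  # Red ~ 0.43, Yellow ~ 0.57

Note how the smoothed estimates are pulled slightly toward the uniform distribution.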

Here's a comparison of the output with/without the smoothing:

# without
>>> carmodel.p_conditional
[[{'Red': 0.40000000000000002, 'Yellow': 0.59999999999999998},
  {'SUV': 0.59999999999999998, 'Sports': 0.40000000000000002},
  {'Domestic': 0.59999999999999998, 'Imported': 0.40000000000000002}],
 [{'Red': 0.59999999999999998, 'Yellow': 0.40000000000000002},
  {'SUV': 0.20000000000000001, 'Sports': 0.80000000000000004},
  {'Domestic': 0.40000000000000002, 'Imported': 0.59999999999999998}]]

# with
>>> carmodel.p_conditional
[[{'Red': 0.42857142857142855, 'Yellow': 0.5714285714285714},
  {'SUV': 0.5714285714285714, 'Sports': 0.42857142857142855},
  {'Domestic': 0.5714285714285714, 'Imported': 0.42857142857142855}],
 [{'Red': 0.5714285714285714, 'Yellow': 0.42857142857142855},
  {'SUV': 0.2857142857142857, 'Sports': 0.7142857142857143},
  {'Domestic': 0.42857142857142855, 'Imported': 0.5714285714285714}]]

Aside from the above, I think there might be a bug with the code:

The code splits the instances according to their class, and then, for each class and each dimension, it counts how many times each value of that dimension appears.

The problem is that if, for the subset of instances belonging to one class, not all values of a dimension appear in that subset, then when the _contents() function is called it will not see all possible values, and thus will return the wrong probabilities.

I think you need to keep track of all the unique values for each dimension (over the entire dataset), and take them into consideration during the counting process.
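
One way to do that (just a rough sketch, not tested against the rest of the module; the all_values parameter is new here) is to pass the full set of values for a dimension into _contents() and build the counts over that set:

def _contents(items, all_values=None, laplace=False):
    # all_values is the full set of values this dimension can take
    # (collected over the entire training set by the caller).
    if all_values is None:
        all_values = set(items)
    # Start every possible value at zero, so values missing from this
    # class subset still get a (smoothed) probability instead of being dropped.
    counts = dict((v, 0.0) for v in all_values)
    for item in items:
        counts[item] += 1.0
    # Normalize, optionally with Laplace (add-one) smoothing.
    for k in counts:
        if laplace:
            counts[k] = (counts[k] + 1.0) / (len(items) + len(all_values))
        else:
            counts[k] /= len(items)
    return counts

The training code would then collect the unique values of each dimension over the whole dataset and pass them in when it calls _contents().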

Amro
Thanks very much! It helped me a lot. While the p_conditional probabilities are different when I try it with and without Laplacian smoothing, the accuracy didn't change. Is this something I should expect?
Limin
It all depends on the dataset used.
Amro
Also keep in mind that the code expects all dimension values to appear for each class label, otherwise I suspect the results would be incorrect (the possible bug I mentioned at the end)
Amro
thanks. i am trying to improve the algorithm. one last question if you are not busy: how can i see the maximum likelihood estimates of this algorithm? i am not very good at statistics so i couldn't figure it out.
Limin
that's what the `calculate()` function is for: it returns a dictionary where the entries are the log-likelihoods for each class (`log Pr(class_i|observation)` for each `i`). The prediction is the class with the highest likelihood.
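Roughly, reading those off looks like this (the observation here is made up; I'm assuming the model was trained with Bio.NaiveBayes.train() as before):

from Bio.NaiveBayes import calculate, classify

# A made-up observation: one value per dimension, in the same order used for training.
observation = ['Red', 'SUV', 'Domestic']

loglik = calculate(carmodel, observation)   # dict: {class: log-likelihood}
print(loglik)

best = max(loglik, key=loglik.get)          # class with the highest log-likelihood
print(best)                                 # should match classify(carmodel, observation)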
Amro
thanks a lot, i appreciate your help.
Limin