Try using this definition of the _contents() function instead:
    def _contents(items, laplace=False):
        # count occurrences of values
        counts = {}
        for item in items:
            counts[item] = counts.get(item, 0) + 1.0
        # normalize
        for k in counts:
            if laplace:
                # add-one (Laplace) smoothing over the values seen in items
                counts[k] += 1.0
                counts[k] /= (len(items) + len(counts))
            else:
                counts[k] /= len(items)
        return counts
Then change the call on Line 194 into:
    # Estimate P(value|class,dim)
    nb.p_conditional[i][j] = _contents(values, True)
Use True to enable the smoothing, and False to disable it.
Here's a comparison of the output with/without the smoothing:
# without
>>> carmodel.p_conditional
[[{'Red': 0.40000000000000002, 'Yellow': 0.59999999999999998},
{'SUV': 0.59999999999999998, 'Sports': 0.40000000000000002},
{'Domestic': 0.59999999999999998, 'Imported': 0.40000000000000002}],
[{'Red': 0.59999999999999998, 'Yellow': 0.40000000000000002},
{'SUV': 0.20000000000000001, 'Sports': 0.80000000000000004},
{'Domestic': 0.40000000000000002, 'Imported': 0.59999999999999998}]]
# with
>>> carmodel.p_conditional
[[{'Red': 0.42857142857142855, 'Yellow': 0.5714285714285714},
{'SUV': 0.5714285714285714, 'Sports': 0.42857142857142855},
{'Domestic': 0.5714285714285714, 'Imported': 0.42857142857142855}],
[{'Red': 0.5714285714285714, 'Yellow': 0.42857142857142855},
{'SUV': 0.2857142857142857, 'Sports': 0.7142857142857143},
{'Domestic': 0.42857142857142855, 'Imported': 0.5714285714285714}]]
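The smoothed numbers above can be checked by hand. A minimal sketch, assuming 5 instances per class and 2 observed values per dimension (both inferred from the output, not stated in the original code); laplace_estimate is a hypothetical helper that applies the same arithmetic as _contents() to a single count:

```python
def laplace_estimate(count, n_items, n_values, laplace=True):
    """Estimate P(value) from a raw count, optionally with add-one smoothing.

    Hypothetical helper, not part of the original code: it mirrors the
    normalization step of _contents() for one count at a time.
    """
    if laplace:
        return (count + 1.0) / (n_items + n_values)
    return float(count) / n_items


# Assumed from the output: 5 instances per class, 2 distinct values per dimension.
red_plain = laplace_estimate(2, 5, 2, laplace=False)  # 2/5
red_smooth = laplace_estimate(2, 5, 2)                # (2+1)/(5+2) = 3/7
suv_smooth = laplace_estimate(1, 5, 2)                # (1+1)/(5+2) = 2/7
sports_smooth = laplace_estimate(4, 5, 2)             # (4+1)/(5+2) = 5/7
```

Smoothing pulls every estimate toward the uniform value 1/2, which is why 0.2 moves up to about 0.286 and 0.8 moves down to about 0.714.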
Aside from the above, I think there might be a bug in the code:
The code splits the instances by class, and then, for each class and each dimension, it counts how many times each value of that dimension appears.
The problem is that if some value of a dimension never appears in the subset of instances belonging to one class, then _contents() never sees that value when it is called on that subset, so it returns the wrong probabilities: the missing value gets no entry at all, and with smoothing enabled the denominator undercounts the number of possible values.
I think you need to keep track of all the unique values for each dimension (from the entire dataset), and take them into account during the counting process.