Hi Guys,

For one of my course projects I started implementing a "Naive Bayesian classifier" in C. The project is to implement a document classifier application (especially a spam filter) using huge training data.

Now I have a problem implementing the algorithm because of the limitations of C's floating-point data types.

(The algorithm I am using is described here: http://en.wikipedia.org/wiki/Bayesian_spam_filtering)

PROBLEM STATEMENT: The algorithm involves taking each word in a document and calculating the probability of it being a spam word. If p_1, p_2, p_3, ..., p_n are the probabilities of words 1, 2, 3, ..., n, then the probability of the document being spam is calculated using

  p = (p_1 * p_2 * ... * p_n) / (p_1 * p_2 * ... * p_n + (1 - p_1) * (1 - p_2) * ... * (1 - p_n))

Here, a probability value can very easily be around 0.01, so even if I use the "double" data type my calculation goes for a toss. To confirm this I wrote the sample code given below.

#include <stdio.h>

#define PROBABILITY_OF_UNLIKELY_SPAM_WORD     (0.01)
#define PROBABILITY_OF_MOSTLY_SPAM_WORD       (0.99)

int main(void)
{
    int index;
    long double numerator = 1.0;
    long double denom1 = 1.0, denom2 = 1.0;
    long double doc_spam_prob;

    /* Simulating a FEW unlikely spam words */
    for (index = 0; index < 162; index++)
    {
        numerator = numerator * (long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD;
        denom2    = denom2    * (long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD;
        denom1    = denom1    * (long double)(1 - PROBABILITY_OF_UNLIKELY_SPAM_WORD);
    }
    /* Simulating a LOT of almost certainly spam words */
    for (index = 0; index < 1000; index++)
    {
        numerator = numerator * (long double)PROBABILITY_OF_MOSTLY_SPAM_WORD;
        denom2    = denom2    * (long double)PROBABILITY_OF_MOSTLY_SPAM_WORD;
        denom1    = denom1    * (long double)(1 - PROBABILITY_OF_MOSTLY_SPAM_WORD);
    }
    doc_spam_prob = (numerator / (denom1 + denom2));

    /* On platforms where the products underflow to 0, this yields 0/0 = NaN. */
    printf("%Lg\n", doc_spam_prob);
    return 0;
}

I tried float, double, and even long double, but I still have the same problem.

Hence, say in a 100K-word document I am analyzing, if just 162 words have a 1% spam probability and the remaining 99838 are conspicuously spam words, my app will still label it a Not Spam document because of the precision error (the numerator very easily goes to ZERO)!

This is the first time I am hitting such an issue. How exactly should this problem be tackled?

+2  A: 

You can express the probability in percent or per mille:

doc_spam_prob= (numerator*100/(denom1+denom2));

or

doc_spam_prob= (numerator*1000/(denom1+denom2));

or use some other scaling coefficient.

Андрей Костенко
+17  A: 

This happens often in machine learning. AFAIK, there's nothing you can do about the loss in precision directly. To bypass this, we use the log function and convert divisions and multiplications into subtractions and additions, respectively.

So I decided to do the math.

The original equation is:

  p = (p_1 * p_2 * ... * p_n) / (p_1 * p_2 * ... * p_n + (1 - p_1) * (1 - p_2) * ... * (1 - p_n))

I slightly modify it:

  eta = ln((1 - p_1)/p_1) + ln((1 - p_2)/p_2) + ... + ln((1 - p_n)/p_n)
  p   = 1 / (1 + e^eta)

If you need me to expand on this, please leave a comment.
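
In C this works out to something like the minimal sketch below (just an illustration: the array `p` of per-word probabilities and its length `n` are assumed to come from your tokenizer):

#include <math.h>
#include <stdio.h>

/* p = 1 / (1 + exp(eta)), where eta = sum of (ln(1 - p[i]) - ln(p[i])).
 * Working in the log domain avoids multiplying hundreds of tiny factors. */
double spam_probability(const double *p, size_t n)
{
    double eta = 0.0;
    size_t i;
    for (i = 0; i < n; i++)
        eta += log(1.0 - p[i]) - log(p[i]);
    return 1.0 / (1.0 + exp(eta));
}

int main(void)
{
    /* Same scenario as the question: 162 unlikely-spam words, 1000 near-certain ones. */
    double p[1162];
    int i;
    for (i = 0; i < 162; i++)  p[i]       = 0.01;
    for (i = 0; i < 1000; i++) p[162 + i] = 0.99;
    printf("%g\n", spam_probability(p, 1162));   /* prints a value very close to 1 */
    return 0;
}

Note that the formula degrades gracefully: if eta is hugely positive, exp(eta) overflows to infinity and p correctly rounds to 0; if eta is hugely negative, exp(eta) underflows to 0 and p rounds to 1.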

Jacob
+1. Interesting idea, although it does a lot more calculation and may not be necessary if not all `p_i` are close to 0.
back2dos
@back2dos - It's only unnecessary if *n* is small --- which is not the case most of the time.
Jacob
Working with probabilities in the log domain is pretty much the only sensible way to do the calculations. Log-likelihood ratios (the penultimate equation in Jacob's answer) are the easiest form to work with.
Adam Bowen
Sorry, that was not quite correct. It should be *if all `p_i` are "sufficiently well distributed"*. Please see my answer.
back2dos
Yeah, I got it. Thanks a lot for this. This is an awesome way to go about it. I will try this and the answer by back2dos to see which one suits my requirements best. I am really impressed by the elegance of your solution :) By the way, I have to use e = 2.72, right?
Microkernel
@Microkernel: Thanks :) - you can use the `exp` function http://www.codecogs.com/reference/c/math.h/exp.php. I.e. use `exp(eta)` instead of `pow(2.71828182845905, eta)`.
Jacob
@Microkernel: Also, with back2dos' answer, there is a risk of losing precision based on the values of the numerator and denominator. This method works for all values, and I don't think the `log` function is prohibitively expensive. Apart from that, this is the standard approach to this problem - use logarithms to simplify the problem into a series of summations wherever possible.
Jacob
This will still lose some accuracy when the individual `p_i` terms are very small; if that is an issue for your purposes, the solution is to replace the `ln(1 - p_i)` terms with `log1p(-p_i)`, which will not suffer from the same problem. (`log1p` is an underutilized gem of the C standard library)
Stephen Canon
@Stephen Canon: How small? And by accuracy, do you mean `exp(log(p_i)) - p_i`?
Jacob
If the binary exponent of `p_i` is `-n`, you should expect to lose `n-1` bits of accuracy in computing `log(1 - p_i)`. Thus, if `p_i` is `0.1` (binary exponent: `-3`), you lose 2 bits of accuracy that you could retain by instead using `log1p(-p_i)`. Obviously that isn't so bad, but if `p_i` is much smaller than that, the loss can be substantial. Whether or not the difference is worth worrying about depends on the distribution of the `p_i`s. If they are all small and similar in scale, then it matters a lot. If they are of greatly varied scale, it might not matter at all.
Stephen Canon
Note that in this particular usage, it won't matter, because when the `p_i` are small, the `log(p_i)` terms will dominate the `log(1 - p_i)` terms, and so loss of accuracy in the small terms will have a negligible effect on the final result. In more general usage, if you have a numerically sensitive computation that involves a term of the form `log(1 + x)`, you should consider replacing it with `log1p(x)`.
Stephen Canon
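
For anyone curious, here is a tiny standalone comparison of the two forms Stephen mentions, with an arbitrary illustrative value of `p`:

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* For small p, evaluating (1 - p) rounds away the low-order bits of p before
     * the logarithm is taken; log1p(-p) computes ln(1 - p) without that loss. */
    double p = 1e-12;
    printf("log(1 - p) = %.17g\n", log(1.0 - p));
    printf("log1p(-p)  = %.17g\n", log1p(-p));
    return 0;
}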
@Stephen: Thanks for the detailed response!
Jacob
@Jacob : oh right... Thanks a lot :)
Microkernel
@Stephen Canon: Yeah that could be an issue. Thanks a lot for suggestion. We will definitely use it :)
Microkernel
@Microkernel: Lol, no probs. Good luck with your classifier!
Jacob
A: 

I am not strong in math so I cannot comment on possible simplifications to the formula that might eliminate or reduce your problem. However, I am familiar with the precision limitations of long double types and am aware of several arbitrary and extended precision math libraries for C. Check out:

http://www.nongnu.org/hpalib/ and http://www.tc.umn.edu/~ringx004/mapm-main.html

Tom Cabanski
+2  A: 

Try computing the inverse 1/p. That gives you an equation of the form 1 + 1/(1-p1)*(1-p2)...

If you then count the occurrences of each probability--it looks like you have a small number of values that recur--you can use the pow() function--pow(1-p, occurrences_of_p)*pow(1-q, occurrences_of_q)--and avoid individual roundoff with each multiplication.
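
A sketch of that counting-plus-`pow()` idea; note that it is applied here to the product of ratios 1/p = 1 + ((1-p_1)/p_1) * ... * ((1-p_n)/p_n) used in the answers below rather than to the form stated above, and the two probability values and their counts are taken from the question. Each `powl()` call can still overflow or underflow on its own for extreme counts, which is what the other answers address:

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Group the repeated probability values and let powl() raise each
     * ratio to its count instead of multiplying factor by factor. */
    long double ratio_unlikely = (1.0L - 0.01L) / 0.01L;   /* 162 words at p = 0.01 */
    long double ratio_spam     = (1.0L - 0.99L) / 0.99L;   /* 1000 words at p = 0.99 */

    long double product = powl(ratio_unlikely, 162) * powl(ratio_spam, 1000);
    long double p = 1.0L / (1.0L + product);

    printf("%Lg\n", p);
    return 0;
}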

John Gordon
+1. Basically the right idea. Maybe it'll even suffice.
back2dos
That is **not** 1/p, see my answer. Even if you were right, it still involves multiplying (1-p_i), which can take on any value from 0 to 1, so if it takes on values close to 1, we're back to square one.
Jacob
+4  A: 

Here's a trick:

For the sake of readability, let S := p_1 * ... * p_n and H := (1-p_1) * ... * (1-p_n);
then we have:

  p = S / (S + H)
  p = 1 / ((S + H) / S)
  p = 1 / (1 + H / S)

Let's expand again:

  p = 1 / (1 +  ((1-p_1) * ... * (1-p_n)) / (p_1 * ... * p_n))
  p = 1 / (1 + (1-p_1)/p_1 * ... * (1-p_n)/p_n)

So basically, you will obtain a product of factors that can be quite large (between 0 and, for p_i = 0.01, 99). The idea is not to multiply tons of small numbers with one another and obtain, well, 0, but to form a product of quotients of two comparable numbers. For example, if n = 1000000 and p_i = 0.5 for all i, the original method will give you 0/(0+0), which is NaN, whereas the proposed method will give you 1/(1 + 1*...*1), which is 0.5.

You can get even better results when the p_i are sorted and you pair them up in opposed order (let's assume p_1 < ... < p_n); then the following formula gives even better precision:

  p = 1 / (1 + (1-p_1)/p_n * ... * (1-p_n)/p_1)

That way you divide big numerators (from small p_i) by big denominators (big p_(n+1-i)), and small numerators by small denominators.

Edit: MSalters proposed a useful further optimization in his answer. Using it, the formula reads as follows:

  p = 1 / (1 + (1-p_1)/p_n * (1-p_2)/p_(n-1) * ... * (1-p_(n-1))/p_2 * (1-p_n)/p_1)
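
A minimal C sketch of the paired-up formula (the array contents and driver are just the question's 162/1000 scenario, not part of the formula itself):

#include <stdio.h>
#include <stdlib.h>

/* Ascending comparator for qsort(). */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* p = 1 / (1 + product of (1 - p_i)/p_(n+1-i)) with the p_i sorted ascending,
 * so big numerators are divided by big denominators and vice versa. */
double spam_probability(double *p, size_t n)
{
    long double product = 1.0L;
    size_t i;
    qsort(p, n, sizeof *p, cmp_double);
    for (i = 0; i < n; i++)
        product *= (1.0L - p[i]) / p[n - 1 - i];
    return (double)(1.0L / (1.0L + product));
}

int main(void)
{
    double p[1162];
    int i;
    for (i = 0; i < 162; i++)  p[i]       = 0.01;
    for (i = 0; i < 1000; i++) p[162 + i] = 0.99;
    printf("%g\n", spam_probability(p, 1162));
    return 0;
}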

Hope that helps. :)

greetz
back2dos

back2dos
This is a really interesting idea... I will try this and the answer by Jacob to see which suits my requirements best. Thanks a lot :)
Microkernel
The "sort the terms" indeed works, but it works better if you dynamically pick either big or small terms to keep your intermediate result around 1.0. See my answer.
MSalters
@MSalters: good point. I think the best solution is to pair up probabilities in opposed order, as I did, to keep factors closer to 1, and then rearrange the factors in an alternating manner, as you proposed.
back2dos
Actually, I started with the same approach, and then noticed you'd get a runaway effect if you have a small number of extreme terms balanced by a large set of non-extreme terms. I.e. a few `p=0.01` balanced by lots of `p=0.51`. Initially you'd pair up the few `0.01` terms with `0.51` terms, and run off towards infinity. Afterwards you'd be pairing those `p=0.51` terms, and repeatedly multiplying infinity with 0.98. That just didn't work.
MSalters
+2  A: 

Your problem is caused by collecting too many terms without regard for their size. One solution is to take logarithms. Another is to sort your individual terms. First, let's rewrite the equation as 1/p = 1 + ∏((1-p_i)/p_i). Now your problem is that some of the terms are small, while others are big. If you have too many small terms in a row, you'll underflow, and with too many big terms you'll overflow the intermediate result.

So, don't put too many of the same order in a row. Sort the terms (1-p_i)/p_i. As a result, the first will be the smallest term, the last the biggest. Now, if you multiplied them straight away you would still have an underflow. But the order of calculation doesn't matter. Use two iterators into your temporary collection. One starts at the beginning (i.e. (1-p_0)/p_0), the other at the end (i.e. (1-p_n)/p_n), and your intermediate result starts at 1.0. Now, when your intermediate result is >= 1.0, you take a term from the front, and when your intermediate result is < 1.0 you take a term from the back.

The result is that as you take terms, the intermediate result will oscillate around 1.0. It will only go up or down as you run out of small or big terms. But that's OK. At that point, you've consumed the extremes on both ends, so the intermediate result will slowly approach the final result.

There's of course a real possibility of overflow. If the input is completely unlikely to be spam (p=1E-1000) then 1/p will overflow, because ∏((1-p_i)/p_i) overflows. But since the terms are sorted, we know that the intermediate result will overflow only if ∏((1-p_i)/p_i) overflows. So, if the intermediate result overflows, there's no subsequent loss of precision.
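
A sketch of this front/back strategy in C (the function and variable names are just illustrative, and the driver reuses the question's 162/1000 scenario):

#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Multiply the sorted terms (1-p_i)/p_i, taking a small term from the front
 * while the running product is >= 1.0 and a big term from the back otherwise,
 * so the intermediate result keeps oscillating around 1.0. */
double spam_probability(const double *p, size_t n)
{
    double *t = malloc(n * sizeof *t);
    double product = 1.0;
    size_t i, lo = 0, hi = n;            /* terms [lo, hi) are still unused */

    if (t == NULL)
        return -1.0;                     /* allocation failure */
    for (i = 0; i < n; i++)
        t[i] = (1.0 - p[i]) / p[i];
    qsort(t, n, sizeof *t, cmp_double);

    while (lo < hi)
        product *= (product >= 1.0) ? t[lo++] : t[--hi];

    free(t);
    return 1.0 / (1.0 + product);        /* since 1/p = 1 + product */
}

int main(void)
{
    double p[1162];
    int i;
    for (i = 0; i < 162; i++)  p[i]       = 0.01;
    for (i = 0; i < 1000; i++) p[162 + i] = 0.99;
    printf("%g\n", spam_probability(p, 1162));
    return 0;
}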

MSalters
+1. I updated my answer. I think the best approach is to combine both algorithms, since mine suffers less precision loss in the calculation of the factors, and yours less in the calculation of the overall product.
back2dos