Hi Guys,

For one of my course projects I started implementing a "Naive Bayesian classifier" in C. The project is to implement a document classifier application (especially a spam filter) using huge training data.

Now I have a problem implementing the algorithm because of the limitations of C's floating-point data types.

(The algorithm I am using is described here: http://en.wikipedia.org/wiki/Bayesian_spam_filtering)

PROBLEM STATEMENT: The algorithm involves taking each word in a document and calculating the probability of it being a spam word. If p_1, p_2, p_3, ..., p_n are the probabilities of words 1, 2, 3, ..., n, then the probability of the document being spam is calculated using

  p = (p_1 * p_2 * ... * p_n) / (p_1 * p_2 * ... * p_n + (1 - p_1) * (1 - p_2) * ... * (1 - p_n))

Here, a probability value can very easily be around 0.01, so even if I use the "double" data type my calculation goes for a toss. To confirm this I wrote the sample code given below.

#include <stdio.h>

#define PROBABILITY_OF_UNLIKELY_SPAM_WORD     (0.01)
#define PROBABILITY_OF_MOSTLY_SPAM_WORD       (0.99)

int main(void)
{
    int index;
    long double numerator = 1.0;
    long double denom1 = 1.0, denom2 = 1.0;
    long double doc_spam_prob;

    /* Simulating a FEW unlikely spam words */
    for (index = 0; index < 162; index++)
    {
        numerator = numerator * (long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD;
        denom2    = denom2    * (long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD;
        denom1    = denom1    * (long double)(1 - PROBABILITY_OF_UNLIKELY_SPAM_WORD);
    }
    /* Simulating a LOT of almost certainly spam words */
    for (index = 0; index < 1000; index++)
    {
        numerator = numerator * (long double)PROBABILITY_OF_MOSTLY_SPAM_WORD;
        denom2    = denom2    * (long double)PROBABILITY_OF_MOSTLY_SPAM_WORD;
        denom1    = denom1    * (long double)(1 - PROBABILITY_OF_MOSTLY_SPAM_WORD);
    }
    doc_spam_prob = (numerator / (denom1 + denom2));

    /* On platforms where the products underflow to 0, this yields 0/0 = NaN. */
    printf("%Lg\n", doc_spam_prob);
    return 0;
}

I tried float, double, and even long double, but I still have the same problem.

Hence, say in a 100K-word document I am analyzing, if just 162 words have a 1% spam probability and the remaining 99838 are conspicuously spam words, my app will still label it a Not Spam document because of the precision error (the numerator very easily goes to ZERO)!

This is the first time I am hitting such an issue. How exactly should this problem be tackled?

+2  A: 

You can express the probability in percent or per mille:

doc_spam_prob= (numerator*100/(denom1+denom2));

or

doc_spam_prob= (numerator*1000/(denom1+denom2));

or use some other scaling coefficient.

Андрей Костенко
+17  A: 

This happens often in machine learning. AFAIK, there's nothing you can do about the loss in precision directly. To bypass this, we use the log function and convert divisions and multiplications into subtractions and additions, respectively.

So I decided to do the math.

The original equation is:

  p = (p_1 * p_2 * ... * p_n) / (p_1 * p_2 * ... * p_n + (1 - p_1) * (1 - p_2) * ... * (1 - p_n))

I slightly modify it:

  eta = ln((1 - p_1)/p_1) + ln((1 - p_2)/p_2) + ... + ln((1 - p_n)/p_n)
  p   = 1 / (1 + e^eta)

If you need me to expand on this, please leave a comment.
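
In C this works out to something like the minimal sketch below (just an illustration: the array `p` of per-word probabilities and its length `n` are assumed to come from your tokenizer):

#include <math.h>
#include <stdio.h>

/* p = 1 / (1 + exp(eta)), where eta = sum of (ln(1 - p[i]) - ln(p[i])).
 * Working in the log domain avoids multiplying hundreds of tiny factors. */
double spam_probability(const double *p, size_t n)
{
    double eta = 0.0;
    size_t i;
    for (i = 0; i < n; i++)
        eta += log(1.0 - p[i]) - log(p[i]);
    return 1.0 / (1.0 + exp(eta));
}

int main(void)
{
    /* Same scenario as the question: 162 unlikely-spam words, 1000 near-certain ones. */
    double p[1162];
    int i;
    for (i = 0; i < 162; i++)  p[i]       = 0.01;
    for (i = 0; i < 1000; i++) p[162 + i] = 0.99;
    printf("%g\n", spam_probability(p, 1162));   /* prints a value very close to 1 */
    return 0;
}

Note that the formula degrades gracefully: if eta is hugely positive, exp(eta) overflows to infinity and p correctly rounds to 0; if eta is hugely negative, exp(eta) underflows to 0 and p rounds to 1.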

Jacob
+1. Interesting idea, although it does a lot more calculation and may not be necessary if not all `p_i` are close to 0.
back2dos
@back2dos - It's only unnecessary if *n* is small --- which is not the case most of the time.
Jacob
Working with probabilities in the log domain is pretty much the only sensible way to do the calculations. Log-likelihood ratios (the penultimate equation in Jacob's answer) are the easiest form to work with.
Adam Bowen
Sorry, that was not quite correct. It should be *if all `p_i` are "sufficiently well distributed"*. Please see my answer.
back2dos
Yeah, I got it. Thanks a lot for this. This is an awesome way to go about it. I will try this and the answer by back2dos to see which one suits my requirements best. I am really impressed by the elegance of your solution :) By the way, I have to use e = 2.72, right?
Microkernel
@Microkernel: Thanks :) - you can use the `exp` function http://www.codecogs.com/reference/c/math.h/exp.php. I.e. use `exp(eta)` instead of `pow(2.71828182845905, eta)`.
Jacob
@Microkernel: Also, with back2dos' answer, there is a risk of losing precision based on the values of the numerator and denominator. This method works for all values, and I don't think the `log` function is prohibitively expensive. Apart from that, this is the standard approach to this problem - use logarithms to simplify the problem into a series of summations wherever possible.
Jacob
This will still lose some accuracy when the individual `p_i` terms are very small; if that is an issue for your purposes, the solution is to replace the `ln(1 - p_i)` terms with `log1p(-p_i)`, which will not suffer from the same problem. (`log1p` is an underutilized gem of the C standard library)
Stephen Canon
@Stephen Canon: How small? And by accuracy, do you mean `exp(log(p_i)) - p_i`?
Jacob
If the binary exponent of `p_i` is `-n`, you should expect to lose `n-1` bits of accuracy in computing `log(1 - p_i)`. Thus, if `p_i` is `0.1` (binary exponent: `-3`), you lose 2 bits of accuracy that you could retain by instead using `log1p(-p_i)`. Obviously that isn't so bad, but if `p_i` is much smaller than that, the loss can be substantial. Whether or not the difference is worth worrying about depends on the distribution of the `p_i`s. If they are all small and similar in scale, then it matters a lot. If they are of greatly varied scale, it might not matter at all.
Stephen Canon
Note that in this particular usage, it won't matter, because when the `p_i` are small, the `log(p_i)` terms will dominate the `log(1 - p_i)` terms, and so loss of accuracy in the small terms will have a negligible effect on the final result. In more general usage, if you have a numerically sensitive computation that involves a term of the form `log(1 + x)`, you should consider replacing it with `log1p(x)`.
Stephen Canon
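
For anyone curious, here is a tiny standalone comparison of the two forms Stephen mentions, with an arbitrary illustrative value of `p`:

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* For small p, evaluating (1 - p) rounds away the low-order bits of p before
     * the logarithm is taken; log1p(-p) computes ln(1 - p) without that loss. */
    double p = 1e-12;
    printf("log(1 - p) = %.17g\n", log(1.0 - p));
    printf("log1p(-p)  = %.17g\n", log1p(-p));
    return 0;
}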
@Stephen: Thanks for the detailed response!
Jacob
@Jacob : oh right... Thanks a lot :)
Microkernel
@Stephen Canon: Yeah that could be an issue. Thanks a lot for suggestion. We will definitely use it :)
Microkernel
@Microkernel: Lol, no probs. Good luck with your classifier!
Jacob
A: 

I am not strong in math so I cannot comment on possible simplifications to the formula that might eliminate or reduce your problem. However, I am familiar with the precision limitations of long double types and am aware of several arbitrary and extended precision math libraries for C. Check out:

http://www.nongnu.org/hpalib/ and http://www.tc.umn.edu/~ringx004/mapm-main.html

Tom Cabanski
+2  A: 

Try computing the inverse 1/p. That gives you an equation of the form 1 + 1/(1-p1)*(1-p2)...

If you then count the occurrences of each probability--it looks like you have a small number of values that recur--you can use the pow() function--pow(1-p, occurrences_of_p)*pow(1-q, occurrences_of_q)--and avoid individual roundoff with each multiplication.
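
A sketch of that counting-plus-`pow()` idea; note that it is applied here to the product of ratios 1/p = 1 + ((1-p_1)/p_1) * ... * ((1-p_n)/p_n) used in the answers below rather than to the form stated above, and the two probability values and their counts are taken from the question. Each `powl()` call can still overflow or underflow on its own for extreme counts, which is what the other answers address:

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Group the repeated probability values and let powl() raise each
     * ratio to its count instead of multiplying factor by factor. */
    long double ratio_unlikely = (1.0L - 0.01L) / 0.01L;   /* 162 words at p = 0.01 */
    long double ratio_spam     = (1.0L - 0.99L) / 0.99L;   /* 1000 words at p = 0.99 */

    long double product = powl(ratio_unlikely, 162) * powl(ratio_spam, 1000);
    long double p = 1.0L / (1.0L + product);

    printf("%Lg\n", p);
    return 0;
}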

John Gordon
+1. Basically the right idea. Maybe it'll even suffice.
back2dos
That is **not** 1/p, see my answer. Even if you were right, it still involves multiplying (1-p_i), which can take on any value from 0 to 1, so if it takes on values close to 1, we're back to square one.
Jacob
+4  A: 

Here's a trick:

For the sake of readability, let S := p_1 * ... * p_n and H := (1-p_1) * ... * (1-p_n);
then we have:

  p = S / (S + H)
  p = 1 / ((S + H) / S)
  p = 1 / (1 + H / S)

Let's expand again:

  p = 1 / (1 +  ((1-p_1) * ... * (1-p_n)) / (p_1 * ... * p_n))
  p = 1 / (1 + (1-p_1)/p_1 * ... * (1-p_n)/p_n)

So basically, you will obtain a product of factors that can be quite large (between 0 and, for p_i = 0.01, 99). The idea is not to multiply tons of small numbers with one another and obtain, well, 0, but to form a product of quotients of two comparable numbers. For example, if n = 1000000 and p_i = 0.5 for all i, the original method will give you 0/(0+0), which is NaN, whereas the proposed method will give you 1/(1 + 1*...*1), which is 0.5.

You can get even better results when the p_i are sorted and you pair them up in opposed order (let's assume p_1 < ... < p_n); then the following formula gives even better precision:

  p = 1 / (1 + (1-p_1)/p_n * ... * (1-p_n)/p_1)

That way you divide big numerators (from small p_i) by big denominators (big p_(n+1-i)), and small numerators by small denominators.

Edit: MSalters proposed a useful further optimization in his answer. Using it, the formula reads as follows:

  p = 1 / (1 + (1-p_1)/p_n * (1-p_2)/p_(n-1) * ... * (1-p_(n-1))/p_2 * (1-p_n)/p_1)
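
A minimal C sketch of the paired-up formula (the array contents and driver are just the question's 162/1000 scenario, not part of the formula itself):

#include <stdio.h>
#include <stdlib.h>

/* Ascending comparator for qsort(). */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* p = 1 / (1 + product of (1 - p_i)/p_(n+1-i)) with the p_i sorted ascending,
 * so big numerators are divided by big denominators and vice versa. */
double spam_probability(double *p, size_t n)
{
    long double product = 1.0L;
    size_t i;
    qsort(p, n, sizeof *p, cmp_double);
    for (i = 0; i < n; i++)
        product *= (1.0L - p[i]) / p[n - 1 - i];
    return (double)(1.0L / (1.0L + product));
}

int main(void)
{
    double p[1162];
    int i;
    for (i = 0; i < 162; i++)  p[i]       = 0.01;
    for (i = 0; i < 1000; i++) p[162 + i] = 0.99;
    printf("%g\n", spam_probability(p, 1162));
    return 0;
}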

Hope that helps. :)

greetz
back2dos

back2dos
This is a really interesting idea... I will try this and the answer by Jacob to see which suits my requirements best. Thanks a lot :)
Microkernel
The "sort the terms" indeed works, but it works better if you dynamically pick either big or small terms to keep your intermediate result around 1.0. See my answer.
MSalters
@MSalters: good point. I think the best solution is to pair up probabilities in opposed order, as I did, to keep factors closer to 1, and then rearrange the factors in an alternating manner, as you proposed.
back2dos
Actually, I started with the same approach, and then noticed you'd get a runaway effect if you have a small number of extreme terms balanced by a large set of non-extreme terms. I.e. a few `p=0.01` balanced by lots of `p=0.51`. Initially you'd pair up the few `0.01` terms with `0.51` terms, and run off towards infinity. Afterwards you'd be pairing those `p=0.51` terms, and repeatedly multiplying infinity with 0.98. That just didn't work.
MSalters
+2  A: 

Your problem is caused by collecting too many terms without regard for their size. One solution is to take logarithms. Another is to sort your individual terms. First, let's rewrite the equation as 1/p = 1 + ∏((1-p_i)/p_i). Now your problem is that some of the terms are small, while others are big. If you have too many small terms in a row, you'll underflow, and with too many big terms you'll overflow the intermediate result.

So, don't put too many of the same order in a row. Sort the terms (1-p_i)/p_i. As a result, the first will be the smallest term, the last the biggest. Now, if you multiplied them straight away you would still have an underflow. But the order of calculation doesn't matter. Use two iterators into your temporary collection. One starts at the beginning (i.e. (1-p_0)/p_0), the other at the end (i.e. (1-p_n)/p_n), and your intermediate result starts at 1.0. Now, when your intermediate result is >= 1.0, you take a term from the front, and when your intermediate result is < 1.0 you take a term from the back.

The result is that as you take terms, the intermediate result will oscillate around 1.0. It will only go up or down as you run out of small or big terms. But that's OK. At that point, you've consumed the extremes on both ends, so the intermediate result will slowly approach the final result.

There's of course a real possibility of overflow. If the input is completely unlikely to be spam (p=1E-1000) then 1/p will overflow, because ∏((1-p_i)/p_i) overflows. But since the terms are sorted, we know that the intermediate result will overflow only if ∏((1-p_i)/p_i) overflows. So, if the intermediate result overflows, there's no subsequent loss of precision.
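
A sketch of this front/back strategy in C (the function and variable names are just illustrative, and the driver reuses the question's 162/1000 scenario):

#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Multiply the sorted terms (1-p_i)/p_i, taking a small term from the front
 * while the running product is >= 1.0 and a big term from the back otherwise,
 * so the intermediate result keeps oscillating around 1.0. */
double spam_probability(const double *p, size_t n)
{
    double *t = malloc(n * sizeof *t);
    double product = 1.0;
    size_t i, lo = 0, hi = n;            /* terms [lo, hi) are still unused */

    if (t == NULL)
        return -1.0;                     /* allocation failure */
    for (i = 0; i < n; i++)
        t[i] = (1.0 - p[i]) / p[i];
    qsort(t, n, sizeof *t, cmp_double);

    while (lo < hi)
        product *= (product >= 1.0) ? t[lo++] : t[--hi];

    free(t);
    return 1.0 / (1.0 + product);        /* since 1/p = 1 + product */
}

int main(void)
{
    double p[1162];
    int i;
    for (i = 0; i < 162; i++)  p[i]       = 0.01;
    for (i = 0; i < 1000; i++) p[162 + i] = 0.99;
    printf("%g\n", spam_probability(p, 1162));
    return 0;
}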

MSalters
+1. I updated my answer. I think the best approach is to combine both algorithms, since mine suffers less precision loss in the calculation of the factors, and yours less in the calculation of the overall product.
back2dos