views: 187
answers: 5

I need to do a binomial test in Python that allows calculation for 'n' numbers of the order of 10000.

I have implemented a quick binomial_test function using scipy.misc.comb; however, it is pretty much limited to around n = 1000, I guess because it hits the largest representable floating-point number while computing the factorials or the combinatorial itself. Here is my function:

from scipy.misc import comb
def binomial_test(n, k):
    """Calculate binomial probability
    """
    p = comb(n, k) * 0.5**k * 0.5**(n-k)
    return p

How could I use a native Python (or numpy, scipy...) function to calculate that binomial probability? If possible, I need scipy 0.7.2-compatible code.

Many thanks!

A: 

Not specifically a Python solution, but if you can deal with small fractional errors, you might try using Stirling's approximation for n!:

comb(n, k) = n!/(k! * (n-k)!), where n! is approximately sqrt(2*Pi*n)*(n/e)^n for large n.

For n>1000 the fractional errors should be very small.

For the probability calculation with large n, use logarithms for intermediate results:

log p = log(comb(n, k)) - n * log(2)

p = exp(log p)
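
A minimal sketch of this log-space idea (not part of the original answer, untested, and with illustrative function names): compute log(n!) with Stirling's approximation for large arguments and exactly for small ones, so the huge factorials are never formed explicitly.

from math import log, exp, pi, factorial

def log_factorial(m):
    """log(m!): exact for small m, Stirling's approximation otherwise."""
    if m < 20:
        return log(factorial(m))
    return 0.5 * log(2 * pi * m) + m * log(m) - m

def binomial_probability(n, k):
    """P(X = k) for X ~ Binomial(n, 0.5), computed entirely in log space."""
    log_comb = log_factorial(n) - log_factorial(k) - log_factorial(n - k)
    return exp(log_comb - n * log(2))

For n = 10000 there is no overflow; in the extreme tails the result simply underflows to 0.0, where the true probability is below the smallest representable float anyway.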

pwaldron
Using 10000! is going to be pretty heavy... is there not a way to avoid this? I'll have to use this test up to 10000s of times, so speed is an issue. Thanks!
Morlock
@Morlock: Look into using memoization if you're going to have repeated calls to functions performing heavy calculations
Daenyth
I don't think you're reading the expression correctly. Stirling's formula can be done manually on a pocket calculator in only a few seconds.
pwaldron
I get problems trying to calculate comb(n, k) for n > 1000. This is the original reason I'm trying to find an alternative to my code, shown in the question, which uses comb(n, k)... Cheers!
Morlock
+6  A: 

Edited to add this comment: please note that, as Daniel Stutzbach mentions, the "binomial test" is probably not what the original poster was asking for (though he did use this expression). He seems to be asking for the probability density function of a binomial distribution, which is not what I'm suggesting below.

Have you tried scipy.stats.binom_test?

rbp@apfelstrudel ~$ python
Python 2.6.2 (r262:71600, Apr 16 2009, 09:17:39) 
[GCC 4.0.1 (Apple Computer, Inc. build 5250)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from scipy import stats
>>> print stats.binom_test.__doc__

    Perform a test that the probability of success is p.

    This is an exact, two-sided test of the null hypothesis
    that the probability of success in a Bernoulli experiment
    is `p`.

    Parameters
    ----------
    x : integer or array_like
        the number of successes, or if x has length 2, it is the
        number of successes and the number of failures.
    n : integer
        the number of trials.  This is ignored if x gives both the
        number of successes and failures
    p : float, optional
        The hypothesized probability of success.  0 <= p <= 1. The
        default value is p = 0.5

    Returns
    -------
    p-value : float
        The p-value of the hypothesis test

    References
    ----------
    .. [1] http://en.wikipedia.org/wiki/Binomial_test


>>> stats.binom_test(500, 10000)
4.9406564584124654e-324

Small edit to add documentation link: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom_test.html#scipy.stats.binom_test

BTW: works on scipy 0.7.2, as well as on current 0.8 dev.

rbp
What versions of numpy and scipy do you have installed? The __doc__ part on my system (python 2.6.4, numpy 1:1.3.0-3, scipy 0.7.2) is different and I get 'binom_test(500, 10000) = 0.99999999999999989'. It should be easy to install numpy and scipy's most recent versions on ubuntu, only it ain't...
Morlock
Same thing here on OS X with Python 2.6.5, numpy 1.4.1, scipy 0.7.0: binom_test(500, 10000) = 0.99999…
EOL
I have the latest, numpy 1.4.1 and scipy 0.8.0b1. The online docs for scipy 0.7.2 are slightly different, but seem to mean the same thing: http://docs.scipy.org/doc/scipy-0.7.x/reference/generated/scipy.stats.binom_test.html#scipy.stats.binom_test . But I've just tested on a Debian machine, with Python 2.5.4, numpy 1.2.1 and scipy 0.7.0, and I get the same result as you (0.99999999999999989). Perhaps a bug on older versions of scipy? http://projects.scipy.org/scipy/ticket/986
rbp
That's not the function he's looking for. Based on his sample code, he's looking for the probability density function of a binomial distribution. (I'm not downvoting you, since he says he's looking for a "binomial test" function, which is confusing)
Daniel Stutzbach
Daniel-Wan: Makes sense, actually. I admit I didn't pay much attention to his code at first; I just read "binomial test", noticed he was running into an overflow, thought "Hey, I think scipy does that already" and went looking for it. I'll add a comment to my answer, and think about it some more :)
rbp
Actually, assuming this is what the original poster wanted, you seem to have provided a very nice answer already. I'm upvoting that, and waiting for Morlock to show up.
rbp
+1  A: 

I would look into gmpy, the Python bindings to the GNU Multiple Precision library, which allow you to perform arbitrary-precision calculations: you could probably do:

comb(n, k, exact=1)/2**k/2**(n-k)

but with the long integers of gmpy.

In fact, if you use exact integer computations, you can easily reach n=10000 for the combinations part; for this, you must use:

comb(n, k, exact=1)

instead of the floating point approximation comb(n, k), which overflows.

However, as the Original Poster noted, the returned (long) integer may be too large to be converted to a float for the multiplication!

Furthermore, one quickly runs into another problem: 0.5**1000=9.3…e-302 is already very close to the float underflow…

In summary: if you really need precise results for all k with n ~ 10,000, you need a different approach from the formula in the original post, which suffers from the limitations of double-precision floating-point arithmetic. Using gmpy as indicated above could be a solution (not tested!).
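
As a rough, untested sketch of that gmpy route (the function name is mine, and I'm assuming gmpy 1.x's mpq()/mpf() constructors here): exact rational arithmetic avoids both the overflow in comb() and the underflow in 0.5**n, and the final conversion to a gmpy float keeps gmpy's very wide exponent range.

from scipy.misc import comb
import gmpy

def binomial_probability_exact(n, k, prec=256):
    """P(X = k) for X ~ Binomial(n, 0.5) via exact rational arithmetic."""
    # comb(..., exact=1) is an exact Python long, so no overflow here
    q = gmpy.mpq(comb(n, k, exact=1), 2 ** n)
    # gmpy floats have a far wider exponent range than C doubles, so no underflow
    return gmpy.mpf(q, prec)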

EOL
Here is what I get when trying to use the result of comb(10000, 400, exact=1) : OverflowError: long int too large to convert to float :)
Morlock
… I also get this, but only when performing the multiplication. You must find a different approach than the original formula, because double precision floating point operations just cannot perform the required math.
EOL
You mean the problem is that I am trying to multiply a very large number with a very small one? I guess this is why I want a properly put up binomial test. :)
Morlock
The result of comb() is a long integer, and is exact. In order to multiply it by a float, a conversion of this number to a float is attempted, but it fails because floats are limited to about 1.8e308.
EOL
+5  A: 

Any solution that looks like comb(n, k) * 0.5**k * 0.5**(n-k) isn't going to work for large n. On most (all?) platforms, the smallest positive normalized value a Python float can store is around 2**-1022. For large k or large n-k, the 0.5**k * 0.5**(n-k) factor will get rounded to 0. Likewise, comb(n, k) can grow so large that it will not fit in a float.

A more robust approach is to compute the probability density function as the difference between two consecutive points in the cumulative distribution function, which can be computed using the regularized incomplete beta function (look in SciPy's "special functions" package). Mathematically:

pdf(p, n, k) = cdf(p, n, k) - cdf(p, n, k-1)
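
An untested sketch of that idea (the function name is mine; scipy.special.bdtr is the binomial CDF, implemented via the regularized incomplete beta function):

from scipy.special import bdtr

def binomial_pdf_via_cdf(p, n, k):
    """P(X = k) for X ~ Binomial(n, p), as a difference of two CDF values."""
    if k == 0:
        return bdtr(0, n, p)
    return bdtr(k, n, p) - bdtr(k - 1, n, p)

Note that very far out in the tails the two CDF values become nearly equal, so the subtraction itself loses precision there.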

Another option is to use the Normal approximation, which is quite accurate for large n. If speed is a concern, this is probably the way to go:

from math import sqrt, exp, pi
from scipy.misc import comb  # needed for the exact fallback below

def normal_pdf(x, m, v):
    """Normal density with mean m and variance v."""
    return 1.0/sqrt(2*pi*v) * exp(-(x-m)**2/(2*v))

def binomial_pdf(p, n, k):
    if n < 100:
        # Fall back to the exact formula (your current method, generalized to p != 0.5)
        return comb(n, k) * p**k * (1-p)**(n-k)
    return normal_pdf(k, n*p, n*p*(1.0-p))

I haven't tested the code, but that should give you the general idea.
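
As a quick, untested sanity check: at the centre of the distribution for n = 10000 and p = 0.5 the approximation evaluates to 1/sqrt(2*pi*2500), i.e. roughly 0.008:

binomial_pdf(0.5, 10000, 5000)   # ~0.00798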

Daniel Stutzbach
+1 for the normal approximation which should be near perfect. But your example code seems wrong. For the normal approximation, you have to take differences of the cumulative distribution function, not return the probability density function at the point. I.e. something like this: `norm.cdf(k + 0.5, n * p, sqrt(n * p * (1-p))) - norm.cdf(k - 0.5, n * p, sqrt(n * p * (1-p)))`. Also, when deciding whether to choose the exact or the approximate solution, you should take both n and p into account (see the rules of thumb in your wikipedia link).
stephan
@stephan: Do you have a good reference for the need to use the difference of the CDFs? It sounds plausible, but I didn't want to add complexity if I couldn't justify it. For large n and extreme p, we're stuck with the lesser of two evils. The fallback method is inaccurate for large n due to floating point limitations.
Daniel Stutzbach
@Daniel: I withdraw the word "wrong" and replace it with "less accurate" :-) The issue is the "continuity correction". For example, just look at how your approximation will perform for k=3 in the sample chart on your normal approximation wikipedia link. Take a look at this book http://books.google.com/books?id=zoVLF0VF9UYC (you can preview it), section 7.1.2.1 on p. 180 following: my formula is an application of the first formula on p. 181 with a = b. In the book, you will find a lot of even better approximations, like Camp-Paulson in section 7.1.7.
stephan
@stephan: The error at k=3 in that chart is due to the small n and will appear using the CDF as well: the area under the curve between k=2.5 to k=3.5 is too large. Try plotting n=100. :-)
Daniel Stutzbach
@Daniel: sure, but the error with continuity correction (cdf) will be smaller on average than pdf. Point taken, however: for n = 100 and p = 0.5, your approximation is good enough, so why bother with better but more complicated approximations. Let me know whether I shall delete my comments.
stephan
@stephan: Please keep them. Someone may someday find this post and need the pointers to more accurate methods. Thanks for pointing out the Camp-Paulson from that book, by the way. I hadn't known about that.
Daniel Stutzbach
+1 on keeping @stephan's comments. The discussion was enlightening :)
rbp
@Daniel: your `normal_pdf` already exists in SciPy: it's called `scipy.stats.norm.pdf()`.
EOL
+2  A: 

GMPY also supports extended precision floating point calculations. For example:

>>> from gmpy import *
>>>
>>> def f(n,k,p,prec=256):
...     return mpf(comb(n,k),prec) * mpf(p,prec)**k * mpf(1-p,prec)**(n-k)
...
>>> print(f(1000,500,0.5))
0.0252250181783608019068416887621024545529410193921696384762532089115753731615931
>>>

I specified a floating-point precision of 256 bits. By the way, the SourceForge version is way out of date. The current version is maintained at code.google.com and supports Python 3.x. (Disclaimer: I'm the current maintainer of gmpy.)

casevh
