views:

304

answers:

4

I just wrote this function to break a number into its prime factors. Is the algorithm I'm using a good one? How might I improve this function? So far it seems to be working quite well.

from copy import copy

# Decorator to memoize using the first argument only
@memoize_first
def prime_factors(n, prime_list=[None], prime_limit=[None]):
    """
    Get the prime factors of n.
    Returns a list of all factors
    Uses trial division
    """
    if n == 1: return None
    if is_prime(n): return None

    # Use mutability of default arguments to avoid recalculating the
    #  prime list unless we're called with a higher n than we've seen before
    if prime_limit == [None]:
        prime_limit[0] = n
    if prime_list == [None] or prime_limit[0] < n:
        prime_list.extend(list(primes(n))) # standard seive-based primes()
        prime_list.remove(None)
        prime_limit[0] = n

    factor_list = []

    # Copying because otherwise removing already tested 
    #  primes will mutate the cached list
    test_primes = copy(prime_list)

    while 1:
        # Pull first element because we mutate the list in the loop,
        #  so a for loop won't work
        prime = test_primes[0]
        # Prime list is ordered, so we've got a cached list larger 
        #  than the number to be tested
        if prime > n: break
        if n % prime == 0:
            remainder = n / prime # Relying on 2.6 floor division
            factor_list.append(prime)
            if is_prime(remainder):
                factor_list.append(remainder)
            else:
                factor_list.extend(prime_factors(remainder, test_primes))
            break
        else:
            test_primes.remove(prime)

    return filter(None, factor_list)

Edit

I accepted Alex's answer as it matches the intent of what I started with the most (memoizing). Everyone's answer was very enlightening though, and there's definitely wiser ways to do it. Thank you all!

+4  A: 

If by good, you mean "readable", then no. I have no idea how your code works, or if it does. My motivation to see how clever your optimization might be is None because I'm intentionally adopting the role of someone who has to read your code. It has a list copy, a list remove, and is recursive, it also does too many things, and you didn't bother to show your imports, particularly the decorator.

By those criteria, this is a good implementation:

def factor(n):
    """returns a list of the prime factors of integer n"""
    factors = []
    p = primes.generator() # yields an infinite list
    while n > 1:
        x = p.next()
        while n % x == 0:
            n = n/x
            factors.append(x)
    return factors

which lives in a module called primes so the name when used is primes.factor.

added for contrast:

I actually wrote the primes.factor function above when the GNU core-utils factor program declined to take inputs above ~ 10^20. In it, I a naive prime generator which was far too slow, so I surfed for a faster implementation and found:

def generator():
    """returns a generator that yields an infinite list of primes"""
    return itertools.ifilter(
        lambda n, Primes=[]:
            all(n%p for p in
                itertools.takewhile(lambda p, s=n**0.5: p <= s, Primes)
            ) and not Primes.append(n),
        itertools.count(2)
    )

which may have come from Alex Martelli's Python Cookbook but had lost its attribution by the time I found it. I don't know how it works. Since I already had the unit test in place for my naive generator and my factor routine, I am satisfied that it works for my purposes, but cannot claim that it works outside the bounds of my tests.

Since my intended use was as a replacement for GNU factor - by which I mean the occasional one shot use when helping my kid with her homework, memoization isn't useful. Indeed, if I had much factoring or prime generation to do, I wouldn't do it in Python. So there's another definition of "good": is it worth optimizing the hell out of something that perhaps shouldn't be done anyway?

added for clarity:

Upon rereading this multilogue, It appears that I may have been calling the implementation of generator here "bad" and hinted that it may have come from Alex's book. If it had (I don't have a copy available) it would not be bad - no matter who wrote it - in the context of being a well discussed, much tested opaque code snippet. If it did appear in the book, it would have also been in the context of a discussion and so would not have been "bad" there, either.

msw
My algorithm is mostly the same as yours. The differences are that instead of looping on n like you do, I recurse on it instead to allow memoization. My `primes()` function is the seive of Erastosthenes as I comment. `is_prime()` should be obvious, and the memoizer is commented as for function. I didn't list all imports because I'm asking mostly about the algorithm, and as long as the called functions work as expected (they do), they don't matter. I'll update the question though with more info.
Daenyth
"My algorithm is mostly the same as yours." - I'd probably agree if I bothered to read your implementation, which kinda suggests you missed the point of my answer. I've added to my answer to perhaps increase the contrast between "good" and "bad" as I have defined them here.
msw
Thanks for the clarification. By the way, the generator you posted works by starting at 2, then maintaining a list of primes and for each number in increasing order, trial divide by all the primes it already has, then storing it if it doesn't divide by any known primes.
Daenyth
For efficiency's sake you should break the loop when you exceed the square root of n, then n is prime. This is a big win, from O(n) to O(sqrt(n)). It seems the other answers have the same issue, but it is harder to spot since they are less readable.
starblue
@starblue: forgive me, as the coffee isn't up yet, but I agree in principle with your comment yet a quick test showed factor(1189) ⇒ 29 41; sqrt(1189) ~ 34.4; 41 > 34.4; I'm thinking aloud here, so you actually mean when x > sqrt(**residual**), yes?
msw
You test divisibility by all the primes less than sqrt(number), keep dividing number /= prime every time you find a factor, stop trying when you reach sqrt(number), if number > 1 after all that, means number = a new prime (probably not in you original list of primes).In your example above number = 1189, you try to divide 1189 by all primes from 2..31 proving that what is left 41 is prime in process.You can think of that as if you tried to divide 41 by all the primes in the range 2..31.(which you actually did!)
Robert William Hanks
Yes, I mean the residual (it is called n in your first procedure).
starblue
+4  A: 

I like @msw's answer for readability and solid structure: the generator of primes should live in its own function (and be memoized by itself) since that stream of primes can be used for whatever purpose is appropriate.

Without losing readability, factor can also be explicitly memoized -- I would not use a opaque decorator for the purpose (maybe leading to very inefficient recursion to try and exploit the memoization, as the OP mentions in a comment!), but one at a lower layer of abstraction, and therefore more under my control. E.g.... (note: "richly commented" version later if you find the comment-less version below hard to read - rather than, as I do, finding it easier to read thanks to its relative conciseness):

def factor(n, _memo={1: []}):
    """returns a list of the prime factors of integer n"""
    if n <= 0: raise ValueError("Can't factor %r" % n)
    localmemo = {}
    orgn = n
    p = primes.generator()  # yields an infinite iterable
    for x in p:
        if n in _memo:
            for k in localmemo:
                localmemo[k].extend(_memo[n])
            _memo.update(localmemo)
            return _memo[orgn]
        localmemo[n] = []
        if n % x == 0:
            n = n/x
            for k in localmemo:
                localmemo[k].append(x)
            p.send(x)  # get `x` again next time

Edit: original version didn't work correctly for numbers with a given prime factor occurring more than once -- here I'm assuming that the primes generator accepts a send which works as a push-back in order to yield that prime again at the next next (otherwise you'd have to nest another loop within the for to use that x factor as many times as necessary).

It's trickier to keep memos updated when going "top down" (as iteration naturally does here) than when going "bottom up" (as recursion will make things go), so I might have made some error in this -- and of course the readability won't be as good as @msw's clean solution -- but such optimizations (in different real-world cases, probably -- for actual factorization in the real world I'd use gmpy of course;-) may be worth some small sacrifice in readability and effort at perfecting the optimized (and thus less-clear) logic.

Edit: here's a longer version with more comments, for readers who prefer that style:

def factor(n, _memo={1: []}):
    """returns a list of the prime factors of integer n

       n must be > 0 (otherwise, raises ValueError).

       uses a stream of primes in increasing order generated
       by `primes.generator()` (q.v.).

       do **not** pass argument _memo, it's used for memoization
       (holding the list of factors for all ints that have already
        been factorized by this function in this process's past).
    """
    # get error cases out of the way first
    if n <= 0: raise ValueError("Can't factor %r" % n)
    # localmemo records all numbers which are being factorized
    # for the first time in this specific call to `factor`, each
    # with a list of corresponding factors found so far
    localmemo = {}
    # keep a copy of the original n since in the loop below n
    # gets decreased
    orgn = n
    p = primes.generator()  # yields an infinite iterable
    # look at each prime (the .send call below may cause a prime
    # to be looked at more than once since it's assumed to work
    # as a "push back" for this specific generator)
    for x in p:
        if n in _memo:  # we've factorized n already in the past
            # (or n is 1, which is always a key in _memo!)
            # so we're all done, mop up
            # every list of factors in localmemo gets all n's factors
            for k in localmemo:
                localmemo[k].extend(_memo[n])
            # add every localmemo list to _memo for future calls
            _memo.update(localmemo)
            # now orgn is in _memo (as it was in localmemo if it had
            # not already been in _memo it's been added) so we can just
            # index to get the corresponding list of factors
            return _memo[orgn]
        # start with an empty list since we don't know n's factors yet
        localmemo[n] = []
        if n % x == 0:  # x is a factor of n, so of everything we're factoring
            n = n/x
            for k in localmemo:
                localmemo[k].append(x)  # ...so add it to every entry in localmemo
            p.send(x)  # get `x` again next time (it might be a multiple factor!)

Edit: to add single-level pushback functionality to a generator that doesn't have it...:

def withpushback(aniterator):
  pushback = None
  while True:
    if pushback is not None:  # last client action was `send`
      to_yield = pushback
      while pushback is not None:
        pushback = yield to_yield  # iterate until a `next`!
    else:                     # last client action was `next`
      try: to_yield = next(aniterator)
      except StopIteration: break
    pushback = yield to_yield

# an example use...:
w = withpushback(iter(range(7)))
repetitions = {2: 3, 5: 2}
for p in w:
  print p,
  if repetitions.get(p):
    w.send(p)
    repetitions[p] -= 1
print

This shows 0 1 2 2 2 2 3 4 5 5 5 6 -- i.e., three reps of 2 and two of 5, as the repetitions dict specified. (The assumed pattern of accesses is alternating next and send, though with an attempt to survive multiple sends in a row;-).

Alex Martelli
Wow, your version is much better than the one I have. I didn't think of storing it in a local dict like that. Thanks for the `gmpy` tip, didn't know about that library. Very clear and concise answer :)
Daenyth
@Daenith, tx for the kudos -- though @msw's answer is shorter and thereby clearer (but admittedly it doesn't do memoizing, which is why I posted my own to show how to add memoizing to that kind of logic).
Alex Martelli
@Alex: confused how this works -- add comments? I would expect when n = 1 then the factor() method stops, but I don't see that here.
Jason S
@Jason, when n = 1 it gets found in the `_memo` (with a corresponding empty list of factors). I'm not sure commenting to repeat such information as "by the way, 1 is always present as a key in `_memo`" (which is totally obvious from reading the code -- less than 20 lines, so easily fitting even on a smallish screen and readable "at a glance") is a good idea -- if I commented repeating everything that's clearly visible in the code I'd end up with at least 40 lines, actually hampering readability except maybe on very large screens; what do you think _should_ be said twice or more?
Alex Martelli
@Jason, ah well, might as well put in a _second_ version (with very abundant comments) for clarity. Let me edit my A for the purpose...
Alex Martelli
@Alex: To correctly handle duplicate factors, I think `if n % x == 0:` might have to be `while n % x == 0:`.
unutbu
@unutbu, not quite sufficient, as in the if or while there is no check for n's presence in `_memo` -- so I've punted assuming the prime generator accepts a pushback so the same factor can be considered as many times as necessary (see the edit I just did to add an abundantly-commented version, which some readers apparently prefer to an actually readable one -- I took the opportunity to "kind of fix" this glitch there;-).
Alex Martelli
@Alex: How would you implement `primes.generator`? Here is my attempt: http://paste.ubuntu.com/466102/. It doesn't work however, since when you send the nth prime, it yields the (n+2)th prime on the next iteration.
unutbu
@unutbu, yep, the `next_prime` must not be called if there's a pushback. Worthy of another Q, but I guess I'll just edit this A to show how to implement pushback by generator wrapping.
Alex Martelli
@Alex: Thank you, Alex! That's very enlightening.
unutbu
@unutbu, you're welcome, tx for prompting me to write it down;-).
Alex Martelli
+3  A: 

Here some code for you to compare with, in general if you are going to use a list of primes to factor the number n, you will only need a list of primes <= int(sqrt(n))
For the small range of number that you want you can calculate the list of primes only once, and work with that list, avoiding recalculation or generating the primes over and over again every time you want the factorization of a number. ( i am assuming that you want to calculate lost of facttorization)

def primes1(n):
    """ Returns a list of primes < n """
    sieve = [True] * (n/2)
    for i in xrange(3,int(n**0.5)+1,2):
        if sieve[i/2]:
            sieve[i*i/2::i] = [False] * ((n-i*i-1)/(2*i)+1)
    return [2] + [2*i+1 for i in xrange(1,n/2) if sieve[i]]

def listfactorization(n):
    """ Returns a list of the prime factorization of n """
    pf = []
    for p in primeslist:      # this is my main point (different from OP & msw's)
      if p*p > n : break      # "if p > n: break", is a waste as in others answers
      while not n % p:        # only test divisibility up to sqrt(n)
        n /= p                
        pf.append(p)          # main point again (complementary of the above)
    if n > 1: pf.append(n)    # if n is still > 1 implies n is prime so append
    return pf                 

primeslist = primes1(10**6)   #to factorize numbers upto 10e12

#or 

primeslist = primes1(10**7)   #to factorize numbers upto 10e14

The fact is: to factor any number you only need to test primes up to sqrt(number).

To work with numbers up to 10e14, do primeslist = primes(10**7)

In general to work with numbers up to n, do primeslist = primes(int(n**0.5))

not recommend to try to generate with this code a primeslist greater than 10**7

That is just a simple pure python solution feasible to work with numbers up to 10e14,
for a faster one capable o reaching bigger numbers use numpy.
If you want you can turn the last code into a generator.

Robert William Hanks
That function is what I started with, and the rest of my differences are just using recursion to memoize it, and the bookkeeping surrounding it. Thanks though!
Daenyth
+2  A: 

If you wish to factorize very large numbers, use a different algorithm. The three (asymptotically) fastest factorization algorithms are

  1. general number field sieve
  2. multiple polynomial quadratic sieve
  3. elliptic curve factorization

I don't know of a Python implementation for the first two, though you'll find numerous implementations in C/C++ at the above links.

Here is a Python implementation of the elliptic curve factorization method. It factored 183758673828099 in about 0.27 seconds:

time pyecm 183758673828099
Factoring 183758673828099:
3
775353054127
79

real    0m0.268s
user    0m0.252s
sys 0m0.008s
unutbu
don't see the point, the time to make a list of primes up to 10**7 is 1,5s im my old machine, after that my solution should be almost instant(and that is only pure python code). worst case is 10**14-27.
Robert William Hanks
Thanks for pointing out the algorithms!
Daenyth
@~unutbu : nope that's what i am trying to say to everyone in this question, no one(including you) are listen, copy and paste my code in any python using primeslist = primes1(10**7), test your number and see for yourself.
Robert William Hanks
@~unutbu: you only need primeslist=primes1(10**7) to factor any number up to 10e14
Robert William Hanks
not too shabby, although - not surprisingly - GNU factor, a C implementation of Pollard's rho is about 17 times faster on my machine. Contrariwise, my implementation, which may finish in another hour or so (taking into account the 12 graduate credit hours I'd probably need to understand ECM) is approximately 1.3*10^5 times more efficient.
msw
@Robert: Your program happens to work for 183758673828099, but I don't think it is guaranteed to work for numbers that large. For example, `listfactorization(100000980001501)` returns `[100000980001501L]` while the correct answer is `10000019*10000079`
unutbu
@msw: finish what calculation "in another hour or so"?
Robert William Hanks
@Robert: factor 183758673828099 ;)
msw
@~unutbu: is you are correct that number in bigger than 10e14, my solution works in general, my program works for the nunber n if primeslist = primes(int(n**0.5)) as is stated in my answer. OP question required working solution up to 10e14 see his comments.
Robert William Hanks
@~unutbu: the fact is to factor any number you only need to test primes up to sqrt(number).
Robert William Hanks
@msw: my pure python solution below works in 0,1s or less than 2s if you count the time to make a list of primes up to 10**7
Robert William Hanks
i apologise to you guys up front if i am sounding rude, my english is not very good, difficult to express myself.
Robert William Hanks
@msw: my bad to factorize 183758673828099 you need primes up to int(183758673828099**0.5) = 13555761, so primeslist = primes1(13555761) in my code, wich takes ~2,5s total.
Robert William Hanks
@robert: Agreed. Don't worry, native English speakers don't understand my jokes either. Just to be clear I was saying that my algorithm is really very, very inefficient, but only needs the arithmetic of a child.
msw