views:

864

answers:

7

Up for consideration is the following function, which can be used to (relatively quickly) factor a 64-bit unsigned integer into its prime factors. Note that the factoring is not probabilistic (i.e., it is exact). The algorithm is already fast enough to determine, within several seconds on modern hardware, that a number is prime or has only a few very large factors.

The question: Can any improvements be made to the algorithm presented, while keeping it single-threaded, so that it can factor (arbitrary) very large unsigned 64-bit integers faster, preferably without using a probabilistic approach (e.g., Miller-Rabin) for determining primality?

// system specific typedef for ulong should go here (or use boost::uint64_t)
typedef unsigned __int64 ulong;
typedef std::vector<ulong> ULongVector;

// Caller needs to pass in an empty factors vector
void GetFactors(ULongVector &factors, ulong num)
{
  // Num has to be at least 2 to contain "prime" factors
  if (num<2)
    return;

  ulong workingNum=num;
  ulong nextOffset=2; // Will be used to skip multiples of 3, later

  // Factor out factors of 2
  while (workingNum%2==0)
  {
    factors.push_back(2);
    workingNum/=2;
  }

  // Factor out factors of 3
  while (workingNum%3==0)
  {
    factors.push_back(3);
    workingNum/=3;
  }

  // If all of the factors were 2s and 3s, done...
  if (workingNum==1)
    return;

  // sqrtNum is the (inclusive) upper bound of our search for factors
  ulong sqrtNum=(ulong) sqrt(double(workingNum+0.5));

  // Factor out potential factors that are greater than or equal to 5
  // The variable n represents the next potential factor to be tested
  for (ulong n=5;n<=sqrtNum;)
  {
    // Is n a factor of the current working number?
    if (workingNum%n==0)
    {
      // n is a factor, so add it to the list of factors
      factors.push_back(n);

      // Divide current working number by n, to get remaining number to factor
      workingNum/=n;

      // Check if we've found all factors
      if (workingNum==1)
        return;

      // Recalculate the new upper bound for remaining factors
      sqrtNum=(ulong) sqrt(double(workingNum+0.5));

      // Recheck if n is a factor of the new working number, 
      // in case workingNum contains multiple factors of n
      continue;
    }

    // n is not or is no longer a factor, try the next odd number 
    // that is not a multiple of 3
    n+=nextOffset;
    // Adjust nextOffset to be an offset from n to the next non-multiple of 3
    nextOffset=(nextOffset==2UL ? 4UL : 2UL);
  }

  // Current workingNum is prime, add it as a factor
  factors.push_back(workingNum);
}

Thanks

Edit: I've added even more comments. The reason the vector is passed in by reference is to allow it to be reused between calls, avoiding dynamic allocations. The reason the vector is not emptied in the function is to allow for the odd requirement of appending the current "num's" factors to factors already in the vector.

The function itself is not pretty and can be refactored, but the question is about how to make the algorithm faster. So, please, no suggestions about how to make the function prettier, more readable, or more C++ish. That's child's play. Improving this algorithm, so that it can find (proven) factors faster, is the difficult part.

Update: Potatoswatter has some excellent solutions so far; be sure to check out his SSE solution near the bottom, as well.

+1  A: 

This code is rather slower, and I'm pretty sure I understand why. It's not incredibly slower, but definitely slower, in the 10-20% range. The division should not be done on every pass through the loop, but the only way to avoid it is to call sqrt or something similar.

// system specific typedef for ulong should go here (or use boost::uint64_t)
typedef std::vector<ulong> ULongVector;

void GetFactors(ULongVector &factors, ulong num)
{
  if (num<2)
    return;

  ulong workingNum=num;
  ulong nextOffset=2;

  while (workingNum%2==0)
  {
    factors.push_back(2);
    workingNum/=2;
  }

  while (workingNum%3==0)
  {
    factors.push_back(3);
    workingNum/=3;
  }

  ulong n = 5;
  while ((workingNum != 1) && ((workingNum / n) >= n)) {
    // Is workingNum divisible by n?
    if (workingNum%n==0)
    {
      // n is a factor!
      // so is the number multiplied by n to get workingNum

      // Insert n into the list of factors
      factors.push_back(n);

      // Divide working number by n
      workingNum/=n;
    } else {
      n+=nextOffset;
      nextOffset=(nextOffset==2UL ? 4UL : 2UL);
    }
  }

  if (workingNum != 1) {
    // workingNum is prime, add it to the list of factors        
    factors.push_back(workingNum);
  }
}
Omnifarious
The code alternates between 2 and 4 in order to skip numbers divisible by 3, since all factors of 3 have already been factored out. Thank you for your effort. I will give your approach a try later today and time it against the original, to see if there is an incremental improvement in speed with this refactoring of the code...
Michael Goldshteyn
@Michael Goldshteyn - I just finished verifying that the offset trick you're doing will indeed filter out only multiples of 3. So yes, it will work.
Omnifarious
elimination of `sqrt` == good. Moving the workingNum == 1 and division into the loop condition == bad.
Ben Voigt
@Ben Voigt - Unfortunately, I don't think elimination of the sqrt really helps. That can be done in a single instruction fairly quickly on modern processors.
Omnifarious
@Ben: regarding the division in the loop, if he's really lucky Omnifarious could benefit from an instruction like `idiv` that produces the quotient and remainder together. Since the modulus is necessary, that would make the division basically free.
Steve Jessop
@Steve Jessop: An optimising compiler *might* be smart enough to figure that out on its own.
Greg Hewgill
@Greg: sorry, that's what I meant - he needs to have `idiv`, and his compiler has to use both results of a single op. If he's not lucky, his compiler doesn't figure it out, and does the same `idiv` twice (then again if `n` is a factor). I guess you thought I meant he could stick some asm in there? Which isn't a bad idea if the compiler hasn't handled it, just not what I was trying to suggest.
Steve Jessop
OK, let's keep inline assembly out of the equation. I was hoping more for algorithmic improvements or at least improvements within the confines of C++.
Michael Goldshteyn
@Omnifarious: single instruction doesn't mean fast. Some instructions have much higher latency than others (although with pipelining, speed is as much dependent on the inherent dependencies as on the actual instructions used). Besides, the old code did an int->float conversion, float addition, float sqrt, then float->int conversion, which is quite a bit more than just one instruction.
Ben Voigt
The reason that `sqrt` was so fast is because it was called once per factor, not once per iteration. I did anticipate that your divisions would take longer than the sqrt for that reason.
Ben Voigt
Also, I wonder how much it would help to make `n` only half the width of `num`.
Ben Voigt
@Greg Hewgill - I was hoping that it would figure it out on its own. I'm using gcc 4.5 to test. The assembly code was confusing enough that I couldn't quite tell. It really re-arranged things.
Omnifarious
@Ben Voigt - Integer square root using Newton-Raphson wouldn't take all that long.
Omnifarious
You guys are too obsessed with the role of the square root calculation. But, as mentioned already, it is only done once per factor, and there can be at most 63 factors, versus the billions of potential mod tests. So, please, concentrate on the bottleneck(s) in the "hot" part of the code.
Michael Goldshteyn
I implemented an approximate integer sqrt using gcc's `__builtin_clzll` function and shifting slightly fewer than half the bits out. It was a lot slower.
Omnifarious
@Omnifarious: not surprised. Hardware support for sqrt helps **a lot**. But yeah, that's the approach I would take if I didn't have an FPU. Note that using a lower bound on the sqrt (such as the next lower power-of-2) is good enough for finding all but the very last factor, once `n` reaches that number you could use binary search to find the real sqrt and you'd only have to do that once.
Ben Voigt
http://bits.stephan-brumme.com/squareRoot.html
Ben Voigt
+1  A: 

Incorporating some ideas from Omnifarious, plus other improvements:

// system specific typedef for ulong should go here (or use boost::uint64_t)
typedef unsigned __int64 ulong;
typedef std::vector<ulong> ULongVector;

// Caller needs to pass in an empty factors vector
void GetFactors(ULongVector &factors, ulong num)
{
  if (num<2)
    return;

  ulong workingNum=num;

  // Factor out factors of 2
  while (workingNum%2==0)
  {
    factors.push_back(2);
    workingNum/=2;
  }

  // Factor out factors of 3
  while (workingNum%3==0)
  {
    factors.push_back(3);
    workingNum/=3;
  }

  if (workingNum==1)
    return;

  // Factor out factors >=5
  ulong nextOffset=2;
  char nextShift = 1;
  ulong n = 5;
  ulong nn = 25;
  do {
    // Is workingNum divisible by n?
    if (workingNum%n==0)
    {
      // n is a factor!
      // so is the number multiplied by n to get workingNum

      // Insert n into the list of factors
      factors.push_back(n);

      // Divide working number by n
      workingNum/=n;

      // Test for done...
      if (workingNum==1)
        return;

      // Try n again
    }  
    else {
      nn += (n << (nextShift+1)) + (1<<(nextShift*2)); // (n+b)^2 = n^2 + 2*n*b + b^2
      n += nextOffset;
      nextOffset ^= 6;
      nextShift ^= 3;
      // invariant: nn == n*n
      if (n & 0x100000000LL) break; // careful of integer wraparound in n^2
    }
  } while (nn <= workingNum);

  // workingNum is prime, add it to the list of factors        
  factors.push_back(workingNum);
}
Ben Voigt
You need to swap the `n +=` and `nn +=` lines -- incrementing n first results in an nn value that is a little bit too large, so you might end the loop too soon. Try your program factoring 121 (11^2) to see.
Chris Dodd
@Chris: You are correct, good catch. Could also fix it by changing the sign on the final term (n^2 = (n-b)^2 + 2*n*b - b^2), but nice bonus of this order is that it will pipeline better because it isn't dependent on the just-calculated value of n.
Ben Voigt
Also found that parens are needed in the nn update: http://ideone.com/jOqzs
Ben Voigt
@PigBen: Yeah I just found that problem, missing parentheses. Would you try again?
Ben Voigt
Downvote removed. Retesting speed to see if it deserves an upvote.
PigBen
@Ben Voigt - Unfortunately, according to my testing, yours is slower than mine and takes nearly twice as long as the OP's. Mine takes approximately 10% longer. `sqrt` being a fast single-instruction operation on modern CPUs really makes removing it not that big of an improvement.
Omnifarious
Strange. How does it compare if you nuke `nn` and `nextShift` and just do `n*n` in the while condition? I guess that the term `(1<<(nextShift*2))` doesn't actually need to be computed at runtime; it could be toggled between the two values (4, 16) using xor as well. Also, I would have to see what code the compiler came up with for the wraparound test.
Ben Voigt
+1  A: 

The natural generalization is to precompute skip offsets using more known primes than just 2 and 3. Like 2, 3, 5, 7, 11, for a pattern period of 2310 (huh, nice number). And perhaps more, but it has diminishing returns; a graph of run times could establish exactly where the precomputation starts to have a negative impact, but of course it depends on the number of numbers to factor...

Hah, I leave the coding details to you folks. :-)

Cheers & hth.,

– Alf

Alf P. Steinbach
The point of diminishing returns is probably going to be determined by the icache, which implies an unrolling factor quite a bit smaller than 2000.
Ben Voigt
@Ben: note that with a pattern period of 6 (for the known primes 2 and 3) there are just 2 skip numbers, namely 2 and 4, so the data retained after precomputation is very much less than the period. I think the only thing to do is measure. It's very hard to predict such things! :-)
Alf P. Steinbach
I had the same thought, but the efficiency increase is significantly less (I suspect n^-x) for each set of skips that you add. So I think the point of diminishing returns would come fairly quickly. You might get away with adding 5, and maybe 7, but I bet after that it's not worth it. The only way to know, though, as you point out, is to measure.
Omnifarious
I tried many ideas, but they only slow the function down. I think the choice of literal (2 or 4) can be compiled into the code, without any use of memory beyond registers and literals. This is not true of a lookup table, which would incur a memory access - significantly slower than registers or literals baked into the actual instructions.
Michael Goldshteyn
+14  A: 

Compare such an approach to a (pre-generated) sieve. Modulo is expensive, so both approaches essentially do two things: generate potential factors, and perform modulo operations. Either program should reasonably generate a new candidate factor in fewer cycles than a modulo takes, so either program is modulo bound.

The given approach filters out a constant proportion of all integers, namely the multiples of 2 and 3, which is two-thirds of them. One in three (as given) numbers is used as an argument to the modulo operator. I'll call it a skip filter.

On the other hand, a sieve uses only primes as arguments to the modulo operator. By the prime number theorem, the density of primes near N is about 1/ln(N) (equivalently, the average gap between successive primes is about ln(N)). For example, e^20 is just under 500 million, so numbers around 500 million have about a 5% chance of being prime. If all numbers up to 2^32 are considered, 5% is a good rule of thumb.

Therefore, a sieve will spend roughly five to seven times less time on div operations than your skip filter. The next factor to consider is the speed at which the sieve produces primes, i.e., reads them from memory or disk. If fetching one prime is faster than the divs it saves, then the sieve is faster. According to my tables, div throughput on my Core2 is at most one per 12 cycles. These will be hard division problems, so let's conservatively budget 50 cycles per prime. For a 2.5 GHz processor, that's 20 nanoseconds.

In 20 ns, a 50 MB/sec hard drive can read about one byte. The simple solution is to use 4 bytes per prime, so the drive will be slower. But we can be more clever. If we want to encode all the primes in order, we can just encode their differences. Again, the expected difference is about ln(N). Also, the differences are all even, which saves an extra bit, and they are never zero, which makes extension to a multibyte encoding free. So using one byte per prime, differences up to about 500 can be stored in one byte, which gets us up to 303371455241 according to that Wikipedia article.

Therefore, depending on the hard drive, a stored list of primes should be about equal in speed at verifying primality. If it can be stored in RAM (it's 203 MB, so subsequent runs will probably hit the disk cache), then the problem goes away entirely, as the FSB speed typically differs from the processor speed by a factor less than the FSB width in bytes - i.e., the FSB can transfer more than one prime per cycle. The factor of improvement is then the reduction in division operations, i.e. roughly five times. This is borne out by the experimental results below.

Of course, then there is multithreading. Ranges of either primes or skip-filtered candidates can be assigned to different threads, making either approach embarrassingly parallel. There are no optimizations that don't involve increasing the number of parallel divider circuits, unless you somehow eliminate the modulo.

Here is such a program. It's templated so you could add bignums.

/*
 *  multibyte_sieve.cpp
 *  Generate a table of primes, and use it to factorize numbers.
 *
 *  Created by David Krauss on 10/12/10.
 *
 */

#include <cmath>
#include <bitset>
#include <limits>
#include <memory>
#include <fstream>
#include <sstream>
#include <iostream>
#include <iterator>
#include <stdint.h>
using namespace std;

char const primes_filename[] = "primes";
enum { encoding_base = (1<< numeric_limits< unsigned char >::digits) - 2 };

template< typename It >
unsigned decode_gap( It &stream ) {
    unsigned gap = static_cast< unsigned char >( * stream ++ );

    if ( gap ) return 2 * gap; // only this path is tested

    gap = ( decode_gap( stream )/2-1 ) * encoding_base; // deep recursion
    return gap + decode_gap( stream ); // shallow recursion
}

template< typename It >
void encode_gap( It &stream, uint32_t gap ) {
    unsigned len = 0, bytes[4];

    gap /= 2;
    do {
        bytes[ len ++ ] = gap % encoding_base;
        gap /= encoding_base;
    } while ( gap );

    while ( -- len ) { // loop not tested
        * stream ++ = 0;
        * stream ++ = bytes[ len + 1 ];
    }
    * stream ++ = bytes[ 0 ];
}

template< size_t lim >
void generate_primes() {
    auto_ptr< bitset< lim / 2 > > sieve_p( new bitset< lim / 2 > );
    bitset< lim / 2 > &sieve = * sieve_p;

    ofstream out_f( primes_filename, ios::out | ios::binary );
    ostreambuf_iterator< char > out( out_f );

    size_t count = 0;

    size_t last = sqrtl( lim ) / 2 + 1, prev = 0, x = 1;
    for ( ; x != last; ++ x ) {
        if ( sieve[ x ] ) continue;
        size_t n = x * 2 + 1; // translate index to number
        for ( size_t m = x + n; m < lim/2; m += n ) sieve[ m ] = true;
        encode_gap( out, ( x - prev ) * 2 );
        prev = x;
    }

    for ( ; x != lim / 2; ++ x ) {
        if ( sieve[ x ] ) continue;
        encode_gap( out, ( x - prev ) * 2 );
        prev = x;
    }

    cout << prev * 2 + 1 << endl;
}

template< typename I >
void factorize( I n ) {
    ifstream in_f( primes_filename, ios::in | ios::binary );
    if ( ! in_f ) {
        cerr << "Could not open primes file.\n"
                "Please generate it with 'g' command.\n";
        return;
    }

    while ( n % 2 == 0 ) {
        n /= 2;
        cout << "2 ";
    }
    unsigned long factor = 1;

    for ( istreambuf_iterator< char > in( in_f ), in_end; in != in_end; ) {
        factor += decode_gap( in );

        while ( n % factor == 0 ) {
            n /= factor;
            cout << factor << " ";
        }

        if ( n == 1 ) goto finish;
    }

    cout << n;
finish:
    cout << endl;
}

int main( int argc, char *argv[] ) {
    if ( argc != 2 ) goto print_help;

    unsigned long n;

    if ( argv[1][0] == 'g' ) {
        generate_primes< (1ul<< 32) >();
    } else if ( ( istringstream( argv[1] ) >> n ).rdstate() == ios::eofbit ) {
        factorize( n );
    } else goto print_help;

    return 0;

print_help:
    cerr << "Usage:\n\t" << argv[0] << " <number> -- factorize number.\n"
            "\t" << argv[0] << " g -- generate primes file in current directory.\n";
}

Performance on a 2.2 GHz MacBook Pro:

dkrauss$ time ./multibyte_sieve g
4294967291

real    2m8.845s
user    1m15.177s
sys    0m2.446s
dkrauss$ time ./multibyte_sieve 18446743721522234449
4294967231 4294967279 

real    0m5.405s
user    0m4.773s
sys 0m0.458s
dkrauss$ time ./mike 18446743721522234449
4294967231 4294967279
real    0m25.147s
user    0m24.170s
sys 0m0.096s
Potatoswatter
Well, the theory sure sounds interesting, but ultimately, let's see how it performs. So, when you fix the bug, please post your results here. I would especially be interested to hear how it handles very large primes (e.g., 18446744073709542713), and also how it handles large numbers consisting of only two large prime factors (e.g., 18446743721522234449).
Michael Goldshteyn
@Michael: See update. Now I need to try your program…
Potatoswatter
… and it takes 25 sec for me. Wow, you have a fast machine.
Potatoswatter
So the conclusion is the one we could expect: if you want to factor one big number once in a while, mike's code looks better, but if you are interested in factoring many numbers and disk space is not a problem, the sieve is way faster (five times faster for 20-digit numbers that are the product of two 10-digit primes). Compare the constant cost (generating the primes file) against the cost of one factorization times the number of calls.
wok
@wok: Pretty much. The overhead of having the table is surprisingly low, though. 75 seconds to generate that table, excluding I/O but including some OS overhead. I haven't tried the prime generator on a regular array instead of the `streambuf_iterator`, but it should be a bit faster if you run it that way. Right now I'm trying an optimization so it won't churn through memory so much. The break-even point is already just 3 primes to verify, I might get it down to 2.
Potatoswatter
I wonder if testing against the sieve could be effectively parallelized.
Omnifarious
@Omni: Certainly. Like I said, just divide it into subranges and assign one to each thread. Each thread will also need a checkpoint to be able to begin in the middle. After joining the threads, you'll need to merge their primes and get the final quotient. But all that should be easy.
Potatoswatter
I will let this question sit for another few days. If nobody else comes up with a significant improvement that doesn't involve a lookup table, or perhaps a better lookup-table variant, I will mark this answer as accepted, unless I find deficiencies, time-performance-wise, with this method in that time.
Michael Goldshteyn
@Michael: I'm not sure SO handles multiple answers from the same person well, so make sure you check out my new program down at the bottom of the page. It requires SSE2 but doesn't put anything on the heap.
Potatoswatter
+1  A: 

I'm not sure how effective these would be, but instead of

while (workingNum%2==0)

you could do

while ((workingNum & 1) == 0)

(note the parentheses: `==` binds more tightly than `&`, so `workingNum & 1 == 0` would evaluate as `workingNum & 0`). I'm not sure if gcc or msvc (or whatever compiler you are using) is clever enough to transform the workingNum%2 expression, but odds are it's doing the division and looking at the remainder in edx...

My next suggestion could be completely unnecessary depending on your compiler, but you could try putting the workingNum /= 3 before the method call. G++ might be smart enough to see the redundant division and just reuse the quotient in eax (you could do this inside your larger loop too). Or, a more thorough (but painful) approach would be to inline-assemble the following code.

while (workingNum%3==0)
{
  factors.push_back(3);
  workingNum/=3;
}

The compiler is probably translating the modulo operation into a division and then looking at the remainder in edx. The issue is, you are performing the division again (and I doubt the compiler sees that you just implicitly performed the division in the condition of the loop). So, you could inline-assemble this. This presents two issues:

1) The method call for push_back(3). This might mess with the registers, making this completely unnecessary.

2) Getting a register for workingNum, but this can be determined by doing an initial modular check (to force it into %eax), or at the current moment, it will/should be in eax.

You could write the loop as (assuming workingNum is in eax; this is 32-bit AT&T syntax, only because I don't know 64-bit assembly or Intel syntax)

asm( "
     movl     $3, %ebx      # idiv takes no immediate operand, so keep 3 in ebx
  WorkNumMod3Loop: 
     movl     %eax, %ecx    # just to be safe, back up workingNum
     movl     $0, %edx      # zero out edx
     idivl    %ebx          # divide by 3: quotient in eax, remainder in edx
     cmpl     $0, %edx      # compare the remainder with 0
     jne      AfterNumMod3Loop    # if the remainder is nonzero, jump out

     # no need to perform the division again: the new workingNum is already in eax
     #factors.push_back(3) call

     jmp      WorkNumMod3Loop
  AfterNumMod3Loop: 
     movl     %ecx, %eax    # restore workingNum from before the failed division"
);

You should look at the assembly output for those loops. It is possible your compiler is already making these optimizations, but I doubt it. I would not be surprised if putting workingNum /= n before the method call improved performance a little in some cases.

Kizaru
+5  A: 

My other answer is rather long and quite different from this one, so here's something else.

Rather than just filtering out multiples of the first two primes, or encoding all relevant primes in one byte each, this program filters out multiples of the first 47 primes, 2 through 211. So instead of passing 33% of numbers on to the division operator, this passes about 10%.

The primes are kept in three SSE registers, and their moduli with respect to the running counter are kept in another three. If the modulus of any prime with the counter is zero, the counter cannot be prime. Also, if any modulus equals one, then (counter+2) cannot be prime, and so on, up to (counter+30). Even numbers are ignored, and offsets like +3, +6, and +5 are skipped. Vector processing allows sixteen byte-sized variables to be updated at once.

After throwing a kitchen sink full o' micro-optimizations at it (but nothing more platform specific than an inline directive), I got a 1.78x performance boost (24 s vs 13.4 s on my laptop). If using a bignum library (even a very fast one), the advantage is greater. See below for a more readable, pre-optimization version.

/*
 *  factorize_sse.cpp
 *  Filter out multiples of the first 47 primes while factorizing a number.
 *
 *  Created by David Krauss on 10/14/10.
 *
 */

#include <cmath>
#include <sstream>
#include <iostream>
#include <xmmintrin.h>
using namespace std;

inline void remove_factor( unsigned long &n, unsigned long factor ) __attribute__((always_inline));
void remove_factor( unsigned long &n, unsigned long factor ) {
    while ( n % factor == 0 ) {
        n /= factor;
        cout << factor << " ";
    }
}

int main( int argc, char *argv[] ) {
    unsigned long n;

    if ( argc != 2
        || ( istringstream( argv[1] ) >> n >> ws ).rdstate() != ios::eofbit ) {
        cerr << "Usage: " << argv[0] << " <number>\n";
        return 1;
    }

    int primes[] = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
                     53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127,
                     131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211 };
    for ( int *p = primes; p < primes + sizeof primes/sizeof *primes; ++ p ) {
        remove_factor( n, * p );
    }

    //int histo[8] = {}, total = 0;

    enum { bias = 15 - 128 };
    __m128i const prime1 =       _mm_set_epi8( 21, 21, 21, 22, 22, 26, 26, 17, 19, 23, 29, 31, 37, 41, 43, 47 ),
            prime2 =             _mm_set_epi8( 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127 ),
            prime3 =             _mm_set_epi8( 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211 ),
            vbias = _mm_set1_epi8( bias ),
            v3 = _mm_set1_epi8( 3+bias ), v5 = _mm_set1_epi8( 5+bias ), v6 = _mm_set1_epi8( 6+bias ), v8 = _mm_set1_epi8( 8+bias ),
            v9 = _mm_set1_epi8( 9+bias ), v11 = _mm_set1_epi8( 11+bias ), v14 = _mm_set1_epi8( 14+bias ), v15 = _mm_set1_epi8( 15+bias );
    __m128i mod1 = _mm_add_epi8( _mm_set_epi8(  3, 10, 17,  5, 16,  6, 19,  8,  9, 11, 14, 15, 18, 20, 21, 23 ), vbias ),
            mod2 = _mm_add_epi8( _mm_set_epi8( 26, 29, 30, 33, 35, 36, 39, 41, 44, 48,  50,  51,  53,  54,  56,  63 ), vbias ),
            mod3 = _mm_add_epi8( _mm_set_epi8(  65,  68,  69,  74,  75,  78,  81,  83,  86,  89,  90,  95,  96,  98,  99, 105 ), vbias );

    for ( unsigned long factor = 1, limit = sqrtl( n ); factor <= limit + 30; factor += 30 ) {
        if ( n == 1 ) goto done;

        // up to 2^32, distribution of number candidates produced (0 up to 7) is
        // 0.010841     0.0785208   0.222928    0.31905     0.246109    0.101023    0.0200728   0.00145546 
        unsigned candidates[8], *cand_pen = candidates;
        * cand_pen = 6;
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1,  v3 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2,  v3 ), _mm_cmpeq_epi8( mod3,  v3 ) ) ) );
        * cand_pen = 10;                                                                                                                                            
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1,  v5 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2,  v5 ), _mm_cmpeq_epi8( mod3,  v5 ) ) ) );
        * cand_pen = 12;                                                                                                                            
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1,  v6 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2,  v6 ), _mm_cmpeq_epi8( mod3,  v6 ) ) ) );
        * cand_pen = 16;                                                                                                                            
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1,  v8 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2,  v8 ), _mm_cmpeq_epi8( mod3,  v8 ) ) ) );
        * cand_pen = 18;                                                                                                                            
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1,  v9 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2,  v9 ), _mm_cmpeq_epi8( mod3,  v9 ) ) ) );
        * cand_pen = 22;                                                                                                                            
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1, v11 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2, v11 ), _mm_cmpeq_epi8( mod3, v11 ) ) ) );
        * cand_pen = 28;                                                                                                                            
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1, v14 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2, v14 ), _mm_cmpeq_epi8( mod3, v14 ) ) ) );
        * cand_pen = 30;                                                                                                                            
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1, v15 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2, v15 ), _mm_cmpeq_epi8( mod3, v15 ) ) ) );

        /*++ total;
        ++ histo[ cand_pen - candidates ];

        cout << "( ";
        while ( cand_pen != candidates ) cout << factor + * -- cand_pen << " ";
        cout << ")" << endl; */

        mod1 = _mm_sub_epi8( mod1, _mm_set1_epi8( 15 ) ); // update residuals
        __m128i mask1 = _mm_cmplt_epi8( mod1, _mm_set1_epi8( 1+bias ) );
        mask1 = _mm_and_si128( mask1, prime1 ); // residual goes to zero or negative?
        mod1 = _mm_add_epi8( mask1, mod1 ); // combine reset into zero or negative

        mod2 = _mm_sub_epi8( mod2, _mm_set1_epi8( 15 ) );
        __m128i mask2 = _mm_cmplt_epi8( mod2, _mm_set1_epi8( 1+bias ) );
        mask2 = _mm_and_si128( mask2, prime2 );
        mod2 = _mm_add_epi8( mask2, mod2 );

        mod3 = _mm_sub_epi8( mod3, _mm_set1_epi8( 15 ) );
        __m128i mask3 = _mm_cmplt_epi8( mod3, _mm_set1_epi8( 1+bias ) );
        mask3 = _mm_and_si128( mask3, prime3 );
        mod3 = _mm_add_epi8( mask3, mod3 );

        if ( cand_pen - candidates == 0 ) continue;
        remove_factor( n, factor + candidates[ 0 ] );
        if ( cand_pen - candidates == 1 ) continue;
        remove_factor( n, factor + candidates[ 1 ] );
        if ( cand_pen - candidates == 2 ) continue;
        remove_factor( n, factor + candidates[ 2 ] );
        if ( cand_pen - candidates == 3 ) continue;
        remove_factor( n, factor + candidates[ 3 ] );
        if ( cand_pen - candidates == 4 ) continue;
        remove_factor( n, factor + candidates[ 4 ] );
        if ( cand_pen - candidates == 5 ) continue;
        remove_factor( n, factor + candidates[ 5 ] );
        if ( cand_pen - candidates == 6 ) continue;
        remove_factor( n, factor + candidates[ 6 ] );
    }

    cout << n;
done:
    /*cout << endl;
    for ( int hx = 0; hx < 8; ++ hx ) cout << (float) histo[hx] / total << " ";*/
    cout << endl;
}


dkrauss$ /usr/local/bin/g++ main.cpp -o factorize_sse -O3 --profile-use
dkrauss$ time ./factorize_sse 18446743721522234449
4294967231 4294967279 

real    0m13.437s
user    0m13.393s
sys 0m0.011s

Below is the first draft of the above. Optimizations included

  • Make the cyclic counter merge unconditional (avoid a branch).
  • Obtain ILP by unrolling loop by a factor of 15, increasing stride to 30.
    • Inspired by your optimization.
    • 30 seems to be a sweet spot, as it removes multiples of 2, 3, and 5 for free.
    • Primes between 5 and 15 may have several multiples in one stride, so put several copies at varied phase in the vector.
  • Factor out remove_factor.
  • Change conditional, unpredictable remove_factor calls to non-branching array writes.
  • Fully unroll final loop with remove_factor call and make sure the function is always inlined.
    • Eliminate the final unrolled iteration as there is always a multiple of 7 among the candidates.
  • Add another vector containing all the remaining primes that are small enough.
  • Make more space by adding a bias to the counters, and add yet another vector. Now there are only six more primes that could possibly be filtered without bumping up to 16 bits, and I've also run out of registers: the loop needs 3 prime vectors, 3 modulus vectors, 8 constants to search for, and one constant each to increment by and to do the range check. That makes 16.
    • The gain is minimal (but positive) in this application, but the original purpose of this technique was to filter primes for the sieve in the other answer. So stay tuned…
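The stride-30 idea is easier to see without SIMD: in every block of 30 consecutive integers, only 8 residues are coprime to 2, 3, and 5, so only those offsets can hold factors. A minimal scalar sketch of that wheel (names are mine, not taken from the SSE code):

```cpp
#include <cstdint>
#include <vector>

// Scalar illustration of the stride-30 wheel (not the SSE version):
// within each block of 30 consecutive integers, only these 8 offsets
// are coprime to 2, 3 and 5, so only they need trial division.
static const int wheel30[8] = { 1, 7, 11, 13, 17, 19, 23, 29 };

std::vector<uint64_t> factor_wheel30(uint64_t n) {
    std::vector<uint64_t> factors;
    for (uint64_t p : { 2ull, 3ull, 5ull })      // peel off the wheel primes
        while (n % p == 0) { factors.push_back(p); n /= p; }
    for (uint64_t base = 0; base * base <= n; base += 30)
        for (int off : wheel30) {
            uint64_t f = base + off;
            if (f < 7) continue;                 // skip 1 in the first block
            if (f * f > n) break;
            while (n % f == 0) { factors.push_back(f); n /= f; }
        }
    if (n > 1) factors.push_back(n);             // remaining prime cofactor
    return factors;
}
```

The SSE version gets its speedup by testing candidates against all the filter primes at once instead of walking the wheel.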

Readable version:

/*
 *  factorize_sse.cpp
 *  Filter out multiples of the first 17 primes while factorizing a number.
 *
 *  Created by David Krauss on 10/14/10.
 *
 */

#include <cmath>
#include <sstream>
#include <iostream>
#include <emmintrin.h> // SSE2 integer intrinsics (__m128i, _mm_set_epi8, ...)
using namespace std;

int main( int argc, char *argv[] ) {
    unsigned long n;

    if ( argc != 2
        || ( istringstream( argv[1] ) >> n >> ws ).rdstate() != ios::eofbit ) {
        cerr << "Usage: " << argv[0] << " <number>\n";
        return 1;
    }

    int primes[] = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59 };
    for ( int *p = primes; p < primes + sizeof primes/sizeof *primes; ++ p ) {
        while ( n % * p == 0 ) {
            n /= * p;
            cout << * p << " ";
        }
    }

    if ( n != 1 ) {
        __m128i       mod   = _mm_set_epi8( 1, 2, 3,  5,  6,  8,  9, 11, 14, 15, 18, 20, 21, 23, 26, 29 );
        __m128i const prime = _mm_set_epi8( 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59 ),
                      one = _mm_set1_epi8( 1 );

        for ( unsigned long factor = 1, limit = sqrtl( n ); factor < limit; ) {
            factor += 2;
            __m128i mask = _mm_cmpeq_epi8( mod, one ); // residual going to zero?
            mod = _mm_sub_epi8( mod, one ); // update other residuals
            if ( _mm_movemask_epi8( mask ) ) {
                mask = _mm_and_si128( mask, prime ); // reset cycle if going to zero
                mod = _mm_or_si128( mask, mod ); // combine reset into zeroed position

            } else while ( n % factor == 0 ) {
                n /= factor;
                cout << factor << " ";
                if ( n == 1 ) goto done;
            }
        }
        cout << n;
    }
done:
    cout << endl;
}
Potatoswatter
I don't think I'll give this particular version the checkmark, barring a revolutionary performance improvement, but I did give it a +1 for the extra effort and hope others do the same!
Michael Goldshteyn
@Michael: Updated with micro-optimizations. Now 25% faster… I'd put the upper bound on remaining possible gains around 10% compounded.
Potatoswatter
@Michael: Updated again. I underestimated the remaining improvement (positive jinx, I guess)… I've cut another 3 seconds off, which is 16% compound or 38% less time than your program. That is, close to twice as fast. Also PGO got me down to 14.5 sec once but I can't seem to reproduce it.
Potatoswatter
It's very possible that a stride of 42 (2*3*7), producing up to 10 candidates from a pool of 12, would be slightly faster, but I'll leave that as an exercise to the reader. (Hint: It could only filter out primes up to 37.)
Potatoswatter
@Michael: Updated again with more primes, cutting another 1.5 sec off the runtime. I haven't run it through the profiler, but I suspect there could be a further gain by balancing the OR operations between the integer and vector ALUs.
Potatoswatter
+3  A: 

Fermat's factorization method is simple and quick at finding pairs of large prime factors, as long as you stop it before it goes too far and becomes slow. However, in my tests on random numbers, such cases were too rare to show any overall improvement.
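For reference, here is a minimal sketch of Fermat's method (a hypothetical `fermat_factor` with a step budget, not code from any of the programs above). It assumes an odd n small enough that `a*a` fits in 64 bits; a robust version would use an integer square root:

```cpp
#include <cmath>
#include <cstdint>

// Fermat's method: search for a, b with n == a*a - b*b == (a-b)(a+b).
// Fast when the two factors are close together; max_steps caps the
// search before it degenerates into a slow scan.
// Returns the smaller factor, or 0 if the step budget runs out.
uint64_t fermat_factor(uint64_t n, uint64_t max_steps) {
    uint64_t a = (uint64_t)std::ceil(std::sqrt((double)n));
    for (uint64_t step = 0; step < max_steps; ++step, ++a) {
        uint64_t b2 = a * a - n;                  // candidate b^2
        uint64_t b = (uint64_t)std::sqrt((double)b2);
        if (b * b == b2)                          // perfect square: done
            return a - b;
    }
    return 0;                                     // inconclusive; fall back
}
```

For 5959 = 59 * 101 the two factors are close, so the search succeeds after only a few increments of a.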

...without using a probabalistic approach (e.g., Miller-Rabin) for determining primality

With a uniform distribution, 75% of your inputs will need a billion loop iterations, so it's worth spending a million operations on less deterministic techniques first, even if you get an inconclusive answer and have to fall back to trial division.

I've found the Brent variation of Pollard's Rho Method to be very good, though more complicated to code and understand. The best example I've seen is from this forum discussion. The method relies on luck, but helps often enough to be worthwhile.
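A simplified sketch of the idea, using Brent's cycle detection but taking a gcd every step for clarity (the real speedup in Brent's variant comes from batching many steps per gcd; `__uint128_t` is a GCC/Clang extension):

```cpp
#include <cstdint>

// Modular multiplication via a 128-bit intermediate.
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
    return (uint64_t)((__uint128_t)a * b % m);
}
static uint64_t gcd64(uint64_t a, uint64_t b) {
    while (b) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

// Pollard's rho with Brent's cycle detection. Returns a nontrivial
// divisor of composite n, or n itself on a failed run (retry with a
// different constant c). The divisor found may itself be composite.
uint64_t pollard_brent(uint64_t n, uint64_t c) {
    if (n % 2 == 0) return 2;
    uint64_t x = 2, y = 2, d = 1, power = 1, lam = 1;
    while (d == 1) {
        if (power == lam) { x = y; power *= 2; lam = 0; } // teleport tortoise
        y = (mulmod(y, y, n) + c) % n;                    // f(y) = y^2 + c
        ++lam;
        d = gcd64(x > y ? x - y : y - x, n);
    }
    return d;
}
```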

The Miller-Rabin primality test, run with a fixed set of witness bases, is actually deterministic up to about 10^15, which can save you the trouble of a fruitless search.

I tried a few dozen variations and settled on the following for factoring int64 values:

  1. Trial division by small factors. (I use the first 8,000 precomputed primes.)
  2. 10 attempts with Pollard's Rho, each using 16 iterations
  3. Trial division to sqrt(n).

Note that Pollard's Rho finds factors that are not necessarily prime, so recursion can be used to factor those.

xan