ansaurus

Question

Fastest factorial implementation with 64-bit result in assembly.

Answer 1

+5 A:

It's been a while since I flexed my assembly muscles so I'll just offer some general advice.

Since you know in advance exactly how many and what size all the items will be, just make a contiguous array of values (either hard-coded or pre-calculated). After validating the input to the function (< 0 or > 12/20), you can then use a simple offset addressing to retrieve the appropriate value. This will work in O(1) time.

Cogwheel - Matthew Orlando 2010-07-08 19:22:26

Wait ... so, if I wanted to output a 256-bit result, which works for input `0` through `57`, then the approach should be similar, right? Just jump to the offset multiplied by 4.

Hamish Grubijan 2010-07-08 19:27:38

@Hamish: Yes, that's right. Though I'd define a structure 256 bits long and have pointer arithmetic take care of the multiplication for you.

Billy ONeal 2010-07-08 19:31:45

Can assembly even beat C++?

Hamish Grubijan 2010-07-08 19:40:15

You could also use a std::map<int,SomeBigNumberClassOrType> to `c++`-ify the whole thing

rubenvb 2010-07-08 19:41:37

@ruben, I would love to see an implementation. Non-assembly entries will count. Although ... if map is a hash table, then this is not the fastest approach.

Hamish Grubijan 2010-07-08 20:28:29

@Hamish: Re: assembly beating C++, Probably not. Even if you're a bit careless with the C++ code, the compiler will probably emit something extremely similar to what you'd write yourself (if not better).

Cogwheel - Matthew Orlando 2010-07-08 20:29:18

@Hamish: A map is a binary search tree. The constant array is the best idea, although realistically, it's unlikely that inline assembler will be faster than C++. It will probably be slower, since the compiler can't optimize it for each specific call site.

DeadMG 2010-07-08 21:16:56

Answer 2

A:

Who says that your assembly version is going to be any faster than the C++ version anyway. In fact, who says it will even match in speed? I'd bet $100 you never even manage to make it as fast as the compiler does.

Noah Roberts 2010-07-08 19:41:47

Ok ... you could start with a C++ implementation, and then someone will take you up on that $100 bet. I did not down-vote you, btw.

Hamish Grubijan 2010-07-08 20:27:27

Answer 3

A:

On popular demand, performancewise it's fabled to be a binary search, not a hashtable (std C++ doesn't have that I believe).

#include <map>

void main()
{
    std::map<int, BigIntThing> factMap;
    // insert all elements here, probably fancier ways to do this
    factMap.insert( 1 );
    factMap.insert( 1 );
    factMap.insert( 2 );
    // ....
    // to access, say 15!
    BigIntThing factMap[15]; // I think the index is right >_<
}

That's it. A std::map is ordered, so if your BigIntThing has a comparison operator all is good. There should be a way to get this const and/or static and/or global to get it compiled in the way you want.

rubenvb 2010-07-08 20:59:49

C++ has a hashmap since TR1, and there's one in Boost.

DeadMG 2010-07-08 21:17:52

Hash tables and binary searches are overkill here. You can just use a flat array, and index into that.

Thanatos 2010-07-08 23:12:59

Answer 4

+6 A:

I have cleverly built a lookup table in assembly. Just in case you're rusty on your assembly, The routine expects the parameter to be in the ecx register. I verify that it is in range then read the lookup table's values into the eax and edx registers. If the value is out of range, I just xor the eax and edx registers with themselves (this forces them to 0). Unfortunately, since it's an assembly routine the compiler will be unable to inline the code. But, I'm sure the few cycles I saved by writing my awesome assembly routine will more than make up for any gain by inlining.

factorial:
    xorl    %eax, %eax
    xorl    %edx, %edx
    cmpl    $20, %ecx
    ja  .TOOBIG
    movl    CSWTCH.1(,%ecx,8), %eax
    movl    CSWTCH.1+4(,%ecx,8), %edx
.TOOBIG:

LOOKUP_TABLE:
    .section    .rodata
    .align 32
    .type   CSWTCH.1, @object
    .size   CSWTCH.1, 168
CSWTCH.1:
    .long   1
    .long   0
    .long   1
    .long   0
    .long   2
    .long   0
    .long   6
    .long   0
    .long   24
    .long   0
    .long   120
    .long   0
    .long   720
    .long   0
    .long   5040
    .long   0
    .long   40320
    .long   0
    .long   362880
    .long   0
    .long   3628800
    .long   0
    .long   39916800
    .long   0
    .long   479001600
    .long   0
    .long   1932053504
    .long   1
    .long   1278945280
    .long   20
    .long   2004310016
    .long   304
    .long   2004189184
    .long   4871
    .long   -288522240
    .long   82814
    .long   -898433024
    .long   1490668
    .long   109641728
    .long   28322707
    .long   -2102132736
    .long   566454140

The lookup table is difficult to maintain, so I've included the script I used to build it

typedef unsigned long long uint64_t;

template< uint64_t N >
struct meta_factorial
{
   static const uint64_t value = N * meta_factorial<N-1>::value;
};

template<>
struct meta_factorial<0>
{
   static const uint64_t value = 1;
};

extern "C" uint64_t factorial( unsigned n );

uint64_t factorial( unsigned n )
{
   switch( n )
   {
      case 20: return meta_factorial<20>::value;
      case 19: return meta_factorial<19>::value;
      case 18: return meta_factorial<18>::value;
      case 17: return meta_factorial<17>::value;
      case 16: return meta_factorial<16>::value;
      case 15: return meta_factorial<15>::value;
      case 14: return meta_factorial<14>::value;
      case 13: return meta_factorial<13>::value;
      case 12: return meta_factorial<12>::value;
      case 11: return meta_factorial<11>::value;
      case 10: return meta_factorial<10>::value;
      case 9: return meta_factorial<9>::value;
      case 8: return meta_factorial<8>::value;
      case 7: return meta_factorial<7>::value;
      case 6: return meta_factorial<6>::value;
      case 5: return meta_factorial<5>::value;
      case 4: return meta_factorial<4>::value;
      case 3: return meta_factorial<3>::value;
      case 2: return meta_factorial<2>::value;
      case 1: return meta_factorial<1>::value;
      case 0: return meta_factorial<0>::value;
      default: return 0;
   }
}

Just in case you missed it in my poor attempt at humor. The C++ compiler is more than capable at correctly optimizing your code. As you can see I didn't need to do anything fancy with lookup tables, binary search trees, or hashes. Just a simple switch statement and the compiler did the rest.

caspin 2010-07-08 22:22:45

I don't think it's quite obvious that you're using templates to force the compiler to compute the lookup table at compile time.

Gabe 2010-07-08 23:18:25

+1 for the lols

Cogwheel - Matthew Orlando 2010-07-08 23:57:55

@Gabe: the compiler would generate the same code if I placed literal values directly in the switch rather than compute them at compile time. I didn't want to get into the details of meta programming. I figure if Hamish is uncomfortable with them he can just compute the factorials by hand.

caspin 2010-07-09 03:52:02

Caspin: Yes; I said is wasn't obvious because after you posted your code, somebody (http://stackoverflow.com/questions/3207094/fastest-factorial-implementation-with-64-bit-result-in-assembly/3208703#3208703) wrote an answer that says "You could probably use templates to build the array too." Either he didn't see your answer, or he saw it and didn't realize what you were doing.

Gabe 2010-07-09 04:57:50

Your assembler seems to lack a `ret`...

Thanatos 2010-07-09 13:59:34

I stripped the function call stuff from the assembly, it doesn't manipulate the stack either. If I were a bit more ambitious, I would embed the assembly in as `asm()` statement, then `ret` wouldn't make sense. As it is it gets the point across, just use C++ and let the optimizer write the assembly for you.

caspin 2010-07-09 15:41:50

Answer 5

A:

If you're only working with numbers from 0-19, a hash table or binary tree is overkill. Just create an unsigned int[20] and then query the index:

const unsigned int FACTORIALS[20] = {1,1,2,6,24,120,etc..};

unsigned int factorial(unsigned int num) {
    if(num >= 0 && num <= 19) {
        return FACTORIALS[num];
    }
    else {
        throw // some sort of exception
    }
}

You could probably use templates to build the array too.

Brendan Long 2010-07-08 23:11:11

Answer 6

A:

gcc's Answer

...which probably beat's yours, was compiled from:

uint64_t answers[] = {
    1ULL,
    1ULL,
    2ULL,
    6ULL,
    24ULL,
    ...
    2432902008176640000ULL,
};

uint64_t factorial(unsigned int i) {
    if(i >= sizeof(answers) / sizeof(*answers))
        return 0;
    else
        return answers[i];
}

...and the assembly...

factorial:
    cmpl    $20, %edi
    movl    $0, %eax
    ja  .L3
    movslq  %edi,%eax
    movq    answers(,%rax,8), %rax
.L3:
    rep
    ret
answers:
    .quad 1
    .quad 1
    ...

...which seems to be the first 64-bit assembler answer up...

Thanatos 2010-07-08 23:12:11

Why, exactly, do you think that 3 is a better error result than 0?

Gabe 2010-07-08 23:22:30

0 is the valid answer of 0!, and thus isn't all that suitable for an error code. 3 is not a valid answer of x! for any x, thus, is available. (1 is 1!, 2 is 2!, 3 is the first such number.)

Thanatos 2010-07-08 23:49:29

Hamish Grubijan 2010-07-09 03:39:29

Wow, good catch, even when I am insisting on being wrong. I apparently have forgotten factorials.

Thanatos 2010-07-09 04:45:49

Edited, hopefully less dain-bramaged now.

Thanatos 2010-07-09 04:47:52

I thought about using 64 bit assembler in my answer, but the 32 bit was much less convoluted. The 32 bit compiler produced a lookup table where the 64 bit compiler produced a jump table.

caspin 2010-07-09 07:19:31

The reason why `0! = 1` is this: `n! = n * (n - 1)!`, so `3! = 3 * 2!`, `2! = 2 * 1!`, and `1! = 1 * 0!`. In other words `0! = 1!/1 = 1`. Now ... they (mathematicians) could have defined base case as: `1! = 1` and claim that `0!` does not exist, but having the value of `0! = 1` also comes very handy in combinatorics, for the reason I once have seen but since forgot. I think it has to do with picking socks out of the box while blindfolded ...

Hamish Grubijan 2010-07-09 14:38:33

ansaurus

tags:

views:

answers:

Fastest factorial implementation with 64-bit result in assembly.

gcc's Answer

related questions