This is not homework, just something I though of. So, straight computing factorial is not exactly fast; memoization can help, but if the result is to fit into 32 or 64 bits, then the factorial only can work for inputs 0
through 12
and 20
respectively. So ... we might as well use a lookup table:
n n!
0 1
1 1
2 2
3 6
4 24
5 120
6 720
7 5040
8 40320
9 362880
10 3628800
11 39916800
12 479001600
13 6227020800 2^32= 4294967296
14 87178291200
15 1.30767E+12
16 2.09228E+13
17 3.55687E+14
18 6.40237E+15
19 1.21645E+17
20 2.4329E+18
2^64= 1.84467E+19
So, suppose I want to have an inline C++ factorial function which uses inline assembly, with a 32 bit or 64 bit unsigned integer expected as a result. If the input is either negative or large enough to cause overflow, the output should be 0. How can this be done in assembly such that it consumes the least amount of cycles? This code will run on a 64-bit Intel/AMD architecture. If feasible, I am interested in improving the worst case scenario, so 20!
should not take a lot longer to compute than 0!
- hopefully there is a binary search approach. Hopefully there is a clever trick for doing if (n == 0 || n == 1) { return 1; }
. Also, if the output needs to be 32 bit, then I think assembly instructions can contain both code and data in them. My assembly knowledge is weak. Please let me know if the question does not make a whole lot of sense.
Being able to use the function in C++ would be nice - makes it a more realistic problem. If, for instance, calling a function is expensive, then trying to save 1-2 clock cycles in the body of the assembly will not help much.