+6  A: 

Without testing, I'm sympathetic to his argument. You can do it in 4 multiplications, compared to sorting, which is O(n log n) in general. Specifically, the optimal sorting network requires 9 comparisons. The evaluator then has to look at every element of the sorted array at least once, which is another 5 operations.
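For illustration, here is a minimal sketch of those 4 multiplications (the rank-to-prime assignment, 2 for a deuce up to 41 for an ace, follows Cactus Kev's published scheme; the names are my own):

```c
/* Map each rank to its prime and multiply: 5 table lookups and
   4 multiplies, no comparisons or branches. The worst case,
   41^5 = 115,856,201, fits comfortably in 32 bits. */
static const unsigned rank_prime[13] = {
    2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41
};

unsigned hand_product(const int rank[5])  /* rank[i] in 0..12 */
{
    return rank_prime[rank[0]] * rank_prime[rank[1]]
         * rank_prime[rank[2]] * rank_prime[rank[3]]
         * rank_prime[rank[4]];
}
```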

Matthew Flaschen
Big O complexity is utterly irrelevant when you're talking about a fixed n=5
Michael Borgwardt
While the question text is specific to 5, the title asks about small arrays in general. Anyway, I put in the exact number of comparisons required for this example.
Matthew Flaschen
Allocating 4 ints on the stack can be done by incrementing a pointer. That hardly seems likely to be the bottleneck.
Matthew Flaschen
+1  A: 

That shouldn't really be relevant, but he is correct. Sorting takes much longer than multiplying.

The real question is what he did with the resulting product of primes, and how that was helpful (since I would expect factoring it to take longer than sorting).

zebediah49
If I remember correctly, he used it as a lookup into a gigantic table.
AShelly
He's mapping each of the 2,598,960 unique five-card poker hands to one of only 7462 distinct values. Kind of like an [equivalence relation](http://en.wikipedia.org/wiki/Equivalence_relation).
Rudiger
Asymptotically, sorting takes much longer than multiplying; but this is pretty much the opposite of an asymptotic situation!
Rudiger
For example, see http://stackoverflow.com/questions/1276716/fast-stable-sort-for-small-arrays-under-32-or-64-elements or http://stackoverflow.com/questions/2748749/fast-algorithm-implementation-to-sort-very-small-set
Rudiger
There are only 134,459 unique hands when you ignore suit. That's small enough to fit into a look-up table. It's even smaller if you just consider the numbers that result in a hand with a pair or better.
Niki Yoshiuchi
+1  A: 

It's hard to think of any sorting operation that could be faster than multiplying the same set of numbers. At the processor level, the multiplication is just load, load, multiply, load, multiply, ..., with maybe some manipulation of the accumulator thrown in. It's linear, easily pipelined, with no comparisons and their associated branch-misprediction costs. It should average about 2 instructions per value to be multiplied. Unless the multiply instruction is painfully slow, it's really hard to imagine a sort being faster.

AShelly
+1 the memory access pattern of a sort (and its mov/load traffic, which will generally be far slower than a multiply) is unlikely to be as fast as just multiplying 5 integers together.
+6  A: 

Of course it depends a lot on the CPU of your computer, but a typical Intel CPU (e.g. Core 2 Duo) can multiply two 32-bit numbers within 3 CPU clock cycles. For a sort algorithm to beat that, it needs to be faster than 3 * 4 = 12 CPU cycles, which is a very tight constraint. None of the standard sorting algorithms can do it in fewer than 12 cycles for sure. The comparison of two numbers alone will take one CPU cycle, the conditional branch on the result will take another one, and whatever you do then will take at least one more (swapping two cards will actually take at least 4 CPU cycles). So multiplying wins.

Of course this does not take into account the latency of fetching the card values from the 1st- or 2nd-level cache, or maybe even from memory; however, that latency applies to both cases, multiplying and sorting.

Mecki
He needs to fetch the card value and do a lookup into a list of 13 primes. Still faster than sorting.
phkahler
Considering that the 13 primes will most likely be in the 1st-level cache after one or two lookups, this will add another 3 clock cycles for the lookup of each card. But you are right, this is still definitely faster.
Mecki
While the point is good, your cycle counting seems to be a bit off. For the multiply case, one of the multiplies can be pipelined, so latency is 9 cycles. Conditional branches will on average take significantly more than 1 cycle due to mispredictions. CMOV's could turn control-flow into data-dependencies, but even accounting for pipelining, best case is 1 cycle for CMP + 5*2 cycles for CMOV's = 11 cycles for the optimal sorting network.
Ants Aasma
Yeah, I left pipelining out, because in reality it's speculative what can really be pipelined (unless you have the raw assembly code in front of you to look at). Regarding branching: a conditional branch is only slow if it mispredicts, and it might just as well predict correctly ;-)
Mecki
+2  A: 

5 elements can be sorted using an optimized decision tree, which is much faster than using a general-purpose sorting algorithm.

However, the fact remains that sorting means lots of branches (as do the comparisons that are necessary afterwards). Branches are really bad for modern pipelined CPU architectures, especially branches that go either way with similar likelihood (thus defeating branch prediction logic). That, much more than the theoretical cost of multiplication vs. comparisons, makes multiplication faster.

But if you could build custom hardware to do the sorting, it might end up faster.
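For concreteness, here is a sketch (my own, not from the answer) of one known optimal 9-comparison network for 5 elements, with each compare-exchange written branch-free; a compiler will typically turn the ternaries into conditional moves, trading the branches discussed above for a few extra register moves:

```c
/* A 9-comparator sorting network for 5 elements. Each
   compare-exchange computes min and max without a branch. */
static void cswap(int *a, int *b)
{
    int lo = (*a < *b) ? *a : *b;  /* usually compiled to CMOV */
    int hi = (*a < *b) ? *b : *a;
    *a = lo;
    *b = hi;
}

void sort5(int v[5])
{
    cswap(&v[0], &v[1]); cswap(&v[3], &v[4]); cswap(&v[2], &v[4]);
    cswap(&v[2], &v[3]); cswap(&v[0], &v[3]); cswap(&v[0], &v[2]);
    cswap(&v[1], &v[4]); cswap(&v[1], &v[3]); cswap(&v[1], &v[2]);
}
```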

Michael Borgwardt
A specialized radix sort would be pretty fast and branchless.
Dolphin
+1  A: 

One thing worth mentioning is that even if your CPU's multiply instruction is dead slow (or nonexistent...), you can use a lookup table to speed things up even further.

Roddy
The 7-card evaluator does exactly this to load a tree representing hand ranks for every single possible combination of 7 cards (about 130,000,000 combinations). It's super fast, but it takes several hundred megabytes of RAM and some load time to initialize.
+4  A: 

Sorting is not intrinsically harder than multiplying numbers. On paper, they're about the same, and you also need a sophisticated multiplication algorithm to make large multiplication competitive with large sort. Moreover, when the proposed multiplication algorithm is feasible, you can also use bucket sort, which is asymptotically faster.

However, a poker hand is not an asymptotic problem. It's just 5 cards, and he only cares about which of the 13 rank values each card has. Even if multiplication is complicated in principle, in practice it is implemented in microcode and it's incredibly fast. What he's doing works.

Now, if you're interested in the theoretical question, there is also a solution using addition rather than multiplication. There can only be 4 cards of any one value, so you could just as well assign the values 1,5,25,...,5^12 and add them. It still fits in 32-bit arithmetic. There are also other addition-based solutions with other mathematical properties. But it really doesn't matter, because microcoded arithmetic is so much faster than anything else that the computer is doing.
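A sketch of that addition-based variant (the names are mine): since at most 4 cards share a rank, each base-5 digit stays below 5, so every multiset of ranks gets a unique sum.

```c
#include <stdint.h>

/* Sum 5^rank over the 5 cards. The maximum, 4 * (5^0 + ... + 5^12)
   = 5^13 - 1 = 1,220,703,124, still fits in 32-bit arithmetic. */
uint32_t hand_sum(const int rank[5])  /* rank[i] in 0..12 */
{
    static const uint32_t pow5[13] = {
        1, 5, 25, 125, 625, 3125, 15625, 78125,
        390625, 1953125, 9765625, 48828125, 244140625
    };
    uint32_t sum = 0;
    for (int i = 0; i < 5; i++)
        sum += pow5[rank[i]];
    return sum;
}
```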

Greg Kuperberg
+1  A: 

> After a lot of thought, I had a brainstorm to use prime numbers. I would assign a prime number value to each of the thirteen card ranks... The beauty of this system is that if you multiply the prime values of the rank of each card in your hand, you get a unique product, regardless of the order of the five cards.

That's an example of a non-positional number system.

I can't find the link to the theory. I studied it as part of applied algebra, somewhere around Euler's totient function and encryption. (I may be wrong about the terminology, as I studied all of that in my native language.)

> What if we keep his representation (cards as 4-byte integers)? Can sorting an array of 5 integers be faster than multiplying them?

RAM is an external resource and is generally slow compared to the CPU. Sorting 5 ints would always have to go to RAM due to swap operations. Add to that the overhead of the sorting function itself, and multiplication stops looking all that bad.

I think that on modern CPUs integer multiplication would pretty much always be faster than sorting, since several multiplications can be executed at the same time on different ALUs, while there is only one bus connecting the CPU to RAM.

> If not, what sort of low-level optimizations can be done to make sorting a small number of elements faster?

5 integers can be sorted quite quickly using bubble sort: qsort would use more memory (for recursion), while a well-optimized bubble sort would work completely from the d-cache.
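A minimal sketch of what is meant (assuming the length is fixed at 5): no recursion, no allocation, and the whole working set stays in registers or L1.

```c
/* Fixed-size bubble sort: at most 10 compare/swap steps total. */
void bubble5(int v[5])
{
    for (int pass = 0; pass < 4; pass++)
        for (int i = 0; i < 4 - pass; i++)
            if (v[i] > v[i + 1]) {
                int t = v[i];
                v[i] = v[i + 1];
                v[i + 1] = t;
            }
}
```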

Dummy00001
"Sorting 5 of ints would always have to go to RAM due to swap operations." - what sort of CPU are you using, that only has 5 32-bit registers? Not to mention, the next step down from registers is L1 cache, not RAM.
Nick Johnson
@Nick, no compiler would try to cache an array of variable length in registers, unless a custom function with a fixed array length is used. And yes, I have slightly brushed over the details of memory organization. Going into cache-line alignment/i$/d$/L1/L2/L3/TLB/branch-prediction/cache-write-through would over-complicate the explanation unnecessarily. The general rule holds: the less the expensive resources are used, the better. And memory is expensive. At very small array lengths, qsort()'s recursion memory overhead simply outweighs the benefits of divide-and-conquer.
Dummy00001
Nobody said you had to use an array and a general sorting algorithm, though. In fact, several people have suggested sorting networks. And as I said, the next level down is cache, not RAM.
Nick Johnson
A: 

As others have pointed out, sorting alone isn't quicker than multiplying for 5 values. This ignores, however, the rest of his solution. After disdaining a 5-element sort, he proceeds to do a binary search over an array of 4888 values - at least 12 comparisons, more than the sort ever required!

Note that I'm not saying there's a better solution that involves sorting - I haven't given it enough thought, personally - just that sorting alone is only part of the problem.

He also didn't have to use primes. If he simply encoded the value of each card in 4 bits, he'd need 20 bits to represent a hand, giving a range of 0 to 2^20 = 1048576, about 1/100th of the range produced using primes, and small enough (though still suffering cache misses) to build a lookup table over.
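A sketch of that packing (my own illustration, not Nick's code): the key is order-sensitive, but since the table covers all 2^20 slots, it can simply be populated for every ordering of each hand.

```c
/* Pack five ranks (0..12) into one nibble each: a 20-bit table key. */
unsigned pack_hand(const int rank[5])
{
    unsigned key = 0;
    for (int i = 0; i < 5; i++)
        key = (key << 4) | (unsigned)rank[i];
    return key;  /* 0 .. 2^20 - 1 */
}
```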

Of course, an even more interesting variant is to take 7 cards, such as are found in games like Texas Holdem, and find the best 5 card hand that can be made from them.

Nick Johnson
Binary search with a fixed number of steps does not require any comparisons. It can be done entirely with arithmetic and bitwise operations which yield the result. If the number of steps (size of the table to search) is not fixed then you need a loop/counter which is one (very predictable) conditional jump per iteration, but otherwise the same arithmetic approach still applies.
R..
@R Say what? You have to compare the value of the current node to the value you're searching for at each step. If you have a way to optimize away that, I'd love to hear it.
Nick Johnson
Well by comparisons I meant conditional jumps based on result of a comparison. If instead you do arithmetic (making use of sign extension) to update the current node at each step, there are no jumps, and it comes out to just several opcodes per step.
R..
The code is something like this (assuming a signed arithmetic right shift): `for (a=0, n=table_len; n>1; n-=n>>1) a += (n>>1) & ~((key - table[a+(n>>1)]) >> 31);` - the sign bit of the subtraction decides whether to advance, with no jump that depends on the data.
R..
Fair enough. The arithmetic, especially with the pipeline stalls caused by the dependencies (each load depends on the outcome of the previous arithmetic), is still a substantial amount of work - comparable to the original sorting network he disdained. Throw in cache misses because of the large lookup table, and it's probably much worse.
Nick Johnson
A: 

The multiplication is faster.

Multiplication of any given array will always be faster than sorting the array, presuming the multiplication produces a meaningful result. And the lookup table is beside the point: the code is designed to evaluate a poker hand, so you'd need to do a lookup on the sorted set anyway.

Chris