views:

296

answers:

3

Many CPUs have single assembly opcodes for returning the high order bits of a 32 bit integer multiplication. Normally multiplying two 32 bit integers produces a 64 bit result, but this is truncated to the low 32 bits if you store it in a 32 bit integer.

For example, on PowerPC, the mulhw opcode returns the high 32 bits of the 64 bit result of a 32x32 bit multiply in one clock. This is exactly what I'm looking for, but more portably. There's a similar opcode, umulhi(), in NVidia CUDA.

In C/C++, is there an efficient way to return the high order bits of the 32x32 multiply? Currently I compute it by casting to 64 bits, something like:

unsigned int umulhi32(unsigned int x, unsigned int y)
{
  unsigned long long xx=x;
  xx*=y;
  return (unsigned int)(xx>>32);
}

but this is over 11 times slower than a regular 32 by 32 multiply because I'm using overkill 64 bit math even for the multiply.

Is there a faster way to compute the high order bits?

This is clearly not best solved with a BigInteger library (which is overkill and will have huge overhead).

SSE seems to have PMULHUW, a 16x16 -> top 16 bit version of this, but not a 32x32 -> top 32 version like I'm looking for.

+2  A: 

I don't think there's a way to do this in standard C/C++ better than what you already have. What I'd do is write up a simple assembly wrapper that returned the result you want.

Not that you're asking about Windows, but as an example even though Windows has an API that sounds like it does what you want (a 32 by 32 bit multiply while obtaining the full 64 bit result), it implements the multiply as a macro that does what you're doing:

#define UInt32x32To64( a, b ) (ULONGLONG)((ULONGLONG)(DWORD)(a) * (DWORD)(b))
Michael Burr
A: 

On 32 bit intel, a multiply affects two registers for the output. That is, the 64 bits are fully available, whether you want it or not. Its just a function of whether the compiler is smart enough to take advantage of it.

Modern compilers do amazing things, so my suggestion is to experiment with optimization flags some more, at least on Intel. You would think that the optimizer might know that the processor produces a 64 bit value from 32 by 32 bits.

That said, at some point I tried to get the compiler to use the modulo as well as the dividend on a division result, but the old Microsoft compiler from 1998 was not smart enough to realize the same instruction produced both results.

Matthias Wandel
+9  A: 

gcc 4.3.2, with -O1 optimisation or higher, translated your function exactly as you showed it to IA32 assembly like this:

umulhi32:
        pushl   %ebp
        movl    %esp, %ebp
        movl    12(%ebp), %eax
        mull    8(%ebp)
        movl    %edx, %eax
        popl    %ebp
        ret

Which is just doing a single 32 bit mull and putting the high 32 bits of the result (from %edx) into the return value.

That's what you wanted, right? Sounds like you just need to turn up the optimisation on your compiler ;) It's possible you could push the compiler in the right direction by eliminating the intermediate variable:

unsigned int umulhi32(unsigned int x, unsigned int y)
{
  return (unsigned int)(((unsigned long long)x * y)>>32);
}
caf
Yes, pretty much any every compiler I've worked with will do this at -O2, if not at -O1.
Stephen Canon