ansaurus

Question

GCC inline assembler, mixing register sizes (x86)

Answer 1

+12 A:

You can use %w0 if I remember right. I just tested it, too. :-)

int
test(int x)
{
    int y;
    asm ("rorw $8, %w0" : "=q" (y) : "0" (x));
    return y;
}

Edit: In response to the OP, yes, you can do the following too:

int
test(int x)
{
    int y;
    asm ("xchg %b0, %h0" : "=Q" (y) : "0" (x));
    return y;
}

At present, the only place I know of where this is documented is gcc/config/i386/i386.md; it is not in any of the standard documentation.
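
For anyone who wants to try both variants side by side, here is a minimal sketch. The function names swap_low_bytes_rorw and swap_low_bytes_xchg are made up for illustration, and the preprocessor guard plus the portable fallback are my own additions (assumptions, not part of the answer above) so that the file still compiles on non-x86 targets.

```c
#include <stdint.h>

/* Hypothetical helper: swap the two low bytes of x, keep the upper 16 bits. */
static inline uint32_t swap_low_bytes_rorw(uint32_t x)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t y;
    /* %w0 prints the 16-bit name of operand 0 (e.g. %ax instead of %eax). */
    __asm__ ("rorw $8, %w0" : "=q" (y) : "0" (x));
    return y;
#else
    /* Assumed portable fallback for other compilers or targets. */
    return (x & 0xffff0000u) | ((x >> 8) & 0xffu) | ((x & 0xffu) << 8);
#endif
}

/* Hypothetical helper: same result, using the byte-register modifiers. */
static inline uint32_t swap_low_bytes_xchg(uint32_t x)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t y;
    /* %b0 and %h0 print the low and second-lowest byte names (e.g. %al, %ah).
       "Q" limits the operand to the a, b, c, d registers, which have an %h form. */
    __asm__ ("xchg %b0, %h0" : "=Q" (y) : "0" (x));
    return y;
#else
    /* Assumed portable fallback for other compilers or targets. */
    return (x & 0xffff0000u) | ((x >> 8) & 0xffu) | ((x & 0xffu) << 8);
#endif
}
```

Both helpers should agree with each other for every input; only the instruction the compiler emits differs.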

Chris Jester-Young 2008-09-23 02:01:58

I tested it as well. Do you know the modifiers for the low and high bytes, too?

Nils Pipenbrinck 2008-09-23 02:06:00

Thanks, I'm glad it helped!

Chris Jester-Young 2008-09-23 11:03:27

Answer 2

A:

So apparently there are tricks to do this... but it may not be so efficient. 32-bit x86 processors are generally slow at manipulating 16-bit data in general purpose registers. You ought to benchmark it if performance is important.

Unless this is (a) performance critical and (b) proves to be much faster, I would save myself some maintenance hassle and just do it in C:

uint32_t y, hi=(x&~0xffff), lo=(x&0xffff);
y = hi + (((lo >> 8) + (lo << 8))&0xffff);

With GCC 4.2 and -O2 this gets optimized down to six instructions...
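
For reference, the two lines above can be wrapped into a complete function roughly like this. It is only a sketch: the name swap_low_bytes_c is hypothetical, and x is assumed to be a 32-bit unsigned value.

```c
#include <stdint.h>

/* Hypothetical wrapper around the plain C version: swap the two low bytes
   of x and leave the upper 16 bits unchanged. */
static inline uint32_t swap_low_bytes_c(uint32_t x)
{
    uint32_t hi = x & ~UINT32_C(0xffff);  /* upper 16 bits, untouched */
    uint32_t lo = x & UINT32_C(0xffff);   /* lower 16 bits, to be swapped */
    return hi + (((lo >> 8) + (lo << 8)) & UINT32_C(0xffff));
}
```

A function like this is also a convenient reference to check an inline-assembly version against before timing the two.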

Dan 2008-09-23 06:20:31

How is 6 instructions supposed to be faster than 1 instruction?! My timing tests (for a billion runs, 5 trials) were: my version = (4.38, 4.48, 5.03, 4.10, 4.18), your version = (5.33, 6.21, 5.62, 5.32, 5.29).

Chris Jester-Young 2008-09-23 11:21:39

So, we're looking at a 20% speed improvement. Isn't that "much faster"?

Chris Jester-Young 2008-09-23 11:23:01

Chris, absolutely right... your version *is* faster, it seems. But not nearly as much as 6-instructions-vs.-1-instruction would lead you to expect, and that's what I was warning about. I didn't actually do the comparison myself, so props to you for testing it!!

Dan 2008-09-23 16:38:02

Answer 3

+1 A:

@Dan,

I need that lower byte swapping primitive for a larger tweak.

I know that 16 bit operations in 32 bit code have been slow and frowned upon, but the code will be surrounded with other 32 bit operations. I hope that the slowness of the 16 bit code will just get lost in the out of order scheduling.

What I want to achieve in the end is a mechanism to do all 24 possible byte permutations of a dword in-place. For this you need only three instructions at most: low-byte swap (e.g. xchg al, ah), bswap and 32 bit rotates.
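
To make the idea concrete, here is a rough sketch in portable C that models the three primitives and counts how many byte orders they can reach from a word with four distinct bytes. All function names are invented for this illustration, and the search only checks reachability; it does not verify how many instructions each permutation needs.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of the low-byte swap (what xchg al, ah does to eax). */
static uint32_t swap_low(uint32_t x)
{
    return (x & 0xffff0000u) | ((x >> 8) & 0xffu) | ((x & 0xffu) << 8);
}

/* Model of bswap: reverse all four bytes. */
static uint32_t byte_reverse(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0xff00u) | ((x << 8) & 0xff0000u) | (x << 24);
}

/* Model of a 32-bit rotate left; n must be 8, 16 or 24 here. */
static uint32_t rotl(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Breadth-first search over everything the three primitives can produce. */
static size_t count_reachable_permutations(void)
{
    uint32_t seen[24];          /* at most 4! byte orders exist */
    size_t count = 0, next = 0;

    seen[count++] = 0x01020304u;
    while (next < count) {
        uint32_t x = seen[next++];
        uint32_t cand[5] = {
            swap_low(x), byte_reverse(x),
            rotl(x, 8), rotl(x, 16), rotl(x, 24)
        };
        for (size_t i = 0; i < 5; i++) {
            size_t j;
            for (j = 0; j < count; j++)
                if (seen[j] == cand[i])
                    break;
            if (j == count)     /* not seen before: record it */
                seen[count++] = cand[i];
        }
    }
    return count;
}
```

The byte rotate and the low-byte swap alone already generate every ordering of four bytes, so the count covers all 24 permutations; bswap only shortens some of the sequences.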

The in-place way does not need any constants (faster code fetch / decode time) and only uses a single register. For x86/32 that may save me up to 6 costly memory accesses (push/pop) on top of the ca. 10 instructions I save for byte shuffling.

First tests have shown that such code can run up to three times faster on my Core 2, but I have to make more measurements on other machines before I can use it.

My secret plan is to integrate this tweak into GCC one day, but that may not ever happen because GCC is such a huge codebase.

Nils Pipenbrinck 2008-09-23 12:01:03

Answer 4

A:

@Nils,

Gotcha. Well if it's a primitive routine that you're going to be reusing over and over, I have no argument with it... the register naming trick that Chris pointed out is a nice one that I'm going to have to remember.

It would be nice if it made it into the standard GCC docs too!

Dan 2008-09-23 16:41:26

@Dan, I checked the GCC documentation twice and then filed a bug report because this info is missing. Who knows, maybe it will make it into the next release.

Nils Pipenbrinck 2008-09-23 16:44:57

I found the bug at http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37621, and it looks like there may be resistance to documenting this feature since it's only meant for internal use. Hrm...

Dan 2008-09-24 17:35:58

Answer 5

+1 A:

While I'm thinking about it ... you should replace the "q" constraint with a capital "Q" constraint in Chris's second solution:

int
test(int x)
{
    int y;
    asm ("xchg %b0, %h0" : "=Q" (y) : "0" (x));
    return y;
}

"q" and "Q" are slightly different in 64-bit mode, where you can get the lowest byte for all of the integer registers (ax, bx, cx, dx, si, di, sp, bp, r8-r15). But you can only get the second-lowest byte (e.g. ah) for the four original 386 registers (ax, bx, cx, dx).

Dan 2008-09-23 17:12:58
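
The "q" versus "Q" distinction from this answer can be illustrated with a small sketch. The helper names are hypothetical, and the guard with its portable fallback is an assumption added so the code builds on non-x86 targets as well.

```c
#include <stdint.h>

/* Hypothetical helper: zero the lowest byte. Only %b0 is used, so "q" is
   enough; in 64-bit mode that allows any integer register. */
static inline uint32_t clear_low_byte(uint32_t x)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t y;
    __asm__ ("movb $0, %b0" : "=q" (y) : "0" (x));
    return y;
#else
    return x & 0xffffff00u;   /* assumed portable fallback */
#endif
}

/* Hypothetical helper: zero the second-lowest byte. %h0 exists only for
   the a, b, c, d registers, so the constraint has to be "Q". */
static inline uint32_t clear_second_byte(uint32_t x)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t y;
    __asm__ ("movb $0, %h0" : "=Q" (y) : "0" (x));
    return y;
#else
    return x & 0xffff00ffu;   /* assumed portable fallback */
#endif
}
```

With "q" on the second helper, a 64-bit build could pick a register such as esi, which has no high-byte name for %h0 to print.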

Yes, good point, thank you! I'll edit my post now. :-)

Chris Jester-Young 2008-09-24 03:56:33
