tags:
views: 648
answers: 3

In gcc, I want to do a 128-bit XOR of 2 C variables via asm code: how?

asm (
    "movdqa %1, %%xmm1;"
    "movdqa %0, %%xmm0;"
    "pxor %%xmm1, %%xmm0;"
    "movdqa %%xmm0, %0;"

    :"=x"(buff)           /* output operand */
    :"x"(bu), "x"(buff)
    :"%xmm0","%xmm1"
    );

But I get a segmentation fault; this is the objdump output:

movq   -0x80(%rbp),%xmm2
movq   -0x88(%rbp),%xmm3
movdqa %xmm2,%xmm1
movdqa %xmm2,%xmm0
pxor   %xmm1,%xmm0
movdqa %xmm0,%xmm2
movq   %xmm2,-0x78(%rbp)

+1  A: 

Umm, why not use the __builtin_ia32_pxor intrinsic?
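A minimal sketch of that route, assuming the 128-bit form of the builtin is named __builtin_ia32_pxor128 and that emmintrin.h exposes the __v2di vector type (both are GCC internals taken from older GCC emmintrin.h; names and argument types may differ between GCC versions):

#include <emmintrin.h>

/* assumption: __builtin_ia32_pxor128 and __v2di are GCC internals from emmintrin.h */
__m128i xor128(__m128i a, __m128i b)
{
    return (__m128i) __builtin_ia32_pxor128((__v2di) a, (__v2di) b);
}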

ephemient
+7  A: 

You would see segfault issues if the variables aren't 16-byte aligned. The CPU can't MOVDQA to/from unaligned memory addresses, and would generate a processor-level "GP exception", prompting the OS to segfault your app.

C variables you declare (stack, global) or allocate on the heap aren't generally aligned to a 16-byte boundary, though occasionally you may get an aligned one by chance. You can direct the compiler to ensure proper alignment by using the __m128 or __m128i data types; each of those declares a properly aligned 128-bit value.
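For example, either of the following gives a 16-byte-aligned 128-bit object under gcc (a minimal sketch; the aligned-attribute form is an alternative not mentioned above):

#include <stdint.h>
#include <emmintrin.h>

__m128i a;                                     /* __m128i is 16-byte aligned by definition */
uint8_t buf[16] __attribute__((aligned(16)));  /* force 16-byte alignment on a plain byte array */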

Further, reading the objdump, it looks like the compiler wrapped the asm sequence with code that copies the operands from the stack into the xmm2 and xmm3 registers using the MOVQ instruction, only to have your asm code then copy the values to xmm0 and xmm1. After xor-ing into xmm0, the wrapper copies the result to xmm2, only to then copy it back to the stack. Overall, not terribly efficient. MOVQ copies 8 bytes at a time and expects (under some circumstances) an 8-byte-aligned address; given an unaligned address, it could fail just like MOVDQA. The wrapper code does add aligned offsets (-0x80, -0x88, and later -0x78) to the BP register, but BP itself may or may not hold an aligned value. Overall, there's no guarantee of alignment in the generated code.

The following ensures the arguments and result are stored in correctly aligned memory locations, and seems to work fine:

#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>

void print128(__m128i value) {
    int64_t *v64 = (int64_t*) &value;
    printf("%.16llx %.16llx\n", v64[1], v64[0]);
}

int main() {
    __m128i a = _mm_setr_epi32(0x00ffff00, 0x00ffff00, 0x00ffff00, 0x10ffff00), /* low dword first! */
            b = _mm_setr_epi32(0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff),
            x;

    asm (
        "movdqa %1, %%xmm0;"      /* xmm0 <- a */
        "movdqa %2, %%xmm1;"      /* xmm1 <- b */
        "pxor %%xmm1, %%xmm0;"    /* xmm0 <- xmm0 xor xmm1 */
        "movdqa %%xmm0, %0;"      /* x <- xmm0 */

        :"=x"(x)          /* output operand, %0 */
        :"x"(a), "x"(b)   /* input operands, %1, %2 */
        :"%xmm0","%xmm1"  /* clobbered registers */
    );

    /* printf the arguments and result as 2 64-bit hex values */
    print128(a);
    print128(b);
    print128(x);
}

compile with (gcc, Ubuntu 32-bit):

gcc -msse2 -o app app.c

output:

10ffff0000ffff00 00ffff0000ffff00
0000ffff0000ffff 0000ffff0000ffff
10ff00ff00ff00ff 00ff00ff00ff00ff

In the code above, _mm_setr_epi32 is used to initialize a and b with 128-bit values, since the compiler may not support 128-bit integer literals.
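An equivalent initialization, assuming your gcc's emmintrin.h provides _mm_set_epi64x (a sketch, not part of the original answer), builds each value from two 64-bit halves, high half first:

__m128i a = _mm_set_epi64x(0x10ffff0000ffff00LL, 0x00ffff0000ffff00LL);  /* high qword, low qword */
__m128i b = _mm_set_epi64x(0x0000ffff0000ffffLL, 0x0000ffff0000ffffLL);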

print128 writes out the hexadecimal representation of a 128-bit integer, as printf may not be able to do so directly.


The following is shorter and avoids some of the duplicate copying. The compiler adds its hidden wrapping movdqa's to make pxor %2, %0 work without you having to load the registers yourself:

#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>

void print128(__m128i value) {
    int64_t *px = (int64_t*) &value;
    printf("%.16llx %.16llx\n", px[1], px[0]);
}

int main() {
    __m128i a = _mm_setr_epi32(0x00ffff00, 0x00ffff00, 0x00ffff00, 0x10ffff00),
            b = _mm_setr_epi32(0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff);

    asm (
        "pxor %2, %0;"    /* a <- b xor a */

        :"=x"(a)          /* output operand, %0 */
        :"0"(a), "x"(b)   /* input operands: %1 (forced into the same register as %0) and %2 */
        );

    print128(a);
}

compile as before:

gcc -msse2 -o app app.c

output:

10ff00ff00ff00ff 00ff00ff00ff00ff

Alternatively, if you'd like to avoid inline assembly, you could use the SSE intrinsics instead (see Intel's intrinsics reference PDF). These are inline functions/macros that wrap MMX/SSE instructions in a C-like syntax; _mm_xor_si128 reduces your task to a single call:

#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>

void print128(__m128i value) {
    int64_t *v64 = (int64_t*) &value;
    printf("%.16llx %.16llx\n", v64[1], v64[0]);
}

int main()
{
    __m128i x = _mm_xor_si128(
        _mm_setr_epi32(0x00ffff00, 0x00ffff00, 0x00ffff00, 0x10ffff00), /* low dword first! */
        _mm_setr_epi32(0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff));

    print128(x);
}

compile:

gcc -msse2 -o app app.c

output:

10ff00ff00ff00ff 00ff00ff00ff00ff
Oren Trutner
Roberto, to answer your question about how long it has to be: to xor aligned 128-bit values you already have, just call _mm_xor_si128. It's in emmintrin.h and is documented in the Intel PDF mentioned in the answer. In assembly, it's a single opcode (pxor) that works on the 128-bit xmm registers. It starts adding value if the rest of your code also deals with 128-bit values; otherwise you'd just be pushing values to/from the xmm registers. For 64-bit values and smaller, I assume you know you can just use the C xor operator, ^.
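For reference, a minimal sketch of that last point, assuming the 128-bit value is simply stored as two uint64_t halves (a hypothetical plain-C representation, not something from the question):

#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;   /* hypothetical two-halves representation */

u128 xor128_c(u128 a, u128 b)
{
    u128 r = { a.lo ^ b.lo, a.hi ^ b.hi };  /* two 64-bit xors with the plain C ^ operator */
    return r;
}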
Oren Trutner
Also, reading the objdump, it looks like the generated code loads the values into xmm2/xmm3 only to then copy them into xmm0/xmm1, do the xor, and then copy the result back to xmm2 before copying it to memory. Overall, not very efficient. I would suggest (1) turning on optimizations, and (2) considering writing the containing loop in asm as well, to avoid the redundant copies.
Oren Trutner
A: 

Thanks, it works, even if I don't yet know or understand all these C instructions. But viewing the objdump output, I notice the long way taken to perform my simple pxor. In fact it works in 16-bit mode before and after my four 128-bit instructions. Is there no way to shorten it? Thanks.

roberto15