ansaurus

Question

What's the most efficient way to multiply 4 floats by 4 floats using SSE?

Answer 1

+1 A:

Does GCC support the __m128 data type? If so, that's your best bet for guaranteeing a 16-byte-aligned data type. Either way, there is __attribute__((aligned(16))) for aligning things. Define your arrays as follows:

float a[4] __attribute__((aligned(16))) = { 10, 20, 30, 40 };
float b[4] __attribute__((aligned(16))) = { 0.1, 0.1, 0.1, 0.1 };

and then use movaps instead :)
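For what it's worth, here is a minimal sketch of the intrinsics route, assuming GCC (or a compatible compiler) on x86 with SSE enabled (the default on x86-64, -msse on 32-bit). The helper name mul4 is made up for illustration; _mm_load_ps, _mm_mul_ps and _mm_store_ps come from <xmmintrin.h>. The aligned load and store intrinsics need 16-byte-aligned pointers, which is why the arrays above are declared with aligned(16).

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps, _mm_mul_ps, _mm_store_ps */

/* Hypothetical helper (not from the thread): out[i] = a[i] * b[i] for i = 0..3.
   _mm_load_ps and _mm_store_ps are the aligned forms, so all three
   pointers must sit on a 16-byte boundary. */
void mul4(const float *a, const float *b, float *out)
{
    assert(((uintptr_t)a   % 16) == 0);
    assert(((uintptr_t)b   % 16) == 0);
    assert(((uintptr_t)out % 16) == 0);

    __m128 va = _mm_load_ps(a);             /* load 4 floats from a */
    __m128 vb = _mm_load_ps(b);             /* load 4 floats from b */
    _mm_store_ps(out, _mm_mul_ps(va, vb));  /* 4 multiplies at once, then store */
}
```

With the arrays above, mul4(a, b, a) would multiply a by b in place; both inputs are loaded into registers before the store, so using the same array as input and output is fine.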

Goz 2009-08-04 12:38:58

wow those "__"s really screw up the formatting. Anyone know how to fix that?

Goz 2009-08-04 12:40:36

thanks; but as stated in this article http://stackoverflow.com/questions/841433/gcc-attributealignedx-explanation it seems impossible to align arrays that are allocated on the stack? (as opposed to global arrays allocated in .data)

banister 2009-08-04 12:44:11

@Goz, yes - use inline code blocks (backticks)

Dominic Rodger 2009-08-04 12:48:35

thanks for the fix Bastien :) Banister ... can you give it a try and see what happens? If the linked explanation is right then it would be impossible to align things like double correctly, yet they DO get aligned.

Goz 2009-08-04 12:55:10

yes i will soon...I have a feeling the linked explanation is wrong, as everyone in this question seems to imply. thanks everyone! :)

banister 2009-08-04 12:58:16

Thanks Dominic :)

Goz 2009-08-04 13:11:59

@Goz, no problem! Bit bemused by @Bastien's edit, but never mind.

Dominic Rodger 2009-08-04 14:22:34

Answer 2

+1 A:

if i WAS to align the arrays on 16 byte boundaries, would it even work? Since the arrays are allocated on the stack is it true that aligning them is near impossible?

It is required that alignment on the stack works. Otherwise intrinsics would not work. I would guess the post you quoted had to do with the exorbitant value he selected for the alignment value.

to 2:

No, there shouldn't be a difference in performance. See this site for the instruction timings of several processors.

How alignment of stack variables works:

push ebp
mov ebp, esp
and esp, -16    ; fffffff0H
sub esp, 200    ; 000000c8H

The `and` aligns the start of the stack frame to a 16-byte boundary.
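One way to convince yourself, as a rough sketch assuming GCC or Clang on x86: declare an aligned array as a local variable and check its address. The function name is hypothetical, and the odd-sized volatile padding is only there to nudge the stack layout.

```c
#include <stdint.h>

/* Hypothetical check (not from the thread): returns 1 if a local array
   declared with aligned(16) really starts on a 16-byte boundary. */
int stack_array_is_aligned(void)
{
    volatile char pad[3] = { 1, 2, 3 };  /* odd-sized local to perturb the layout */
    float a[4] __attribute__((aligned(16))) = { 10, 20, 30, 40 };

    (void)pad;
    return ((uintptr_t)a % 16) == 0;     /* low four address bits must be zero */
}
```

If the compiler honours the attribute for stack variables, which is what the prologue above is for, the function returns 1.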

Christopher 2009-08-04 12:44:25

Answer 3

+1 A:

(1) if i WAS to align the arrays on 16 byte boundaries, would it even work? Since the arrays are allocated on the stack is it true that aligning them is near impossible?

No, it's quite simple to align the stack pointer using and:

and esp, 0xFFFFFFF0 ; aligned on a 16-byte boundary

But you should use what GCC provides, such as a 16-byte type, or __attribute__ to customize alignment.

Bastien Léonard 2009-08-04 12:45:16

thanks for your answer, would you be able to explain to me how you can use 'and' for alignment? i dont quite 'get' it :)

banister 2009-08-05 16:33:48

Recall that `some_bit and 0 = 0`, and that `a/16 = a>>4` if `a` is unsigned. Using `and` like this sets the four least significant bits to zero and leaves the others unchanged. What actually happens if you divide `esp` by 16? It gets right-shifted by 4, and the four “lost” bits are the remainder. So those four bits must be 0 for `esp` to be divisible by 16. In effect the `and` subtracts *at most* 15 from `esp`, so that `esp % 16 == 0`. (Subtracting from `esp` means allocating more space on the stack.)
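The same masking trick can be tried out in C. This is only an illustration with a made-up helper name, applied to a plain integer rather than the real stack pointer:

```c
#include <stdint.h>

/* Hypothetical helper: clear the four low bits, like `and esp, 0xFFFFFFF0`.
   The result is the largest multiple of 16 that is <= p. */
uintptr_t align_down_16(uintptr_t p)
{
    return p & ~(uintptr_t)15;
}
```

For example, align_down_16(0x1237) gives 0x1230: the low four bits (7, the remainder of dividing by 16) are dropped, so 7 is subtracted. A value that is already a multiple of 16 comes back unchanged.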

Bastien Léonard 2009-08-05 16:56:02

Answer 4

+5 A:

Write it in C, use

gcc -S -mssse3

if you have a fairly recent version of gcc.

xcramps 2009-08-04 12:56:35

what C code would compile to those sse instructions? do you have an example?

banister 2009-08-04 13:01:12

float a[4] = { 10, 20, 30, 40 };
float b[4] = { 0.1, 0.1, 0.1, 0.1 };
void foo(void)
{
    int i;
    for (i = 0; i < 4; i++)
        a[i] *= b[i];
}

Compile as shown and examine the .s file.

xcramps 2009-08-04 13:10:47

interesting, thanks!

banister 2009-08-04 13:12:30
