ansaurus

Question

Answer 1

+2 A:

This is not an outright answer to your question, but a pointer in the right direction. The good people in this forum will love to help you to thrash this question to death. In fact, the code timing macros found right here could be just the thing you need. Tell them I said "hi".

slugster 2009-11-14 09:02:42

Answer 2

+3 A:

When you enable optimizations, the non-SSE code is eliminated completely while the SSE code remains, so that case is trivial. The more interesting case is with optimizations turned off: the SSE code is still slower, even though the surrounding loop code is the same.

Non-SSE code of the innermost loop's body:

movl $0x3dcccccd, %eax
movl %eax, -80(%rbp)
movl $0x3dcccccd, %eax
movl %eax, -76(%rbp)
movl $0x3dcccccd, %eax
movl %eax, -72(%rbp)
movl $0x3dcccccd, %eax
movl %eax, -68(%rbp)
movss -80(%rbp), %xmm1
movss -48(%rbp), %xmm0
mulss %xmm1, %xmm0
movss %xmm0, -80(%rbp)
movss -76(%rbp), %xmm1
movss -44(%rbp), %xmm0
mulss %xmm1, %xmm0
movss %xmm0, -76(%rbp)
movss -72(%rbp), %xmm1
movss -40(%rbp), %xmm0
mulss %xmm1, %xmm0
movss %xmm0, -72(%rbp)
movss -68(%rbp), %xmm1
movss -36(%rbp), %xmm0
mulss %xmm1, %xmm0
movss %xmm0, -68(%rbp)

SSE code of the innermost loop's body:

movl $0x3dcccccd, %eax
movl %eax, -64(%rbp)
movl $0x3dcccccd, %eax
movl %eax, -60(%rbp)
movl $0x3dcccccd, %eax
movl %eax, -56(%rbp)
movl $0x3dcccccd, %eax
movl %eax, -52(%rbp)
leaq -48(%rbp), %rax
leaq -64(%rbp), %rdx
movaps (%rax), %xmm0
mulps (%rdx), %xmm0
movaps %xmm0, (%rdx)

I'm not sure about this, but here's my guess:

As you can see, the compiler stores the four float values with four separate 32-bit stores. They are then read back by a single 16-byte load (the memory operand of mulps). This causes a store-forwarding stall, which is costly when it happens; you can look it up in the Intel manuals. The stall doesn't occur in the scalar version, and that accounts for the performance difference.

To make it faster, you need to make sure this stall doesn't occur. If you are multiplying by a constant array of 4 floats, declare it const and store the results in another aligned array. That way the compiler hopefully won't emit those unnecessary 4-byte movs before the load. Or, if you need to fill the array, do it with a single 16-byte store. If you can't avoid the 4-byte movs, put some other work between the stores and the load (for example, an unrelated calculation).
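A minimal sketch of that advice in C with SSE intrinsics, assuming 16-byte-aligned buffers and a length that is a multiple of 4. The function name scale_floats_sse and its parameters are made up for illustration and are not from the original question. The factor is built once with _mm_set1_ps, and every load and store in the loop is a full 16 bytes, so no 16-byte load has to read data that was just written by four separate 4-byte stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, _mm_mul_ps, ... */

/* Hypothetical helper: dst[i] = src[i] * scale for i in [0, n).
 * Assumptions: n is a multiple of 4, dst and src are 16-byte aligned. */
void scale_floats_sse(float *dst, const float *src, float scale, size_t n)
{
    assert(n % 4 == 0);
    assert((uintptr_t)dst % 16 == 0);
    assert((uintptr_t)src % 16 == 0);

    /* Broadcast the factor into all four lanes once, outside the loop. */
    const __m128 factor = _mm_set1_ps(scale);

    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(src + i);               /* one 16-byte aligned load  */
        _mm_store_ps(dst + i, _mm_mul_ps(v, factor));  /* one 16-byte aligned store */
    }
}
```

A caller would declare the buffers with 16-byte alignment (for example `_Alignas(16) float pixels[8];` in C11) and pass them in. When timing this, build with optimizations on: in an unoptimized build the compiler may still spill the intrinsic values to the stack, which makes the measurement say more about the spills than about SSE.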

ypsu 2009-11-14 15:39:26

Thanks for your answer. :) However, I _really_ need a very fast 4-float multiplier using SSE for image processing. If the method I am using is somehow flawed, can you suggest another way to do it that uses the power of SSE but does not result in such slowdowns? I read that SSE was actually designed for the kind of image processing I have in mind, so there surely _must_ be a way to do what I want. (I need the fast 4-float multiplier for operations like alpha blending.)

banister 2009-11-14 16:22:15

I've updated the post to include the answer.

ypsu 2009-11-14 17:08:40

Thanks, I'll look up 'store forwarding stall' and try to wrap my head around what's going on here.

banister 2009-11-14 17:32:02

Here you go: http://www.sesp.cse.clrc.ac.uk/html/SoftwareTools/vtune/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/pentium4_hh/lips/lipspro_mem_stall.htm

ypsu 2009-11-14 17:33:44


Benchmarking SSE instructions
