How much speed-up from converting 3D maths to SSE or other SIMD?

These days all the good compilers for x86 generate SSE instructions for single-precision (SP) and double-precision (DP) float math by default. It's nearly always faster to use these instructions than the native x87 ones, even for scalar operations, so long as you schedule them correctly. This will come as a surprise to many who in the past found SSE to be "slow" and thought compilers could not generate fast SSE scalar instructions. But now you have to use a switch to turn off SSE generation and use x87. Note that x87 is effectively deprecated at this point and may be removed from future processors entirely. The one downside is that we may lose the ability to do 80-bit extended-precision float math in registers. But the consensus seems to be that if you depend on 80-bit rather than 64-bit DP floats for precision, you should look for an algorithm that is more tolerant of precision loss.

Everything above came as a complete surprise to me. It's very counterintuitive. But data talks.

For some very rough numbers: I've heard some people on ompf.org claim 10x speedups for some hand-optimized ray tracing routines. I've also had some good speedups. I estimate I got somewhere between 2x and 6x on my routines depending on the problem, and many of these still had a couple of unnecessary stores and loads. If you have a huge amount of branching in your code, forget about it, but for problems that are naturally data-parallel you can do quite well.

However, I should add that your algorithms should be designed for data-parallel execution. This means that if you have a generic math library, as you've mentioned, it should take packed vectors rather than individual vectors, or you'll just be wasting your time.

E.g. something like:

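#include <xmmintrin.h>  // SSE intrinsics, defines __m128

namespace SIMD {
// Four 4D vectors stored component-wise, one SSE register per component.
class PackedVec4d
{
  __m128 x;  // x components of the four vectors
  __m128 y;  // y components
  __m128 z;  // z components
  __m128 w;  // w components

  //...
};
}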
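As a rough sketch of what working on packed data can look like, the code below computes four dot products per iteration from component-wise arrays. The names `PackedVec3`, `Vec3Array`, `load4` and `dot_arrays` are made up for illustration and do not come from any particular library, and the sketch assumes the element count is a multiple of four.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical packed type: four 3D vectors, one register per component.
struct PackedVec3 {
    __m128 x;
    __m128 y;
    __m128 z;
};

// Hypothetical structure-of-arrays view: separate x, y and z arrays.
struct Vec3Array {
    const float* x;
    const float* y;
    const float* z;
};

// Load vectors i..i+3 (unaligned loads, so no alignment requirement).
inline PackedVec3 load4(const Vec3Array& v, std::size_t i) {
    PackedVec3 p;
    p.x = _mm_loadu_ps(v.x + i);
    p.y = _mm_loadu_ps(v.y + i);
    p.z = _mm_loadu_ps(v.z + i);
    return p;
}

// Four dot products at once: lane k holds dot(a_k, b_k).
inline __m128 dot(const PackedVec3& a, const PackedVec3& b) {
    __m128 xx = _mm_mul_ps(a.x, b.x);
    __m128 yy = _mm_mul_ps(a.y, b.y);
    __m128 zz = _mm_mul_ps(a.z, b.z);
    return _mm_add_ps(_mm_add_ps(xx, yy), zz);
}

// out[i] = dot(a[i], b[i]) for i in [0, count).
inline void dot_arrays(const Vec3Array& a, const Vec3Array& b,
                       float* out, std::size_t count) {
    // Assumption of this sketch: no scalar tail loop, whole groups of four only.
    assert(count % 4 == 0);
    for (std::size_t i = 0; i < count; i += 4) {
        _mm_storeu_ps(out + i, dot(load4(a, i), load4(b, i)));
    }
}
```

Note that no shuffles or horizontal adds are needed here, which is the usual argument for the packed layout over one `__m128` per vector.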

Most problems where performance matters can be parallelized since you'll most likely be working with a large dataset. Your problem sounds like a case of premature optimization to me.

Could someone please comment what is wrong with the answer? I will be glad to improve it if possible.

Suma 2008-09-22 15:09:31

The answer is wrong because it's not helpful and it contradicts the experience of other people. There's a reason these instructions were introduced and why people build libraries using them.

ΤΖΩΤΖΙΟΥ 2008-10-06 12:25:16

Also, note that the article you link to does not "debunk" the "myth" for all 3D operations, just for the specific struct { float x, y, z, w; }

ΤΖΩΤΖΙΟΥ 2008-10-06 12:30:26

Yes, there's a reason, but the reason is not 3D math; the reason is data-parallel processing. Effective use of SIMD instructions requires large enough homogeneous data structures. You are likely to see some gain on matrix multiplication (even on a 4x4 matrix), but not on 3D/4D vector operations.

Suma 2008-11-12 20:18:45
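For readers wondering what the 4x4 matrix case mentioned above looks like in SSE, here is a minimal sketch. It assumes a hypothetical column-major `Mat4` type (the names `Mat4`, `transform` and `multiply` are illustrative only) and makes no claim about how much faster, if at all, it runs on a given machine.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical column-major 4x4 matrix: col[j] holds column j as (x, y, z, w).
struct Mat4 {
    __m128 col[4];
};

// Matrix * vector: broadcast each component of v and accumulate
// the correspondingly scaled columns.
inline __m128 transform(const Mat4& m, __m128 v) {
    __m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
    __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
    __m128 w = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));
    return _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(m.col[0], x), _mm_mul_ps(m.col[1], y)),
        _mm_add_ps(_mm_mul_ps(m.col[2], z), _mm_mul_ps(m.col[3], w)));
}

// Matrix * matrix: transform each column of b by a.
inline Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r;
    for (int j = 0; j < 4; ++j) {
        r.col[j] = transform(a, b.col[j]);
    }
    return r;
}
```

The matrix-matrix product is just four matrix-vector products, so all four lanes do useful work in every instruction.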

The article draws what looks like a general conclusion from a specific case and thus ends up misleading. The memory "issues", for instance, are often irrelevant in practical implementations, as you'll work on individual vectors taken from arrays rather than allocated in isolation. The load/store overhead, a given when using the regular FPU, can be alleviated with SIMD, as more values can be held in SIMD registers simultaneously. Some common 3D/4D operations (such as normalizations or divisions) have more efficient implementations in the SIMD set than what the FPU offers, etc.

Eric Grange 2009-07-21 20:22:40

The article's conclusion agrees with my practical experience. I have converted our general purpose 3D math library into SIMD, profiling and timing carefully during the process, and what I have observed is almost exactly what is described in that article. It is hard for me to argue what is typical and what is not, but in my experience working with isolated vectors (spread across miscellaneous data structures) is a lot more common than working with "vectors taken from arrays", which is what I meant by "large homogeneous data".

Suma 2009-07-22 10:43:46

Ignoring how ridiculously lame it is to ask a question and answer it in this fashion seconds later, this is an awful answer, and that is why it is voted down. Why will you only see a small speedup? Can you provide numbers? Platforms tested on? This answer is even more vague than the original question, which was vague to begin with.

unforgiven3 2009-09-09 19:25:30

I am toying with a raytracer in my spare time, and I get a significant speedup from using SSE SIMD instructions even without ray packets. Just because the "ubiquitous SSE vector" isn't the best doesn't mean that it isn't better.

Tom 2009-12-04 03:16:29

Interesting. How do you use SSE SIMD instructions? As a replacement for general 3D/4D arithmetics?

Suma 2009-12-04 10:17:18

@Suma: I use `__m128` as a replacement for most uses of `float[3]`. In particular, a ray-box intersection can be done with parallel operations on `__m128 ray_origin, ray_direction, box_min, box_max`. 3 floats for 3 dimensions, and the 4th float is used for ray bounds, initially `0` and `std::numeric_limits<float>::max()`. The only non-SSE-esque behavior is a horizontal min and max (the max "entry time" should be less than the min "exit time" across the 4 dimensions).

Tom 2009-12-12 02:00:31
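A minimal sketch of the kind of ray-box test described in Tom's comment above. The lane encoding here is an assumption, not necessarily what Tom's code does: the 4th lane of `ray_origin` is 0, of `ray_direction` is 1, of `box_min` is the lower ray bound and of `box_max` is the upper ray bound, so the bounds fall out of the same arithmetic as the three slabs. The helpers `lanes`, `hmax`, `hmin` and `ray_hits_box` are hypothetical names. Direction components that are exactly zero produce infinities, and a NaN if the origin lies exactly on a slab plane, which this sketch does not handle.

```cpp
#include <limits>
#include <xmmintrin.h>  // SSE intrinsics

// Build a register with lanes (x, y, z, w); _mm_set_ps takes them high to low.
inline __m128 lanes(float x, float y, float z, float w) {
    return _mm_set_ps(w, z, y, x);
}

// Horizontal max across the four lanes.
inline float hmax(__m128 v) {
    __m128 t = _mm_max_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)));
    t = _mm_max_ps(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 0, 3, 2)));
    return _mm_cvtss_f32(t);
}

// Horizontal min across the four lanes.
inline float hmin(__m128 v) {
    __m128 t = _mm_min_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)));
    t = _mm_min_ps(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 0, 3, 2)));
    return _mm_cvtss_f32(t);
}

// Slab test. Lanes 0..2 are x, y, z; lane 3 carries the ray bounds
// (assumed encoding: origin.w = 0, direction.w = 1,
//  box_min.w = lower bound, box_max.w = upper bound).
inline bool ray_hits_box(__m128 ray_origin, __m128 ray_direction,
                         __m128 box_min, __m128 box_max) {
    __m128 inv_dir = _mm_div_ps(_mm_set1_ps(1.0f), ray_direction);
    __m128 t1 = _mm_mul_ps(_mm_sub_ps(box_min, ray_origin), inv_dir);
    __m128 t2 = _mm_mul_ps(_mm_sub_ps(box_max, ray_origin), inv_dir);
    __m128 t_enter = _mm_min_ps(t1, t2);  // per-slab entry times
    __m128 t_exit = _mm_max_ps(t1, t2);   // per-slab exit times
    // Hit if the latest entry is not after the earliest exit.
    return hmax(t_enter) <= hmin(t_exit);
}
```

For example, with `box_min = lanes(-1, -1, -1, 0)` and `box_max = lanes(1, 1, 1, std::numeric_limits<float>::max())`, a ray from `lanes(-5, 0, 0, 0)` along `lanes(1, 0.125f, 0.125f, 1)` is reported as a hit, while the same ray started at `lanes(5, 0, 0, 0)` is rejected because the box lies behind it.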

I too recently began writing a raytracer, and I get a massive speedup (easily more than 2x _overall_) when I allow MSVC's optimizer to use SSE2 instructions. I didn't do anything special to help it optimize (it was even my first time enabling the option). I get a massive speedup whether I use single or double precision.

guesser 2010-05-11 20:51:14

I doubt that x87 will be removed entirely. There's too much software out there that relies on it.

Nathan Fellman 2009-09-09 13:16:28

@Nathan: Agreed, +1. It will likely just be stripped down to be slow as molasses, at least relative to the total transistor count of the chip, like it was in the P4.

dsimcha 2010-03-02 21:59:36

I think your answer just confirms the point I have tried to make with my answer: if you want a speedup, do not convert your general 3D math; rather, make your whole computation SIMD-friendly. Converting your 3D math will not help much, if at all. Still, it seems most other posters disagree. Some superstitions seem to have deep roots.

Suma 2009-09-09 13:54:46
