I am using 3D maths extensively in my application. How much of a speed-up can I achieve by converting my vector/matrix library to SSE, AltiVec, or similar SIMD code?

A: 

Most likely you will see only a very small speedup, if any, and the process will be more complicated than expected. For more details, see the article The Ubiquitous SSE vector class by Fabian Giesen.

Suma
Could someone please comment on what is wrong with the answer? I will be glad to improve it if possible.
Suma
The answer is unhelpful, and it is wrong according to the experience of other people. There's a reason these instructions were introduced, and a reason why people build libraries using them.
ΤΖΩΤΖΙΟΥ
Also, note that the article you link to does not "debunk" the "myth" for all 3D operations, just for the specific `struct { float x, y, z, w; }` layout.
ΤΖΩΤΖΙΟΥ
Yes, there's a reason, but the reason is not 3D math; the reason is data-parallel processing. Effective use of SIMD instructions requires large enough homogeneous data structures. You are likely to see some gain on matrix multiplication (even on a 4x4 matrix; a sketch follows below this comment), but not on 3D/4D vector operations.
Suma
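A minimal sketch of the 4x4 multiply Suma means, where each SSE multiply and add operates on four elements at once (the row-major float[16] layout, unaligned loads, and names are assumptions for illustration, not Suma's code):

#include <xmmintrin.h>

// out = a * b for row-major 4x4 matrices: each result row is a sum of
// the rows of b scaled by the broadcast elements of the matching row of a.
void Mat4Mul(const float* a, const float* b, float* out)
{
    __m128 b0 = _mm_loadu_ps(b + 0);
    __m128 b1 = _mm_loadu_ps(b + 4);
    __m128 b2 = _mm_loadu_ps(b + 8);
    __m128 b3 = _mm_loadu_ps(b + 12);

    for (int i = 0; i < 4; ++i)
    {
        __m128 r = _mm_mul_ps(_mm_set1_ps(a[4*i + 0]), b0);
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[4*i + 1]), b1));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[4*i + 2]), b2));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[4*i + 3]), b3));
        _mm_storeu_ps(out + 4*i, r);
    }
}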
The article draws what looks like a general conclusion from a specific case and thus ends up misleading. The memory "issues", for instance, are often irrelevant in practical implementations, as you'll work on individual vectors taken from arrays rather than allocated in isolation. The load/store overhead, a given when using the regular FPU, can be alleviated with SIMD since more values can be held in SIMD registers simultaneously. And some common 3D/4D operations, such as normalizations or divisions, have more efficient implementations in the SIMD set than what the FPU offers.
Eric Grange
The article's conclusion agrees with my practical experience. I have converted our general-purpose 3D math library to SIMD, profiling and timing carefully during the process, and what I observed is almost exactly what is described in that article. It is hard for me to argue what is typical and what is not, but in my experience working with isolated vectors (spread across miscellaneous data structures) is a lot more common than working with "vectors taken from arrays", which is what I meant by "large homogeneous data".
Suma
Ignoring how ridiculously lame it is to ask a question and answer it in this fashion seconds later, this is an awful answer, and that is why it is voted down. Why will you only see a small speedup? Can you provide numbers? Platforms tested on? This answer is even more vague than the original question, which was vague to begin with.
unforgiven3
I am toying with a raytracer in my spare time, and I get a significant speedup from using SSE SIMD instructions even without ray packets. Just because the "ubiquitous SSE vector" isn't the best doesn't mean that it isn't better.
Tom
Interesting. How do you use SSE SIMD instructions? As a replacement for general 3D/4D arithmetic?
Suma
@Suma: I use `__m128` as a replacement for most uses of `float[3]`. In particular, a ray-box intersection can be done with parallel operations on `__m128 ray_origin, ray_direction, box_min, box_max`. 3 floats for 3 dimensions, and the 4th float is used for the ray bounds, initially `0` and `std::numeric_limits<float>::max()`. The only non-SSE-esque behavior is a horizontal min and max (the max "entry time" should be less than the min "exit time" across the 4 dimensions). A sketch of this follows below.
Tom
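A sketch of the slab test Tom describes, assuming the reciprocal of the ray direction is precomputed (the names and that precompute are assumptions; NaN corner cases from zero direction components are ignored here):

#include <xmmintrin.h>

// Lane layout per Tom's comment: lanes 0-2 hold x/y/z, lane 3 carries the
// ray bounds: ray_origin.w == 0, inv_dir.w == 1, box_min.w == tnear,
// box_max.w == tfar, so the bounds ride along through the horizontal step.
bool RayBoxHit(__m128 ray_origin, __m128 inv_dir, __m128 box_min, __m128 box_max)
{
    __m128 t0 = _mm_mul_ps(_mm_sub_ps(box_min, ray_origin), inv_dir);
    __m128 t1 = _mm_mul_ps(_mm_sub_ps(box_max, ray_origin), inv_dir);
    __m128 tmin = _mm_min_ps(t0, t1); // per-slab entry times (tnear in w)
    __m128 tmax = _mm_max_ps(t0, t1); // per-slab exit times (tfar in w)

    // The horizontal min/max Tom mentions: SSE1 has no horizontal ops,
    // so reduce across the four lanes with two shuffles each.
    __m128 t_enter = _mm_max_ps(tmin, _mm_shuffle_ps(tmin, tmin, _MM_SHUFFLE(2, 3, 0, 1)));
    t_enter = _mm_max_ps(t_enter, _mm_shuffle_ps(t_enter, t_enter, _MM_SHUFFLE(1, 0, 3, 2)));
    __m128 t_exit = _mm_min_ps(tmax, _mm_shuffle_ps(tmax, tmax, _MM_SHUFFLE(2, 3, 0, 1)));
    t_exit = _mm_min_ps(t_exit, _mm_shuffle_ps(t_exit, t_exit, _MM_SHUFFLE(1, 0, 3, 2)));

    return _mm_comile_ss(t_enter, t_exit) != 0; // hit if latest entry <= earliest exit
}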
I too recently began writing a raytracer, and I get a massive speedup (easily more than 2x _overall_) when I allow MSVC's optimizer to use SSE2 instructions. I didn't do anything special to help it optimize (it was even my first time enabling the option). I get a massive speedup whether I use single or double precision.
guesser
+1  A: 

That's not the whole story, but it's possible to get further optimizations using SIMD. Have a look at Miguel's presentation from PDC 2008 about implementing SIMD instructions in Mono:

SIMD beats doubles' ass in this particular configuration.

(Benchmark picture from Miguel's blog entry.)

Henrik
+1  A: 

The answer highly depends on what the library is doing and how it is used.

The gains can range from a few percentage points to "several times faster". The areas most likely to see gains are those where you're dealing not with isolated vectors or values, but with multiple vectors or values that have to be processed in the same way.

Another area is when you're hitting cache or memory limits, which, again, requires a lot of values/vectors being processed.

The domains where the gains can be most drastic are probably image and signal processing, computational simulations, and general 3D math operations on meshes (rather than isolated vectors). A sketch of the pattern follows below.

Eric Grange
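As a small illustration of the "many values, one operation" pattern Eric describes (the function and its gain/offset operation are assumptions for illustration, not from the answer):

#include <xmmintrin.h>

// Apply the same gain and offset to a whole buffer, four samples per
// SSE instruction, with a scalar loop for the leftover tail.
void ScaleBias(float* data, int n, float scale, float bias)
{
    __m128 s = _mm_set1_ps(scale);
    __m128 b = _mm_set1_ps(bias);
    int i = 0;
    for (; i + 4 <= n; i += 4)
    {
        __m128 v = _mm_loadu_ps(data + i);
        _mm_storeu_ps(data + i, _mm_add_ps(_mm_mul_ps(v, s), b));
    }
    for (; i < n; ++i)
        data[i] = data[i] * scale + bias;
}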
+1  A: 

These days all the good compilers for x86 generate SSE instructions for SP and DP float math by default. It's nearly always faster to use these instructions than the native x87 ones, even for scalar operations, so long as you schedule them correctly. This will come as a surprise to many who in the past found SSE to be "slow" and thought compilers could not generate fast SSE scalar instructions. But now you have to use a switch to turn off SSE generation and use x87. Note that x87 is effectively deprecated at this point and may be removed from future processors entirely. The one downside of this is that we may lose the ability to do 80-bit DP float in registers. But the consensus seems to be that if you are depending on 80-bit instead of 64-bit DP floats for precision, you should look for a more precision-loss-tolerant algorithm.

Everything above came as a complete surprise to me. It's very counterintuitive. But data talks.
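For reference, a sketch of the switches in question (worth double-checking against your compiler's documentation; on x86-64, scalar SSE math is the default for both GCC and MSVC):

// g++ -m32 -mfpmath=387 foo.cpp         -- 32-bit GCC: force x87 math
// g++ -m32 -mfpmath=sse -msse2 foo.cpp  -- 32-bit GCC: scalar SSE math
// cl /arch:SSE2 foo.cpp                 -- 32-bit MSVC: scalar SSE2 math
// cl /arch:IA32 foo.cpp                 -- 32-bit MSVC: no SSE at all
double ScalarMath(double x)
{
    return x * x + 1.0; // mulsd/addsd under SSE2, fmul/fadd under x87
}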

I doubt that x87 will be removed entirely. There's too much software out there that relies on it.
Nathan Fellman
@Nathan: Agreed, +1. It will likely just be stripped down to be slow as molasses at least relative to the total transistor count of the chip, like it was in the P4.
dsimcha
+1  A: 

For some very rough numbers: I've heard some people on ompf.org claim 10x speedups for some hand-optimized ray tracing routines. I've also had some good speedups. I estimate I got somewhere between 2x and 6x on my routines, depending on the problem, and many of these had a couple of unnecessary stores and loads. If you have a huge amount of branching in your code, forget about it, but for problems that are naturally data-parallel you can do quite well.

However, I should add that your algorithms should be designed for data-parallel execution. This means that if you have a generic math library, as you've mentioned, then it should take packed vectors rather than individual vectors, or you'll just be wasting your time.

E.g. something like:

#include <xmmintrin.h> // for the __m128 SSE type

namespace SIMD {
class PackedVec4d
{
public:
  // Structure-of-arrays layout: each register holds one component of four
  // different vectors, so every SSE operation processes four vectors at once.
  __m128 x;
  __m128 y;
  __m128 z;
  __m128 w;

  //...
};
}
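With this layout, for instance, a dot product of four vector pairs needs only vertical multiplies and adds, with no shuffles or horizontal operations (the function below is an illustrative addition, not part of the original answer):

// Four dot products at once: lane i of the result is dot(a_i, b_i).
inline __m128 Dot(const SIMD::PackedVec4d& a, const SIMD::PackedVec4d& b)
{
    return _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(a.x, b.x), _mm_mul_ps(a.y, b.y)),
        _mm_add_ps(_mm_mul_ps(a.z, b.z), _mm_mul_ps(a.w, b.w)));
}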

Most problems where performance matters can be parallelized since you'll most likely be working with a large dataset. Your problem sounds like a case of premature optimization to me.

Rehno Lindeque
I think your answer just confirms the point I have tried to make with my answer: if you want a speedup, do not convert your general 3D math; rather, make your whole computation SIMD friendly. Converting your 3D math will not help much, if at all. Still, it seems most other posters disagree. Some superstitions seem to have deep roots.
Suma
A: 

In my experience I typically see about a 3x improvement in taking an algorithm from x87 to SSE, and a better than 5x improvement in going to VMX/Altivec (because of complicated issues having to do with pipeline depth, scheduling, etc). But I usually only do this in cases where I have hundreds or thousands of numbers to operate on, not for those where I'm doing one vector at a time ad hoc.

Crashworks