views:

3321

answers:

14

I have a loop written in C++ which is executed for each element of a big integer array. Inside the loop, I mask some bits of the integer and then find the min and max values. I have heard that if I use SSE instructions for these operations, it will run much faster than a normal loop written using bitwise AND and if-else conditions. My question: should I go for these SSE instructions? Also, what happens if my code runs on a different processor? Will it still work, or are these instructions processor specific?
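For concreteness, here is a hedged sketch of the kind of scalar loop being described; the mask value and the use of `uint32_t` are assumptions, not taken from the question:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical scalar baseline: mask some bits of each integer, then track
// the running min and max. MASK is a made-up example value.
const uint32_t MASK = 0x0FFFFFFFu;

void min_max_masked(const std::vector<uint32_t>& data,
                    uint32_t& mn, uint32_t& mx)
{
    mn = UINT32_MAX;
    mx = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        uint32_t v = data[i] & MASK;   // bitwise AND
        if (v < mn) mn = v;            // if-else style min...
        if (v > mx) mx = v;            // ...and max
    }
}
```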

+13  A: 
  1. SSE instructions are processor specific. You can look up which processor supports which SSE version on Wikipedia.
  2. Whether SSE code will be faster depends on many factors: the first is of course whether the problem is memory-bound or CPU-bound. If the memory bus is the bottleneck, SSE will not help much. Try simplifying your integer calculations; if that makes the code faster, it's probably CPU-bound, and you have a good chance of speeding it up.
  3. Be aware that writing SIMD code is a lot harder than writing C++ code, and that the resulting code is much harder to change. Always keep the C++ code up to date; you'll want it as a comment and to check the correctness of your assembler code.
  4. Think about using a library like Intel's IPP, which implements common low-level SIMD operations optimized for various processors.
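Point 3 in practice might look like the following sketch: the plain C++ loop is kept alongside an SSE2 version as the reference implementation. The mask value is a placeholder, and the SSE2 version assumes the length is a multiple of 4 (SSE2 has no 32-bit min instruction, so one is built from a compare and a blend):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <climits>

static const int MASK = 0x00FFFFFF;  // hypothetical example value

// Reference C++ version: kept up to date, used to check the SIMD code.
int min_masked_ref(const int* a, int n)
{
    int m = INT_MAX;
    for (int i = 0; i < n; ++i) {
        int v = a[i] & MASK;
        if (v < m) m = v;
    }
    return m;
}

// SSE2 version; n is assumed to be a multiple of 4.
int min_masked_sse2(const int* a, int n)
{
    __m128i mask = _mm_set1_epi32(MASK);
    __m128i best = _mm_set1_epi32(INT_MAX);
    for (int i = 0; i < n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i*)(a + i));
        v = _mm_and_si128(v, mask);
        // SSE2 has no 32-bit min, so build one from a compare and a blend:
        __m128i lt = _mm_cmplt_epi32(v, best);
        best = _mm_or_si128(_mm_and_si128(lt, v),
                            _mm_andnot_si128(lt, best));
    }
    // Reduce the 4 lanes to a single scalar minimum.
    int out[4];
    _mm_storeu_si128((__m128i*)out, best);
    int m = out[0];
    for (int i = 1; i < 4; ++i)
        if (out[i] < m) m = out[i];
    return m;
}
```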
Niki
+2  A: 

SSE instructions were originally just on Intel chips, but AMD has supported them as well for some time (fully since the Athlon XP), so if you code against the SSE instruction set, you should be portable to most x86 processors.

That being said, it may not be worth your time to learn SSE coding unless you're already familiar with x86 assembler - an easier option might be to check your compiler docs and see if there are options to let the compiler auto-generate SSE code for you. Some compilers do very well vectorizing loops in this way. (You're probably not surprised to hear that the Intel compilers do a good job of this :)
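The kind of loop that auto-vectorizers handle well tends to look like this sketch: no branches, contiguous access, and a trip count that doesn't depend on the data. The function name and flags mentioned in the comment are illustrative assumptions:

```cpp
// A loop shape the auto-vectorizer can turn into SSE on its own when built
// with e.g. `g++ -O2 -ftree-vectorize` or the Intel compiler: straight-line
// body, contiguous arrays, no data-dependent control flow.
void mask_all(int* dst, const int* src, int n, int mask)
{
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] & mask;
}
```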

Mike
You do NOT need to know assembly to make use of the SIMD intrinsics. For example, x = _mm_mul_ps(y, z) multiplies each of the 4 floats in y by the corresponding 4 floats in z and puts the result in x. How easy is that?
Mark Borgerding
+2  A: 

If you use SSE instructions, you're obviously limited to processors that support these. That means x86, dating back to the Pentium III, which introduced SSE in 1999.

SSE2, which is the one that offers integer operations, is somewhat more recent (it arrived with the Pentium 4; on AMD's side, the first Athlon processors didn't support it, and it only appeared with the Athlon 64).

In any case, you have two options for using these instructions. Either write the entire block of code in assembly (probably a bad idea. That makes it virtually impossible for the compiler to optimize your code, and it's very hard for a human to write efficient assembler).

Alternatively, use the intrinsics available with your compiler (if memory serves, they're usually defined in xmmintrin.h)

But again, the performance may not improve. SSE code poses additional requirements on the data it processes. The main one to keep in mind is that data must be aligned on 16-byte (128-bit) boundaries. There should also be few or no dependencies between the values loaded into the same register (a 128-bit SSE register can hold 4 ints; adding the first and the second one together is not optimal, but adding all four ints to the corresponding 4 ints in another register will be fast).
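The alignment requirement shows up directly in the intrinsics: `_mm_load_si128` requires a 16-byte-aligned address (it can fault otherwise), while `_mm_loadu_si128` accepts any address, at some cost on older CPUs. A minimal sketch, assuming a C++11 compiler for `alignas`:

```cpp
#include <emmintrin.h>  // SSE2

// 16-byte-aligned storage, safe for the aligned load below.
alignas(16) int data[4] = {1, 2, 3, 4};

int sum4_aligned()
{
    // OK only because `data` is 16-byte aligned; use _mm_loadu_si128 for
    // arbitrary addresses.
    __m128i v = _mm_load_si128((const __m128i*)data);
    // Horizontal add of the 4 lanes via two shuffle+add steps (SSE2 only).
    v = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2)));
    v = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(v);  // lane 0 now holds the total
}
```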

It may be tempting to use a library that wraps all the low-level SSE fiddling, but that might also ruin any potential performance benefit.

I don't know how good SSE's integer operation support is, so that may also be a factor that can limit performance. SSE is mainly targeted at speeding up floating point operations.

jalf
+2  A: 

We have implemented some image processing code, similar to what you describe but on a byte array, in SSE. The speedup compared to C code is considerable: depending on the exact algorithm, more than a factor of 4, even with respect to the Intel compiler. However, as you already mentioned, you have the following drawbacks:

  • Portability. The code will run on every Intel-like CPU, so also AMD, but not on other CPUs. That is not a problem for us because we control the target hardware. Switching compilers, or even to a 64-bit OS, can also be a problem.

  • You have a steep learning curve, but I found that after you grasp the principles writing new algorithms is not that hard.

  • Maintainability. Most C or C++ programmers have no knowledge of assembly/SSE.

My advice would be to go for it only if you really need the performance improvement, you can't find a function for your problem in a library like the Intel IPP, and you can live with the portability issues.

Dani van der Meer
+3  A: 

If you intend to use Microsoft Visual C++, you should read this:

http://www.codeproject.com/KB/recipes/sseintro.aspx

Migol
+7  A: 

SIMD, of which SSE is an example, allows you to do the same operation on multiple chunks of data. So you won't get any advantage from using SSE as a straight replacement for the integer operations; you will only get advantages if you can do the operations on multiple data items at once. This involves loading data values that are contiguous in memory, doing the required processing, and then stepping to the next set of values in the array.

Problems:

1 If the code path is dependent on the data being processed, SIMD becomes much harder to implement. For example:

a = array [index];
a &= mask;
a >>= shift;
if (a < somevalue)
{
  a += 2;
  array [index] = a;
}
++index;

is not easy to do as SIMD:

a1 = array [index] a2 = array [index+1] a3 = array [index+2] a4 = array [index+3]
a1 &= mask         a2 &= mask           a3 &= mask           a4 &= mask
a1 >>= shift       a2 >>= shift         a3 >>= shift         a4 >>= shift
if (a1<somevalue)  if (a2<somevalue)    if (a3<somevalue)    if (a4<somevalue)
  // help! can't conditionally perform this on each column, all columns must do the same thing
index += 4

2 If the data is not contiguous then loading it into the SIMD registers is cumbersome

3 The code is processor specific. SSE is only on IA32 (Intel/AMD), and not all IA32 CPUs support SSE.

You need to analyse the algorithm and the data to see if it can be SSE'd and that requires knowing how SSE works. There's plenty of documentation on Intel's website.
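Problem 1 can often be worked around without per-lane branches: compute a compare mask and blend, so every column really does "do the same thing". Below is a hedged SSE2 sketch of the pseudocode above, with placeholder values mask = 0xFFFF, shift = 2, somevalue = 100:

```cpp
#include <emmintrin.h>  // SSE2

// Branchless version of: a = array[i] & mask; a >>= shift;
// if (a < somevalue) { a += 2; array[i] = a; }  -- for 4 ints at once.
void step_sse2(int* array, int index)
{
    const __m128i mask = _mm_set1_epi32(0xFFFF);  // placeholder mask
    const __m128i some = _mm_set1_epi32(100);     // placeholder somevalue
    const __m128i two  = _mm_set1_epi32(2);

    __m128i orig = _mm_loadu_si128((const __m128i*)(array + index));
    __m128i a    = _mm_srli_epi32(_mm_and_si128(orig, mask), 2);
    __m128i lt   = _mm_cmplt_epi32(a, some);   // all-ones where a < somevalue
    __m128i upd  = _mm_add_epi32(a, two);      // the "a += 2" path
    // Keep upd in lanes where the condition held, the original elsewhere.
    __m128i res  = _mm_or_si128(_mm_and_si128(lt, upd),
                                _mm_andnot_si128(lt, orig));
    _mm_storeu_si128((__m128i*)(array + index), res);
}
```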

Skizz

Skizz
Problem 1 is generally solved using SIMD mask instructions, e.g. __m128 mask = _mm_cmplt_ps(a, somevalue); a = _mm_add_ps(a, _mm_and_ps(mask, _mm_set_ps1(2.0f))); for the if (a < somevalue) a += 2; case.
Jasper Bekkers
+1  A: 

Although it is true that SSE is specific to some processors (SSE may be relatively safe, SSE2 much less in my experience), you can detect the CPU at runtime, and load the code dynamically depending on the target CPU.
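A common shape for this runtime-dispatch idea is to pick an implementation once, based on the CPU check, and call through a function pointer afterwards. A minimal sketch; the SSE2 variant here is a stand-in (in real code it would be a separately compiled unit built with `-msse2`):

```cpp
#include <climits>

typedef int (*min_fn)(const int*, int);

// Fallback implementation that works on any CPU.
int min_scalar(const int* a, int n)
{
    int m = INT_MAX;
    for (int i = 0; i < n; ++i)
        if (a[i] < m) m = a[i];
    return m;
}

// Stand-in for an SSE2 build of the same routine; in real code this would
// live in a translation unit compiled with -msse2.
int min_sse2(const int* a, int n) { return min_scalar(a, n); }

// Choose once at startup, after detecting the CPU.
min_fn choose_min(bool cpu_has_sse2)
{
    return cpu_has_sse2 ? min_sse2 : min_scalar;
}
```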

David Cournapeau
+1  A: 

I don't recommend doing this yourself unless you're fairly proficient with assembly. Using SSE will, more than likely, require careful reorganization of your data, as Skizz points out, and the benefit is often questionable at best.

It would probably be much better for you to write very small loops, keep your data very tightly organized, and just rely on the compiler doing this for you. Both the Intel C Compiler and GCC (since 4.1) can auto-vectorize your code and will probably do a better job than you. (With GCC, just add -ftree-vectorize to your CXXFLAGS.)

Edit: Another thing I should mention is that several compilers support assembly intrinsics, which would probably, IMO, be easier to use than the asm() or __asm{} syntax.

greyfade
I've yet to see GCC's autovectorizer do more good than harm, though I guess it could always get better.
Crashworks
New versions always progress in features and functionality. I've heard that GCC's vectorizer is fairly good, and better in version 4.3, especially now that it's the default in -O3.
greyfade
A: 

SIMD intrinsics (such as SSE2) can speed this sort of thing up but take expertise to use correctly. They are very sensitive to alignment and pipeline latency; careless use can make performance even worse than it would have been without them. You'll get a much easier and more immediate speedup from simply using cache prefetching to make sure all your ints are in L1 in time for you to operate on them.
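The prefetch suggestion can be sketched as follows: hint upcoming cache lines into L1 a fixed distance ahead of the current element. The distance of 64 ints here is an arbitrary placeholder; the right value depends on the loop body and the memory system:

```cpp
#include <xmmintrin.h>  // _mm_prefetch

// Software prefetch a fixed distance ahead while summing. The distance
// (64 ints here) is a guess that would need tuning on real hardware.
long long sum_with_prefetch(const int* a, int n)
{
    long long s = 0;
    for (int i = 0; i < n; ++i) {
        if (i + 64 < n)
            _mm_prefetch((const char*)(a + i + 64), _MM_HINT_T0);
        s += a[i];
    }
    return s;
}
```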

Unless your function needs a throughput of better than 100,000,000 integers per second, SIMD probably isn't worth the trouble for you.

Crashworks
+1  A: 

Just to add briefly to what has been said before about different SSE versions being available on different CPUs: This can be checked by looking at the respective feature flags returned by the CPUID instruction (see e.g. Intel's documentation for details).
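The CPUID check above can be sketched like this with GCC/Clang's `<cpuid.h>` helper (an assumption about the toolchain; MSVC has `__cpuid` in `<intrin.h>` instead). Leaf 1 returns the feature bits in EDX/ECX; per Intel's documentation, SSE is EDX bit 25 and SSE2 is EDX bit 26:

```cpp
#include <cpuid.h>  // GCC/Clang-specific helper for the CPUID instruction

// True if the running CPU reports SSE2 support via CPUID leaf 1, EDX bit 26.
bool cpu_has_sse2()
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;  // CPUID leaf 1 not available
    return (edx & (1u << 26)) != 0;
}
```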

PhiS
+1  A: 

Have a look at inline assembler for C/C++; here is a DDJ article. Unless you are 100% certain your program will run on a compatible platform, you should follow the recommendations many have given here.

epatel
+1  A: 
Mr J
+1  A: 

Write code that helps the compiler understand what you are doing. GCC will understand and optimize code such as this into SSE:

union Vector4f
{
  // Easy constructor, defaulted to black/0 vector
 Vector4f(float a = 0, float b = 0, float c = 0, float d = 1.0f):
  X(a), Y(b), Z(c), W(d) { }

  // Cast operator, for []
 inline operator float* ()
 { 
  return (float*)this;
 }

  // Const cast operator, for const []
 inline operator const float* () const
 { 
  return (const float*)this;
 }

 // ---------------------------------------- //

 inline Vector4f& operator += (const Vector4f &v)
 {
  for(int i=0; i<4; ++i)
   (*this)[i] += v[i];

  return *this;
 }

 inline Vector4f& operator += (float t)
 {
  for(int i=0; i<4; ++i)
   (*this)[i] += t;

  return *this;
 }

  // Note: anonymous structs in a union are a GCC extension, in keeping with
  // the GCC-specific advice above.

  // Vertex / Vector 
  // Lower case xyzw components
 struct {
  float x, y, z;
  float w;
 };

  // Upper case XYZW components
 struct {
  float X, Y, Z;
  float W;
 };
};

Just don't forget to have -msse -msse2 in your build flags!

LiraNuna
+1  A: 

I can tell from my experience that SSE brings a huge (4x and up) speedup over a plain C version of the code (no inline asm, no intrinsics used), but hand-optimized assembler can beat compiler-generated assembly if the compiler can't figure out what the programmer intended (believe me, compilers don't cover all possible code combinations, and they never will). And the compiler can't always lay out the data so that it runs at the fastest possible speed. But you need a lot of experience for a speedup over the Intel compiler (if that's possible at all).

Quonux