+1  Q: 

SIMD Programming

I am using the SSE extensions available on a Core 2 Duo processor (compiler: gcc 4.4.1). I see that there are 16 registers available, each 128 bits wide. I can pack 4 integer values into one register and 4 into another, and using intrinsics I can add them with one instruction. The obvious advantage is that I need only 1 instruction instead of 4.

My question is: is that all there is to SIMD? Suppose I have a1, a2, a3, a4, a5, a6, a7, a8 and b1, b2, b3, b4, b5, b6, b7, b8, and let A1 and B1 be vector registers. Now A1 <<< (a1, a2, a3, a4) and B1 <<< (b1, b2, b3, b4), and add(A1, B1) will perform the vector addition.

Let A2 <<< (a5, a6, a7, a8) and B2 <<< (b5, b6, b7, b8). Is there an add instruction that can do add(A1, B1) and add(A2, B2) simultaneously?

How many vector functional units are available in the Core 2 Duo, and where can I find this information?

Any other sources of information related to this would be highly appreciated.

A: 

The Intel site contains all the info you'll ever need!

http://www.intel.com/products/processor/manuals/

Edit in answer to the comment: all the info is in the manuals linked above, but the answer is no. You could pack 8 16-bit integers into 1 register and thus perform 8 simultaneous adds, but SSE does not have an instruction that adds 2 pairs of registers at once.
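To illustrate the 8-lane case, here is a minimal sketch using the SSE2 intrinsic _mm_add_epi16. The function name add8_i16 is hypothetical, picked only for this example, and unaligned loads are used so the caller's arrays need no special alignment.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>

/* Hypothetical helper: adds eight 16-bit integers with a single
 * packed add. Each lane wraps around on overflow. */
void add8_i16(const int16_t *a, const int16_t *b, int16_t *out)
{
    /* Unaligned loads: no 16-byte alignment assumed for a, b or out. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* One instruction (PADDW) performs all eight 16-bit additions. */
    __m128i vc = _mm_add_epi16(va, vb);

    _mm_storeu_si128((__m128i *)out, vc);
}
```

This only helps if your values fit in 16 bits. For 32-bit integers you are back to 4 lanes per register.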

Goz
Can you at least give the answer to the first question? That is, whether the same add instruction can add two different sets of 4 integers.
anup
A: 

I don't think there's a single instruction to do this (unless they snuck one into a recent version of SSE).

However, since the operations that you're doing are independent, the processor can issue the second add instruction before the first one finishes. So the timeline would look something like

begin C1 = A1 + B1
begin C2 = A2 + B2
wait
end C1 = A1 + B1
end C2 = A2 + B2

So even though you're using two instructions, you're not necessarily taking twice the time. The actual duration of the wait will depend on the processor and the latency of the particular instruction that you're using.
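In intrinsics, the two independent adds might look like the sketch below. The function name add8_i32 is made up for this example, and whether the two adds actually overlap is up to the processor and to how the compiler schedules the code, so treat this as an illustration rather than a guarantee.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>

/* Hypothetical helper: c[i] = a[i] + b[i] for 8 32-bit integers,
 * done as two 4-wide adds that do not depend on each other. */
void add8_i32(const int32_t *a, const int32_t *b, int32_t *c)
{
    /* A1, B1 hold elements 0..3; A2, B2 hold elements 4..7. */
    __m128i a1 = _mm_loadu_si128((const __m128i *)a);
    __m128i b1 = _mm_loadu_si128((const __m128i *)b);
    __m128i a2 = _mm_loadu_si128((const __m128i *)(a + 4));
    __m128i b2 = _mm_loadu_si128((const __m128i *)(b + 4));

    /* Neither add reads the other's result, so the second one
     * does not have to wait for the first to complete. */
    __m128i c1 = _mm_add_epi32(a1, b1);
    __m128i c2 = _mm_add_epi32(a2, b2);

    _mm_storeu_si128((__m128i *)c, c1);
    _mm_storeu_si128((__m128i *)(c + 4), c2);
}
```

If c2 were computed from c1 instead, the second add would have to wait for the first, and the overlap described above would be lost.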

Here's a more detailed explanation of pipelining: http://en.wikipedia.org/wiki/Instruction_pipeline

For help on SIMD programming in general, Apple's SSE page is pretty good. It's somewhat geared towards people migrating applications from PowerPC to SSE, but there's some good general information there too.

celion
+1  A: 
  • No, there isn't any single SSE instruction to do that. You need to issue two instructions. Are you thinking of something like the x86 string instructions and the REP prefix? There's no SSE equivalent.

  • The two 4-wide vector operations will be executed concurrently in the sense that all modern processors are highly pipelined. The second instruction will go down the pipe only 1 cycle behind the first (assuming the two aren't interdependent, which is the case in your example), so their execution will overlap in time, except for that one cycle.

  • Each core of your multi-core processor has its own vector functional unit. You have to write multi-threaded code to take advantage of this.

  • Some CPUs have 1 vector unit per core, some have only 1/2! In the latter case, the vector unit is only 64 bits wide and executes only one half of an SSE instruction at a time. You get what you pay for.

  • You should look into AVX, the new instruction set extension that evolves SSE to support wider vector units.

  • Or you could look into real vector programming on a GPU with OpenCL or CUDA.

Die in Sente