ansaurus

Question

SIMD Programming

Answer 1

A:

The Intel site contains all the info you'll ever need!

http://www.intel.com/products/processor/manuals/

Edit in answer to the comment: all the info is in the link above, but no. You can pack eight 16-bit integers into one register and thus perform eight simultaneous adds, but SSE does not let a single instruction add two separate sets of registers simultaneously.
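To make the "eight simultaneous adds" point concrete, here is a minimal sketch using the SSE2 intrinsics from `<emmintrin.h>`. The helper name `add_i16` and the requirement that `n` be a multiple of 8 are my own assumptions for illustration, not something from the question.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: c[i] = a[i] + b[i] for n 16-bit integers.
   Assumes n is a multiple of 8 (one 128-bit register holds 8 lanes). */
void add_i16(const int16_t *a, const int16_t *b, int16_t *c, int n)
{
    assert(n % 8 == 0);
    for (int i = 0; i < n; i += 8) {
        /* Unaligned loads, so the arrays need no special alignment. */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* One packed add: eight 16-bit additions (wrapping on overflow). */
        __m128i vc = _mm_add_epi16(va, vb);
        _mm_storeu_si128((__m128i *)(c + i), vc);
    }
}
```

Each loop iteration issues exactly one packed add, which operates on one pair of registers. Adding a second pair of registers takes a second instruction.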

Goz 2010-09-11 11:32:51

Can you at least give the answer for the first question? That is, whether the same add instruction can add two different sets of 4 integers.

anup 2010-09-11 11:36:01

Answer 2

A:

I don't think there's a single instruction to do this (unless they snuck one into a recent version of SSE).

However, since the operations that you're doing are independent, the processor can issue the second add instruction before the first one finishes. So the timeline would look something like

begin C1 = A1 + B1
begin C2 = A2 + B2
wait
end C1 = A1 + B1
end C2 = A2 + B2

So even though you're using two instructions, you're not necessarily taking twice the time. The actual duration of the wait will depend on the processor and the latency of the particular instruction that you're using.
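As a rough sketch of the two-instruction version described above, assuming SSE2 intrinsics and a hypothetical helper name `add_two_sets` (with `n` assumed to be a multiple of 4), the two adds could be written like this:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: c1 = a1 + b1 and c2 = a2 + b2, element-wise,
   for n 32-bit integers per array. Assumes n is a multiple of 4. */
void add_two_sets(const int32_t *a1, const int32_t *b1, int32_t *c1,
                  const int32_t *a2, const int32_t *b2, int32_t *c2, int n)
{
    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4) {
        __m128i va1 = _mm_loadu_si128((const __m128i *)(a1 + i));
        __m128i vb1 = _mm_loadu_si128((const __m128i *)(b1 + i));
        __m128i va2 = _mm_loadu_si128((const __m128i *)(a2 + i));
        __m128i vb2 = _mm_loadu_si128((const __m128i *)(b2 + i));

        /* Two separate packed adds. The second does not read the result
           of the first, so the hardware is free to overlap them. */
        __m128i vc1 = _mm_add_epi32(va1, vb1);
        __m128i vc2 = _mm_add_epi32(va2, vb2);

        _mm_storeu_si128((__m128i *)(c1 + i), vc1);
        _mm_storeu_si128((__m128i *)(c2 + i), vc2);
    }
}
```

Whether the two adds actually overlap, and by how much, depends on the processor. The code only guarantees that there is no data dependency between them.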

Here's a more detailed explanation of pipelining: http://en.wikipedia.org/wiki/Instruction_pipeline

For help on SIMD programming in general, Apple's SSE page is pretty good. It's somewhat geared towards people migrating applications from PowerPC to SSE, but there's some good general information there too.

celion 2010-09-11 13:29:29

Answer 3

A:

No, there isn't any single SSE instruction to do that. You need to issue two instructions. Are you thinking of something like the x86 string instructions and the REP prefix? There's no SSE equivalent.
The two 4-wide vector operations will be executed concurrently in the sense that all modern processors are highly pipelined. The second instruction will go down the pipe only 1 cycle behind the first (assuming the two aren't interdependent, which is the case in your example), so their execution will overlap in time, except for that one cycle.
Each core of your multi-core processor has its own vector functional unit. You have to write multi-threaded code to take advantage of this.
Some CPUs have 1 vector unit per core, some have only 1/2! In the latter case, the vector unit is only 64 bits wide and executes only one half of the SSE instruction at a time. You get what you pay for.
You should look into AVX, the new instruction set extension that evolves SSE to support wider vector units.
Or you could look into real vector programming on a GPU with OpenCL or CUDA.

Die in Sente 2010-09-11 14:56:15
