
Hi,

I need to optimize some C code, which does lots of physics computations, using SIMD extensions on the SPE of the Cell Processor. Each vector operator can process 4 floats at the same time. So ideally I would expect a 4x speedup in the most optimistic case.
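
For concreteness, this is the kind of operation I mean (a minimal sketch using spu_intrinsics.h; the names are just illustrative):

    #include <spu_intrinsics.h>

    /* Add two float arrays, four lanes per instruction.
       Assumes 16-byte-aligned data and a length that is a multiple of 4. */
    void add_arrays(vec_float4 *c, const vec_float4 *a,
                    const vec_float4 *b, int n4)
    {
        for (int i = 0; i < n4; ++i)
            c[i] = spu_add(a[i], b[i]);  /* one instruction, four floats */
    }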

Do you think the use of vector operators could give speedups bigger than 4x?

Thanks

+3  A: 

On their own, no. But if the process of re-writing your algorithms to support them also happens to improve, say, cache locality or branching behaviour, then you could find unrelated speed-ups. However, this is true of any re-write...

Oli Charlesworth
These unrelated speedups are usually called superlinear speedup.
Matias Valdenegro
+3  A: 

It CAN give a better speed-up than 4x over straight floating point, as SIMD instructions can be less exact (not so much as to cause too many problems, though) and so take fewer cycles to execute. It really depends.

Best plan is to learn as much as possible about the processor you are optimising for. You may find it can give you far better than a 4x improvement; you may find it can't. We can't say, though, without knowing more about the algorithm you are optimising and what CPU you are targeting.
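
For example, the "less exact but faster" trade-off looks like this with SSE (a sketch; the SPU has analogous estimate instructions):

    #include <xmmintrin.h>

    /* Full-precision divide: DIVPS, comparatively high latency. */
    __m128 div_exact(__m128 a, __m128 b)
    {
        return _mm_div_ps(a, b);
    }

    /* Approximate divide via RCPPS, a fast reciprocal estimate
       (only about 12 bits of precision). */
    __m128 div_approx(__m128 a, __m128 b)
    {
        return _mm_mul_ps(a, _mm_rcp_ps(b));
    }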

Goz
Do you mean going from double to single precision? SSE2 and later support double precision, and most platforms support IEEE or at least fulfill the precision requirements… which aren't so strict as to make single-cycle arithmetic uncommon.
Potatoswatter
No I don't. I'm thinking of a few different platforms I've used. One is x86, where using scalar SSE can be many times faster than using x87. Equally, on one MIPS-based platform parallel instructions executed quicker than their scalar counterparts, and even then you could pipeline standard scalar and parallel FPU instructions simultaneously.
Goz
I don't think so. The PowerPC for example has instructions that only estimate the result, which makes them a lot faster, but you lose some precision.
Georg
I am using the SPE of the Cell Processor.
Werner
@Georg: You've got me :)
Goz
@Werner: If you are using SPEs then AFAIK the best you can get is a 4x speed-up unless you optimise algorithmically.
Goz
IIRC you can get a speedup higher than 4x (for floats) on SPEs, even though the SIMD width is 4. The SPE can only load aligned 16-byte blocks from its local store, so using scalars can generate extra instructions to extract an unaligned scalar from an aligned block, and vice versa, if you are not careful. The compiler can sometimes avoid this by putting the scalar into an aligned block of its own or keeping it in a register, but of course that isn't something you can generally count on.
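Roughly, the difference looks like this (a sketch with made-up names; the comments describe what the compiler typically emits):

    #include <spu_intrinsics.h>

    /* Each element access is a 16-byte quadword load from local store,
       plus a rotate to extract the single float the code asked for. */
    float scalar_sum(const float *data, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; ++i)
            s += data[i];
        return s;
    }

    /* Same load width, but all four lanes are used: no extraction code. */
    vec_float4 vector_sum(const vec_float4 *vdata, int n4)
    {
        vec_float4 s = spu_splats(0.0f);
        for (int i = 0; i < n4; ++i)
            s = spu_add(s, vdata[i]);
        return s;
    }
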
Grizzly
@Grizzly: I wouldn't call that a 4x speed-up though, because if you have to do scalar maths then you'd obviously be best off 16-byte aligning your scalars... That's kinda like saying you wouldn't get a 4x speed-up because you may have to deal with unaligned vectors... it's true, but a bit pedantic ;)
Goz
I wouldn't say so, because in many cases you simply don't have the luxury of 16-byte aligning your scalars. After all, many situations where you might want to use SIMD involve working on large datasets, where wasting most of the space just on alignment isn't really an option (working on 16-byte-aligned float scalars means wasting 3/4 of the space), especially on the SPE, which doesn't have much local memory to begin with. So I don't think comparing unaligned scalars to aligned vectors is that wrong, since it is a common situation.
Grizzly
+3  A: 

The best optimization comes from rethinking the algorithm. Eliminate unnecessary steps. Find a more direct way of accomplishing the same result. Compute the solution in a domain more relevant to the problem.

For example, if the vector array is a list of n points which all lie on the same line, then it is sufficient to transform the end points only and interpolate the intermediate points.
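
In code, the idea might look like this (a sketch, assuming the points are evenly spaced along the line and the transform is affine, so collinearity is preserved; transform() is a placeholder):

    typedef struct { float x, y, z; } point;

    extern void transform(point *p);  /* placeholder per-point transform */

    void transform_collinear(point *p, int n)
    {
        transform(&p[0]);        /* transform the two end points only... */
        transform(&p[n - 1]);
        for (int i = 1; i < n - 1; ++i) {
            float t = (float)i / (float)(n - 1);
            p[i].x = p[0].x + t * (p[n - 1].x - p[0].x);  /* ...and lerp */
            p[i].y = p[0].y + t * (p[n - 1].y - p[0].y);  /* the rest    */
            p[i].z = p[0].z + t * (p[n - 1].z - p[0].z);
        }
    }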

wallyk
yes, at the moment this is the best option I am considering.
Werner
All of which is correct, but orthogonal to the question of what SIMD can do for you...
dmckee
what do you mean by 'orthogonal'?
Werner
+1  A: 

It depends on the architecture. For the moment I'll assume x86 (aka SSE).

You can easily get a factor of four on tight loops. Just replace your existing math with SSE instructions and you're done.

You can even get a little more than that, because with SSE you do the math in registers which are usually left unused by the compiler. This frees up the general-purpose registers for other tasks such as loop control and address calculation. In short, the code that surrounds the SSE instructions will be more compact and execute faster.

And then there is the option of hinting to the memory controller how you want to access memory, e.g. whether you want to store data in a way that bypasses the cache or not. For bandwidth-hungry algorithms that may give you some extra speed on top of that.
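
Putting that together, a tight loop might look like this (a sketch, assuming 16-byte-aligned arrays and a length that is a multiple of 4):

    #include <xmmintrin.h>

    /* y[i] = a * x[i] + y[i], four elements per iteration. */
    void saxpy_sse(float *y, const float *x, float a, int n)
    {
        __m128 va = _mm_set1_ps(a);
        for (int i = 0; i < n; i += 4) {
            __m128 vx = _mm_load_ps(x + i);
            __m128 vy = _mm_load_ps(y + i);
            vy = _mm_add_ps(vy, _mm_mul_ps(va, vx));
            _mm_store_ps(y + i, vy);
            /* For write-once output, _mm_stream_ps(y + i, vy) would
               bypass the cache instead, as mentioned above. */
        }
    }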

Nils Pipenbrinck
this is also a good point
Werner
I am using the SPE of the Cell Processor.
Werner
Oh, you're working on the SPE. Well, here completely different rules apply. Processing power is rarely the bottleneck; instead, the task is to get the data in and out of the SPU without stalling. Not trivial to do!
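The usual pattern is double-buffered DMA: start the transfer for the next chunk, then compute on the current one while it is in flight. Roughly (a sketch; the buffer size, names and compute step are made up):

    #include <spu_mfcio.h>
    #include <stdint.h>

    #define CHUNK 4096  /* bytes per DMA transfer */

    static volatile float buf[2][CHUNK / sizeof(float)]
        __attribute__((aligned(128)));

    extern void compute(volatile float *data, int nfloats);  /* placeholder */

    void process_stream(uint64_t ea, int nchunks)
    {
        int cur = 0;
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);   /* kick off first DMA */
        for (int i = 0; i < nchunks; ++i) {
            int next = cur ^ 1;
            if (i + 1 < nchunks)                   /* start the next one early */
                mfc_get(buf[next], ea + (uint64_t)(i + 1) * CHUNK,
                        CHUNK, next, 0, 0);
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();             /* wait for current buffer */
            compute(buf[cur], CHUNK / sizeof(float));
            cur = next;
        }
    }
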
Nils Pipenbrinck
yes, although in my case, data transfer is not a problem (it takes 5% of the time)
Werner
+2  A: 

This is entirely possible.

  • You can do more clever instruction-level micro-optimizations than a compiler, if you know what you're doing.
  • Most SIMD instruction sets offer several powerful operations that don't have any equivalent in normal scalar FPU/ALU code (e.g. PAVG/PMIN etc. in SSE2). Even if these don't fit your problem exactly, you can often combine these instructions to great effect (see the sketch after this list).
  • Not sure about Cell, but most SIMD instruction sets have features to optimize memory access, for example to prefetch data into the cache. I've had very good results with these.
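
For instance, the PAVG and prefetch points combined (an SSE2 sketch, not Cell; assumes 16-byte-aligned buffers and a length that is a multiple of 16):

    #include <emmintrin.h>

    /* Average two 8-bit images, 16 pixels per instruction, prefetching
       a little ahead.  PAVGB has no single scalar equivalent. */
    void average_images(unsigned char *dst, const unsigned char *a,
                        const unsigned char *b, int n)
    {
        for (int i = 0; i < n; i += 16) {
            _mm_prefetch((const char *)(a + i) + 256, _MM_HINT_T0);
            _mm_prefetch((const char *)(b + i) + 256, _MM_HINT_T0);
            __m128i va = _mm_load_si128((const __m128i *)(a + i));
            __m128i vb = _mm_load_si128((const __m128i *)(b + i));
            _mm_store_si128((__m128i *)(dst + i), _mm_avg_epu8(va, vb));
        }
    }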

Now this isn't Cell or PPC at all, but a simple image convolution filter of mine got a 20x speedup (C vs. SSE2) on Atom, which is higher than the degree of parallelism (16 pixels at a time).

dietr