Rather than hand-coding an alternative SSE implementation to your scalar code, I strongly suggest you have a look at OpenCL. It is a vendor-neutral portable, cross-platform system for computationally intensive applications (and is highly buzzword-compliant!). You can write your algorithm in a subset of C99 designed for vectorised operations, which is much easier than hand-coding SSE. And best of all, OpenCL will generate the best implementation at runtime, to execute either on the GPU or on the CPU. So basically you get the SSE code written for you.
Theres are couple of places in my code base where the same operation is repeated a very large number of times for a large data set. In some cases it's taking a considerable time to process these.
Your application sounds like just the kind of problem that OpenCL is designed to address. Writing alternative functions in SSE would certainly improve the execution speed, but it is a great deal of work to write and debug.
Is there a compiler and OS independent way writing the code to take advantage of SSE instructions? I like the VC++ intrinsics, which include SSE operations, but I haven't found any cross compiler solutions.
Yes. The SSE intrinsics have been essentially standardised by Intel, so the same functions work the same between Windows, Linux and Mac (specifically with Visual C++ and GNU g++).
I still need to support some CPU's that either have no or limited SSE support (eg Intel Celeron). Is there some way to avoid having to make different versions of the program, like having some kind of "run time linker" that links in either the basic or SSE optimised code based on the CPU running it when the process is started?
You could do that (eg. using dlopen()
) but it is a very complex solution. Much simpler would be (in C) to define a function interface and call the appropriate version of the optimised function via function pointer, or in C++ to use different implementation classes, depending on the CPU detected.
With OpenCL it is not necessary to do this, as the code is generated at runtime for the given architecture.
What about other CPU extensions, looking at the instruction sets of various Intel and AMD CPU's shows there are a few of them?
Within the SSE instruction set, there are many flavours. It can be quite difficult to code the same algorithm in different subsets of SSE when certain instructions are not present. I suggest (at least to begin with) that you choose a minimum supported level, such as SSE2, and fall back to the scalar implementation on older machines.
This is also an ideal situation for unit/regression testing, which is very important to ensure your different implementations produce the same results. Have a test suite of input data and known good output data, and run the same data through both versions of the processing function. You may need to have a precision test for passing (ie. the difference epsilon between the result and the correct answer is below 1e6
, for example). This will greatly aid in debugging, and if you build in high-resolution timing to your testing framework, you can compare the performance improvements at the same time.