The answer is, sadly, far more mundane than these suggestions, although they do help. Basically, I messed up how I was doing the timing.
The timing code that I was using was this:
    Ipp32s timer;
    ippGetCpuFreqMhz(&timer); // CPU frequency in MHz
    Ipp64u globalStart = ippGetCpuClocks();
    globalStart = ippGetCpuClocks() * 2 - globalStart; // use this method to get rid of the overhead of getting clock ticks
    // do some stuff
    Ipp64u globalEnd = ippGetCpuClocks();
    globalEnd = ippGetCpuClocks() * 2 - globalEnd;
    std::cout << "total runtime: "
              << ((Ipp64f)globalEnd - (Ipp64f)globalStart) / ((Ipp64f)timer * 1000000.0)
              << " seconds" << std::endl;
This code is specific to Intel's IPP library, and is designed to give extremely precise time measurements. Unfortunately, that precision comes at a cost of roughly 2.5 seconds per run. Removing the timing code removed that overhead.
There still appears to be a discrepancy in the reported runtime, though: the code would report 0.24 s with that timing code in place, and now reports roughly 0.35 s, which is a slowdown of about 50%.
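For comparison, here is a minimal sketch of a lower-overhead way to time the same region, assuming a C++11 compiler. It swaps the IPP calls for `std::chrono::steady_clock`, which is not what the original code used. The helper names `timeSeconds` and `cyclesToSeconds` are made up for illustration; the second one only spells out the cycles-to-seconds arithmetic from the `std::cout` line above.

```cpp
#include <chrono>

// Hypothetical helper: runs `work` once and returns the elapsed
// wall-clock time in seconds. steady_clock is monotonic, so the
// result cannot be negative.
template <typename Work>
double timeSeconds(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}

// Hypothetical helper: converts a CPU cycle count to seconds, given the
// clock frequency in MHz (the unit ippGetCpuFreqMhz reports).
double cyclesToSeconds(double cycles, double freqMhz) {
    return cycles / (freqMhz * 1.0e6);
}
```

Usage would look something like `double s = timeSeconds([&] { /* do some stuff */ });`. Whether this resolution is good enough depends on how short the measured region is.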
Changing the code to this:
    static extern void ImageProcessing(
        IntPtr inImage,  // [MarshalAs(UnmanagedType.LPArray)] ushort[] inImage,
        IntPtr outImage, // [MarshalAs(UnmanagedType.LPArray)] ushort[] outImage,
        int inYSize, int inXSize);
and calling it like this:
    unsafe {
        fixed (ushort* inImagePtr = theInputImage.DataArray) {
            fixed (ushort* outImagePtr = theResult) {
                ImageProcessing((IntPtr)inImagePtr,  // theInputImage.DataArray,
                                (IntPtr)outImagePtr, // theResult,
                                ysize,
                                xsize);
            }
        }
    }
drops the execution time to 0.3 s (average of three runs). Still too slow for my taste, but a 10x speed improvement is certainly within the realm of acceptability for my boss.
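For readers wondering what the native side of that call might look like, here is a rough sketch. The real `ImageProcessing` body is not shown in this post, so the pixel inversion below is purely a placeholder, and the parameter types are an assumption inferred from the `ushort*` pointers being passed. A Windows DLL build would also need an export annotation such as `__declspec(dllexport)`, omitted here to keep the sketch portable.

```cpp
#include <cstddef>

// Hypothetical native counterpart of the P/Invoke declaration.
// The placeholder body inverts each 16-bit pixel; the actual
// processing done by the real function is not known.
extern "C" void ImageProcessing(const unsigned short* inImage,
                                unsigned short* outImage,
                                int inYSize, int inXSize) {
    const std::size_t count =
        static_cast<std::size_t>(inYSize) * static_cast<std::size_t>(inXSize);
    for (std::size_t i = 0; i < count; ++i) {
        // Reads and writes go directly to the buffers the caller pinned.
        outImage[i] = static_cast<unsigned short>(65535u - inImage[i]);
    }
}
```

The point of the sketch is only that the pointers pinned by `fixed` arrive as ordinary pointers to the managed arrays' memory, so the native code works on the caller's buffers in place.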