Hi all,

I have a P/Invoke declaration for a C++ function that looks like this:

    [DllImport("ImageProcessing.dll")] // DLL name not given in the original post; this one is a guess
    static extern void ImageProcessing(
        [MarshalAs(UnmanagedType.LPArray)] ushort[] inImage,
        [MarshalAs(UnmanagedType.LPArray)] ushort[] outImage,
        int inYSize, int inXSize);

I've wrapped the function in timing code, both internal and external. Internally, the function runs in 0.24 s. Externally, it runs in 2.8 s, about 12 times slower. What's going on? Is marshalling slowing me down that much? If it is, how can I get around that? Should I go to unsafe code and use pointers or something? I'm sort of flummoxed as to where the extra time cost is coming from.

+3  A: 

Take a look at this article. While its focus is on the Compact Framework, the general principles apply to the desktop as well. A relevant quote from the analysis section:

The managed call doesn't directly call the native method. Instead it calls into a JITted stub method that must perform some overhead routines such as calls to determine GC Preemption status (to determine if a GC is pending and we need to wait). It is also possible that some marshalling code will get JITted into the stub as well. This all takes time.

Edit: Also worth a read is this blog article on the performance of JITted code; again, it's CF-specific, but still relevant. There is also an article covering call-stack depth and its impact on performance, though that one may be CF-specific (it wasn't tested on the desktop).

ctacke
What I'm gathering from this is that I need to be using unsafe code. I'm cool with that, I'll just make the conversion and run the tests, and let you know my results.
mmr
+1  A: 

Have you tried switching the two array parameters to IntPtr? P/Invoke is at its absolute fastest when all of the types in the marshalling signature are blittable. That means the transition comes down to little more than a memcpy to get the data back and forth.

In my team we've found the most performant way to manage our P/Invoke layer is to:

  1. Guarantee that everything being marshalled is blittable.
  2. Pay the price of manually marshalling types such as arrays by manipulating an IntPtr on an as-needed basis. This is fairly trivial, as we have many wrapper methods/classes.

As with any "this will be faster" answer, you'll need to profile it in your own code base. We only arrived at this solution after several approaches were considered and profiled.
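To illustrate why an all-blittable signature is cheap: the native export underneath a declaration like the asker's takes nothing but raw pointers and ints, so the runtime can pin the managed arrays and hand their addresses straight across, with no per-element conversion. A hypothetical C++ sketch of such an export (the real ImageProcessing body isn't shown in the question; the halving here is just a placeholder):

```cpp
#include <cstddef>

// Hypothetical native export matching the P/Invoke signature above.
// Every parameter is blittable: raw pointers and plain ints.
extern "C" void ImageProcessing(const unsigned short* inImage,
                                unsigned short* outImage,
                                int inYSize, int inXSize)
{
    const std::size_t count =
        static_cast<std::size_t>(inYSize) * static_cast<std::size_t>(inXSize);
    for (std::size_t i = 0; i < count; ++i)
        outImage[i] = static_cast<unsigned short>(inImage[i] / 2); // placeholder processing
}
```

Because nothing in that signature needs translation, the managed side can pass pinned array addresses (or IntPtrs) directly, which is exactly what the IntPtr change buys you.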

JaredPar
I'll test it out.
mmr
A: 

The answer is, sadly, far more mundane than these suggestions, although they do help. Basically, I messed up how I was doing the timing.

The timing code that I was using was this:

    Ipp32s timer;
    ippGetCpuFreqMhz(&timer);
    Ipp64u globalStart = ippGetCpuClocks();
    // use this method to get rid of the overhead of getting clock ticks
    globalStart = ippGetCpuClocks() * 2 - globalStart;

    // do some stuff

    Ipp64u globalEnd = ippGetCpuClocks();
    globalEnd = ippGetCpuClocks() * 2 - globalEnd;
    std::cout << "total runtime: "
              << ((Ipp64f)globalEnd - (Ipp64f)globalStart) / ((Ipp64f)timer * 1000000.0f)
              << " seconds" << std::endl;

This code uses Intel's IPP timing routines and is designed to give extremely precise measurements. Unfortunately, that extreme precision comes at a cost of roughly 2.5 seconds per run. Removing the timing code removed that overhead.

There still appears to be some runtime overhead, though: the code reported 0.24 s with that timing code on, and now reports roughly 0.35 s, which means there's about a 50% speed cost.

Changing the code to this:

    static extern void ImageProcessing(
        IntPtr inImage,  // was: [MarshalAs(UnmanagedType.LPArray)] ushort[] inImage
        IntPtr outImage, // was: [MarshalAs(UnmanagedType.LPArray)] ushort[] outImage
        int inYSize, int inXSize);

and called like:

    unsafe {
        fixed (ushort* inImagePtr = theInputImage.DataArray) {
            fixed (ushort* outImagePtr = theResult) {
                ImageProcessing((IntPtr)inImagePtr, // was: theInputImage.DataArray
                    (IntPtr)outImagePtr,            // was: theResult
                    ysize,
                    xsize);
            }
        }
    }

drops the execution time to 0.3 s (average of three runs). Still too slow for my tastes, but a roughly 10x speed improvement is certainly within the realm of acceptability for my boss.

mmr