views: 310
answers: 6

Hi,
I'm writing a C# class to perform 2D separable convolution using integers, to obtain better performance than the double counterpart. The problem is that I don't see a real performance gain.

This is the X filter code (it is valid for both the int and double cases):

foreach (pixel)   // pseudocode: iterate over every pixel, row after row
{
    int value = 0;
    for (int k = 0; k < filterOffsetsX.Length; k++)
    {
        // index is relative to the current pixel position
        value += InputImage[index + filterOffsetsX[k]] * filterValuesX[k];
    }
    tempImage[index] = value;
}

In the integer case "value", "InputImage" and "tempImage" are of "int", "Image<byte>" and "Image<int>" types.
In the double case "value", "InputImage" and "tempImage" are of "double", "Image<double>" and "Image<double>" types.
(filterValues is int[] in each case)
(The class Image<T> is part of an external DLL. It should be similar to the .NET Drawing Image class.)

My goal is to achieve fast performance thanks to int += (byte * int) vs double += (double * int).

The following times are the mean of 200 repetitions.
Filter size 9 = 0.031 (double) 0.027 (int)
Filter size 13 = 0.042 (double) 0.038 (int)
Filter size 25 = 0.078 (double) 0.070 (int)

The performance gain is minimal. Can this be caused by pipeline stalls or suboptimal code?

EDIT: simplified the code, deleting unimportant vars.

EDIT2: I don't think I have a cache-miss related problem, because "index" iterates through adjacent memory cells (in a row-after-row fashion). Moreover, "filterOffsetsX" contains only small offsets relative to pixels on the same row, at a max distance of filter size / 2. The problem could be present in the second separable filter (the Y filter), but the times are not so different.
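
For reference, the Y pass has the same shape (sketch only; "filterOffsetsY", "filterValuesY" and "OutputImage" are assumed names for the Y-direction counterparts, and the Y offsets are multiples of the image width, so each tap reads a different row):

foreach (pixel)   // pseudocode, as above
{
    int value = 0;
    for (int k = 0; k < filterOffsetsY.Length; k++)
    {
        // each offset jumps a whole row, so successive taps are one image width apart
        value += tempImage[index + filterOffsetsY[k]] * filterValuesY[k];
    }
    OutputImage[index] = value;
}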

+2  A: 

At least, it is not fair to compare int (DWORD, 4 bytes) and double (QWORD, 8 bytes) on a 32-bit system. Compare int to float, or long to double, to get fair results. double has increased precision; you must pay for it.

PS: for me it smells like micro(+premature) optimization, and that smell is not good.

Edit: OK, good point. It is not correct to compare long to double, but comparing int and double on a 32-bit CPU is still not correct even if both have intrinsic instructions. This is not magic; x86 is a fat CISC, and a double is still not processed as a single step internally.

Andrey
The code is optimized and works very well. However, the original version only accepts double images. By introducing ints I expected a gain thanks to 32-bit integer operations.
Marco
+1 mainly for the PS part.
T.E.D.
Comparing long to double isn't fair either - IIRC, an x86 CPU has arithmetic instructions for double-precision floats, but not for 64-bit integers.
nikie
@nikie that's true, but comparing int and double on a 32-bit CPU is still not correct even if both have intrinsic instructions. This is not magic; x86 is a fat CISC, and a double is still not processed as a single step internally.
Andrey
Actually, it is fair to compare 4 byte int ops against 8 byte floating point ops in some cases. The C compiler may cast floats to doubles to do arithmetic and then back again to float. Floating point hardware usually has the ability to work on 8 byte floats natively.
JeremyP
+6  A: 

It seems like you are saying you are only running that inner loop 5000 times in even your longest case. The FPU, last I checked (admittedly a long time ago), only took about 5 more cycles to perform a multiply than the integer unit. So by using integers you would be saving about 25,000 CPU cycles. That's assuming no cache misses or anything else that would cause the CPU to sit and wait in either event.

Assuming a modern Intel Core CPU clocked in the neighborhood of 2.5 GHz, you could expect to have saved about 10 microseconds of runtime by using the integer unit. Kinda paltry. I do realtime programming for a living, and we wouldn't sweat that much CPU wastage here, even if we were missing a deadline somewhere.

digEmAll makes a very good point in the comments though. If the compiler and optimizer are doing their jobs, the entire thing is pipelined. That means that in actuality the entire inner loop will take 5 cycles longer to run with the FPU than with the integer unit, not each operation in it. If that were the case, your expected time savings would be so small it would be tough to measure them.

If you really are doing enough floating-point ops to make the entire shebang take a very long time, I'd suggest looking into doing one or more of the following:

  1. Parallelize your algorithm and run it on every CPU available from your processor (see the sketch after this list).
  2. Don't run it on the CLR (use native C++, or Ada or Fortran or something).
  3. Rewrite it to run on the GPU. GPUs are essentially array processors and are designed to do massively parallel math on arrays of floating-point values.
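
For option 1, a minimal sketch using the Task Parallel Library (assuming .NET 4 is available; FilterRowX, width and height are made-up names for the per-row work):

using System.Threading.Tasks;

// Each row only reads the input image and writes its own slice of the temp image,
// so the rows are independent and can be scheduled across all cores without locking.
// FilterRowX (hypothetical) runs the question's inner loop over one row.
static void FilterXParallel(byte[] inputImage, int[] tempImage, int width, int height,
                            int[] filterOffsetsX, int[] filterValuesX)
{
    Parallel.For(0, height, y =>
        FilterRowX(inputImage, tempImage, y * width, width, filterOffsetsX, filterValuesX));
}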
T.E.D.
OK, I simplified the code a bit. The outer loop runs over the entire image (500,000 iterations, for example). Then for each pixel the inner loop (no more than 25-27 iterations here, depending on the filter size) calculates the new pixel value. I think you guys are right.
Marco
Yeah, I really need to improve the code because I need it to build a scale space (an array of images smoothed with increasingly large Gaussians) in little time. It seems parallelization is the only (simple to implement) viable choice.
Marco
It **has** to run on the CLR then? That seemed the simplest of the three to try to me.
T.E.D.
@T.E.D.: .NET code is compiled at run-time to native machine code, and most operations will run more or less the same speed as direct-to-native code. There are exceptions, such as array indexing, which is slower because it's a function call that does bounds checking. For that reason, I recommended using pointers instead of array indexing, because I think that is actually the slowest part of his loop, according to my tests.
P Daddy
@P Daddy - I know how the CLR works. For most operations it is indeed not a lot slower. However, this is not "most operations", and "not a lot slower" can really add up when you do it half a million times in a tight loop. Optimizing stuff like this is the forte of traditional native compilers like C++, Fortran, Ada, etc, so let them do what they do best, and let your C# compiler do what it does best (make everything else easier). So just move the one routine with the tight loop out of the CLR into a native project, and call it from the CLR code. If it doesn't help enough, put it back.
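
The call site would be just a P/Invoke declaration; roughly this shape (a sketch only: the DLL name and signature here are made up, and the native side would be an ordinary exported C/C++ function with the same parameters):

using System.Runtime.InteropServices;

static class NativeFilter
{
    // Hypothetical native export; the marshaler pins the arrays for the duration of the call.
    [DllImport("ConvolutionNative.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern void FilterRowX(byte[] input, int[] output, int pixelCount,
                                         int[] filterOffsets, int[] filterValues, int filterLength);
}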
T.E.D.
@T.E.D.: I gave a fairly thorough treatment of this topic here: http://stackoverflow.com/questions/2505162/transfer-of-control-and-data-between-c-and-c/2505358#2505358. Of note, the cost of the P/Invoke layer may outweigh small benefits of implementing in a lower-level language. And I really think that in this case, the benefits would be non-existent or nearly so.
P Daddy
It's *possible* that avoiding the CLR will speed things up, but that's far from guaranteed. In other words, try it, but measure both ways.
Steven Sudit
A: 

On my machine, I find that floating-point multiplication is about the same speed as integer multiplication.

I'm using this timing function:

// Requires using System.Diagnostics (for Stopwatch).
static void Time<T>(int count, string desc, Func<T> action){
    action();   // warm-up call so JIT compilation isn't included in the timing

    Stopwatch sw = Stopwatch.StartNew();
    for(int i = 0; i < count; i++)
        action();

    double seconds = sw.Elapsed.TotalSeconds;

    Console.WriteLine("{0} took {1} seconds", desc, seconds);
}

Let's say you're processing a 200 x 200 array with a 25-length filter 200 times; then your inner loop executes 200 * 200 * 25 * 200 = 200,000,000 times. Each time, you're doing one multiply, one add, and three array index operations. So I use this profiling code:

const int count = 200000000;

int[]  a = {1};
double d = 5;
int    i = 5;

Time(count, "array index", ()=>a[0]);
Time(count, "double mult", ()=>d * 6);
Time(count, "double add ", ()=>d + 6);
Time(count, "int    mult", ()=>i * 6);
Time(count, "int    add ", ()=>i + 6);

On my machine (slower than yours, I think), I get the following results:

array index took 1.4076632 seconds
double mult took 1.2203911 seconds
double add  took 1.2342998 seconds
int    mult took 1.2170384 seconds
int    add  took 1.0945793 seconds

As you see, integer multiplication, floating-point multiplication, and floating-point addition all took about the same time. Array indexing took a little longer (and you're doing it three times), and integer addition was a little faster.

So I think the performance advantage to integer math in your scenario is just too slight to make a significant difference, especially when outweighed by the relatively huge penalty you're paying for array indexing. If you really need to speed this up, then you should use unsafe pointers to your arrays to avoid the offset calculation and bounds checking.
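
For example, something along these lines (a sketch only; compile with /unsafe, and the parameter names just stand in for your arrays):

// The pointer version of the inner loop: no bounds checks and no per-access offset math.
unsafe static void FilterRow(byte[] input, int[] temp, int start, int count,
                             int[] offsets, int[] weights)
{
    fixed (byte* pIn = input)
    fixed (int* pTemp = temp, pOff = offsets, pW = weights)
    {
        for (int i = start; i < start + count; i++)
        {
            int value = 0;
            for (int k = 0; k < offsets.Length; k++)
                value += pIn[i + pOff[k]] * pW[k];
            pTemp[i] = value;
        }
    }
}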

By the way, the performance difference for division is much more striking. Following the pattern above, I get:

double div  took 3.8597251 seconds
int    div  took 1.7824505 seconds

One more note:

Just to be clear, all profiling should be done with an optimized release build. Debug builds will be slower overall, and some operations may not have accurate timing with respect to others.

P Daddy
Try the same algorithm with either `float` or `_int64`, so you aren't timing differences due to the sheer amount of data being moved around.
T.E.D.
@T.E.D.: I don't think that's necessary. For one, I'm not really moving data around, since I'm performing the operation on the same datum repeatedly, I should be staying in L1 cache the whole time. So I'm pretty much timing just the operation itself. For two, I'm already showing that floating-point multiply is about the same speed as integer multiply. There's really no need to prove it further. By the way, there is no `_int64` type in C# (or in any other language, for that matter; the non-standard MSVC extension is `__int64`, with two underscores). In C#, `long` is 64 bits.
P Daddy
Ahh, I see. I was just looking at the division numbers, and was thinking the 2x might have something to do w/ the data being twice as much, but I see now from the numbers above that is not the case. I do remember from my school days (reading about the brand-spanking new "Pentium" chip) that floating-point division took by far more cycles than anything else. That's why a lot of folks try to rejigger all FP math as multiplies (by reciprocating one of the arguments outside of the tight loop).
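
In code, that trick is just (sketch, made-up names):

double scale = 1.0 / divisor;           // one divide, hoisted out of the loop
for (int i = 0; i < values.Length; i++)
    values[i] *= scale;                 // instead of values[i] /= divisor (slightly different rounding)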
T.E.D.
A: 

If the times you measured are accurate, then the runtime of your filtering algorithm seems to grow with the cube of the filter size. What kind of filter is that? Maybe you can reduce the number of multiplications needed (e.g. if you're using a separable filter kernel?).

Otherwise, if you need raw performance, you might consider using a library like the Intel Performance Primitives - it contains highly optimized functions for things like this that use CPU SIMD instructions. They're usually a lot faster than hand-written code in C# or C++.
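
Just to illustrate the SIMD idea in managed code (this is not IPP; System.Numerics.Vector<T> only exists in later .NET versions, and the names below are made up):

using System.Numerics;

// Vectorized dot product for one output pixel, assuming the image row and the
// filter have been converted to float[]. Vector<float>.Count lanes are processed
// per iteration, with a scalar tail for the remainder.
static float FilterTap(float[] row, int start, float[] kernel)
{
    float sum = 0f;
    int k = 0;
    for (; k <= kernel.Length - Vector<float>.Count; k += Vector<float>.Count)
        sum += Vector.Dot(new Vector<float>(row, start + k), new Vector<float>(kernel, k));
    for (; k < kernel.Length; k++)
        sum += row[start + k] * kernel[k];
    return sum;
}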

nikie
I'm using a Gaussian separable kernel of increasing size (I'm calculating a scale space; it is a very time-consuming process). Does IPP work in .NET?
Marco
@Marco: Applying a gaussian filter with size k to an image should be an O(k) operation, but the times you posted didn't look like O(k). Maybe you should check your algorithm. That said, yes, you can use IPP from C# using P/Invoke. Or you can use a .NET wrapper for OpenCV (OpenCV uses the IPP automatically if it's available)
nikie
Well, the code I posted is a 1D filter, but it is only half of the original 2D filter. The filter is applied along 2 dimensions, so the total complexity should be 2*k*m*n (m*n = image size).
Marco
A: 

Did you try looking at the disassembled code? In high-level languages I'm pretty much trusting the compiler to optimize my code. For example, for(i=0;i<imageSize;i++) might be faster than foreach. Also, arithmetic operations might get optimized by the compiler anyway... When you need to optimize something, you either optimize the whole "black box" and maybe reinvent the algorithm used in that loop, or you first take a look at the disassembled code and see what's wrong with it.

Quamis
+2  A: 

Your algorithm seems to access large regions of memory in a very non-sequential pattern. It's probably generating tons of cache misses. The bottleneck is probably memory access, not arithmetic. Using ints should make this slightly faster because ints are 32 bits, while doubles are 64 bits, meaning cache will be used slightly more efficiently. If almost every loop iteration involves a cache miss, though, you're basically out of luck unless you can make some algorithmic or data structure layout changes to improve the locality of reference.
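
One classic layout change for separable filters (a sketch of the idea, not the OP's code; FilterRows and Transpose are hypothetical helpers, and the input is assumed to be widened to int[] for the sketch): run the same row-wise filter for both passes and transpose in between, so the Y pass also walks memory sequentially.

int[] xPass  = FilterRows(image, width, height, kernel);    // X pass: sequential, row-major reads
int[] xPassT = Transpose(xPass, width, height);             // width x height -> height x width
int[] yPassT = FilterRows(xPassT, height, width, kernel);   // Y pass, still row-major
int[] output = Transpose(yPassT, height, width);            // back to the original layout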

BTW, have you considered using an FFT for convolution? That would put you in a completely different big-O class.

dsimcha
+1 for suggesting something other than micro-optimization.
Steven Sudit
Can you explain why you think I have many cache misses? I'm using a linear "index" that iterates through adjacent memory cells. Also, "filterOffsetsX" contains small offsets relative to pixels on the same row, at a max distance of filter size / 2.
Marco
I think the FFT doesn't suit my problem very well (building a pyramid of incrementally convolved images). I don't have a single big convolution but an iteration of convolutions with small-to-medium filters. If I figure out how to apply the FFT in an incremental fashion, then it could be useful.
Marco
The cache-miss problem, however, may be present in the Y-filtering part of the separable convolution. But this does not seem to affect performance very much (I get similar times from both X and Y filtering).
Marco