views:

1315

answers:

14

Hi all!

I'm a fairly competent Java programmer who's very new to C. I am trying to optimize a routine that has four modes of operation.

I loop over all the pixels in an image and compute a new pixel value depending on the 'mode' passed.

My question regards the overhead of a switch statement within two nested for loops. I'd be interested in any links to documentation regarding the relative efficiency of basic C statements, math and logical operations.

The code would go as follows;

for (x = 0; x < width; x++) {
     for (y = 0; y < height; y++) {
       switch (mode)                  /* select the type of calculation */
       {
          case 0:
       weight = dCentre / maxDistanceEdge;
          case 1: 
              weight = (float)x/width;             
              break;
          case 2: 
           weight = (float)y/height;
              break;
          case 3: 
           weight = dBottomLeft / maxDistanceCorner;
              break;
          case 4: 
        weight = dTopRight / maxDistanceCorner;
              break;
          default: 
       weight = 1;
       break;
      }
             // Calculate the new pixel value given the weight
             ...
            }             

    }

Would you expect to see much overhead if this was over a 5000 x 5000 pixel image? I've tried to do some testing but my results are all over the place as the system (Mobile Device) has all sorts of stuff running in the background that may skew results.

The other option is to have a separate method for each mode, each with its own four loops. This would obviously introduce redundant code but efficiency is the name of the game here.

Thanks in advance!

Gav

+4  A: 

Compared with the maths you are doing in the loop, the overhead of the switch will probably be minimal. having said that, the only way to be sure is to create different versions for the two different approaches, and time them.

anon
+5  A: 

Switch statements are about as efficient as they can possibly be. They're compiled to a jump table. In fact, that's why switch is as limited as it is: you can only write a switch for which you can compile a jump tables based on a fixed value.

Charlie Martin
Yes, but they're not totally free and for a 5000 x 5000 image that comparison/lookup and jump would be done 25,000,000 times. Not saying that the switch is definitely his bottleneck, just that we shouldn't be so quick to dismiss removing it since it is cheap
Michael
In fact, it's not required for a switch that it can compile down to a jump table in order for you to write it. If you chose weird values, the compiler may perfectly well decide to use ordinary branches instead of a jump table. seeing a switch and thinking it's zero cost is misleading. I guess a measurement session or a gcc -S session is called for :)
Johannes Schaub - litb
Well, the primary question seemed to be the efficiency of the structure. Actually, overall, he should probably be thinking polymorphism.
Charlie Martin
Charlie Martin
@Charlie, right that is what i was implying. But you say in your answer that you cannot write a switch that can't be translated into a jump table... Which i tried to hint you at. wrt to "zero cost", that wasn't directed to you in particular :)
Johannes Schaub - litb
Polymorphism in C?
Carson Myers
@Carson For some reason I was convinced he'd said C++. Sorry. @litb, I think we're hitting darfen and mußen. The point is that the constraints are there to make it possible; you're right that the compiler can do what it pleases when hit with the specific example.
Charlie Martin
+1  A: 

Switches shouldnt produce any significant overhead, they get compiled into a sort of array of pointers at the low end, then it's a case of effectively:

jmp {baseaddress}+switchcasenum

Adam Frisby
Computed jumps can be expensive, actually :) But as I mentioned in my answer, such a small switch may end up as a decision tree
bdonlan
They are not necessarily compiled into a jump table. So you could write a switch that doesn't use a jump table, if the value relation is too complex to be layered down to such ones
Johannes Schaub - litb
+7  A: 

If efficiency is more important than code size, then yes you should create redundant routines. The case statement is one of the lower overhead things you can do in C, but it's not zero - it's going to have to branch based on the mode, and so it's going to take time. If you really want max performance, get the case out of the loop, even at the cost of duplicating the loop.

Michael Kohne
+1, for a 5,000x5,000 picture you can run the run the switch 1 time or 25,000,000 times, which is slower?
KM
Easily, 25,000,000 would be slower. But are we talking about 5 clock cycles per loop slower (totally picked at random, who knows how expensive it is.) That's 125,000,000 clock cycles, that's .125s on a 1 GHz CPU. That could be noticable if his image processing is currently taking a second. What if it's taking a minute? Or even just 5 seconds? If so, this isn't necessarily where he should be spending his energy, and might not be worth the hit to overall readability.
Michael
+1  A: 

This would probably depend on how good your CPU's branch predictor is, and how your compiler generates the code for the switch. For such a small number of cases, it might generate a decision tree, in which case normal CPU branch prediction should be able to remove most of the overhead. Things might be a bit worse if it generates a switch table...

That said, the best way to find out is to profile it and see.

bdonlan
A: 

Depends on the chip and the compiler and the details of the code, and...but this will often be implemented as a jump table, which should be pretty fast.

BTW-- Understanding this kind of thing is a pretty good argument for spending a couple of weeks learning some assembly at some point in your career...

dmckee
A: 

Using a switch is probably better both for speed and programmer time. You're making less redundant code and it probably won't require a new stack frame.

Switches are so efficient that they can used for really weird and confusing black magic.

Matt Kane
@Matt Kane. Your example of Duff's device doesn't prove anything about the efficacy of a switch. The switch will only be evaluated once. That said, Duff's device may be an excellent way to speed up his code. And it is always fun to use.
jabbie
+13  A: 

Switch statements compile to a jump table for consecutive values and to a bunch of if-else statements for sparse values. In any case, you don't want a switch statement in your inner loop for image processing if you care about performance. You want to as below instead.

Also, note that I moved the weight calculation out of the inner loop (and swapped the loops for case 2 in order to achieve this). This type of thinking, moving stuff out of the inner loop, will get you the performance you want out of C.

switch (mode)                  /* select the type of calculation */
{
case 0:
    weight = dCentre / maxDistanceEdge;
    for (x = 0; x < width; x++) {
        for (y = 0; y < height; y++) {
             // Calculate the new pixel value given the weight
             ...
        }
    }
    break;
case 1:
    for (x = 0; x < width; x++) {
        weight = (float)x/width;
        for (y = 0; y < height; y++) {
             // Calculate the new pixel value given the weight
             ...
        }
    }
    break;
case 2:
    // note - the loops have been swapped to get the weight calc out of the inner loop
    for (y = 0; y < height; y++) {
        weight = (float)y/height;
        for (x = 0; x < width; x++) {
             // Calculate the new pixel value given the weight
             ...
        }
    }
    break;
case 3:
    weight = dBottomLeft / maxDistanceCorner;
    for (x = 0; x < width; x++) {
        for (y = 0; y < height; y++) {
             // Calculate the new pixel value given the weight
             ...
        }
    }
    break;
case 4:
    weight = dTopRight / maxDistanceCorner;
    for (x = 0; x < width; x++) {
        for (y = 0; y < height; y++) {
             // Calculate the new pixel value given the weight
             ...
        }
    }
    break;
default:
    weight = 1;
    for (x = 0; x < width; x++) {
        for (y = 0; y < height; y++) {
             // Calculate the new pixel value given the weight
             ...
        }
    }
    break;

// etc..
}
Jim Buck
I think this may be a good use for macros. #define LOOP for (x = 0; x < width; x++) for (y = 0; y < height; y++) at the start of switch, and #undef LOOP at the end of it.
Johannes Schaub - litb
If you compile with GCC the -funswitch-loops option does exactly that btw..
Nils Pipenbrinck
Except for where my edit of my answer swaps the loops for case 2. :)
Jim Buck
There's no guarantee that the cost of the switch is even his bottleneck. It'd probably be one memory read, a compare, and a jump - all of which would be in the cache and most likely accurately predicted by the CPU so we are talking an overhead on the order of a handful of clock cycles per loop. If "calculate the new pixel value given the weight" has to go out to main memory at all, it is probably already an order of magnitude greater than the switch overhead. I agree with Neil's answer above - he should try both and measure to see which is more performant.
Michael
Excellent answer, except you buried the key point. You *never* want to do inside a loop something which doesn't change (e.g., the choice of the switch; the weight 4 out of 6 times, etc)
James Curran
@James - That's a good philosophy to have, when writing new code, and when hoisting the operation out has no effect on readability or maintainability.
Michael
I tend to believe branches in loops cause branch prediction failure. But then again it's a loop and the branch happens often, so the cpu will probably predict the branch right anyway. I'm a asm n00b, can anyone tell me whether that's right thinking?
Johannes Schaub - litb
@litb - I suggest posting a new question, I can answer but it might go beyond a comment :)
Michael
Well, anyone that's done any image processing in C (*raising my own hand*), anything you can do to squeeze performance is a win. We have no idea what his weight is doing in the inner loop, but moving out the switch statement is most definitely a win no matter what. A handful of clock cycles on a 5000x5000 image adds up very quickly. Even a 10% speedup would make a huge difference when multiplied by dozens of images.
Jim Buck
@James - I moved the note about the additional optimization to the top so it doesn't visually get lost after the code. Good point.
Jim Buck
@nils: Using -O3 also includes -funswitch-loops, and more. I'd be more apt to use that.
Brian
The swapping of the loops in case 2 might do much more harm than good. You will probably now be accessing the data out of order, which might completely kill the performance. The extra division operation is probaly much cheaper (or you might precalculate all the weights). That being said: measure it.
Rasmus Faber
@Rasmus Faber: That depends on whether the image data is stored in row-major or column-major order. If column-major, then you're right. If row-major, then switching the order of the loops would be advantageous in all cases to improve locality of reference.
Adam Rosenfield
@Jim Thanks for the code, I used your method and I'm working on getting some more accurate performance measures.Thank you everyone for your help, some really interesting ideas and topics and a great deal of reading for me to do.
gav
Just want to notice that the "Jumptables" in the original answer is not correct. Many compiler including MSVC decide on the data set if they want to turn into jumptables or binary sorted cascading if's if the labels do not have a dense value range.
Lothar
A: 

In addition to Jim's advice, try swapping the order of the loops. Whether loop-swapping is ideal for case 1 would require testing, but I suspect it is. You almost always want your x coordinate inside your inner loop in order to improve paging performance, as this causes your function to have a better tendency to stay in the same general memory area each iteration. And a mobile device with limitted resources might have low enough ram that this difference will be emphasized.

Brian
+1  A: 

Switch/case is extremely fast compared to the equivalent with if/else: it is typically implemented as a jump table. However it still has a cost.

While you are optimizing things:

1) Try to loop over lines, not over columns (switch x and y "for" loops), one solution may be incredibly faster than the other, due to cache memory management.

2) Replacing all divisions by multiplications of the (pre-calculated) inverse will give you significant gain, and probably an acceptable precision loss.

Jem
A: 

but efficiency is the name of the game here.

iterating over an image buffer in order to compute new pixel values sounds like a typical embarrassingly-parallel problem, in this sense you might want to consider pushing some of the work into worker threads, this should speed up your operation more notably than micro-optimizations such as switch/case concerns.

Also, instead of doing the branching instructions every time, you could invoke a function pointer from an array of function pointers, where the index serves as your mode identifier.

So that you end up with calls such as:

computeWeight[mode](pixel);

With 5000x5000 pixels, the function call overhead could also be reduced by calling the function for a range of pixels, rather than individual pixels.

You could also use loop unrolling and parameter passing by reference/pointer, in order to optimize this further.

none
+2  A: 

For efficiency sake you better move switch outside the loop.

I'd use function pointers like this:

double fun0(void) { return dCentre/maxDistanceEdge; }
double fun1(void) { return (float)x/width; }
/* and so on ... */

double (*fun)(void);

switch (mode)                  /* select the type of calculation */
{
    case 0: fun = fun0;
            break;
    case 1: fun = fun1;
            break;
    case 2: fun = fun2;
            break;
    case 3: fun = fun3;
            break;
    case 4: fun = fun3;
            break;
    default : fun = fun_default;
            break;
}

for (x = 0; x < width; x++) {
        for (y = 0; y < height; y++) {
             weight = fun();
             // Calculate the new pixel value given the weight
             ...
        }
}

It adds function call overhead but it shouldn't be too big as you pass no params to the function. I think it is good trade-off between performance and readability.

EDIT: If you use GCC, to get rid of function call you can use goto and labels as values: find the right label within the switch and then just jump to it every time. I think it should save few more cycles.

qrdl
A: 

Many good points are already given. Only thing I could think of to add to this, is to move the most frequent cases up in the switch and the least frequent down.

So if case 4 happens more often than case 1, it should be above it:

switch (mode) {
    case 4:
        // ..
        break;
    case 1:
        // ..
        break;
}

Too bad you weren't using c++, because then the switch statement could be replaced with polymorphism.

Cheers !

Magnus Skog
A: 

Sorry to bump this thread, but it seems to me that the switch is FAR from the problem.

The real problem with the efficiency in that case is the divisions. It seems to me that all of denominators of the division operations are constants (width, height, max...) and these will not change throughout the course of the image. If my guess is right, then these are simple variables that can change based on the image loaded so that any size image can be used at runtime, now this allows for any image size to be loaded, but this also means the compiler cannot optimize them into the much simpler multiplication operation which it could do if they were declared "const". My suggestion would be to pre-calculate the inverses of these constants and multiply. As far as I can remember, the multiplication operation takes about 10 clock cycles, where as the division takes around 70. That's an increase of 60 cycles per pixel, and with the above mentioned 5000x5000, that's an estimated speed increase of 1.5 seconds on a 1 GHz CPU.

Scott Bennett