views:

1239

answers:

16

Hello, I have a performance issue in a bottleneck section in my code. Basically it's a simple nested loop.

Profiling the issue reveals that the program spends a lot of time just incrementing both of the loop counters (++) and testing for termination (i/j < 8).

Watching the assembly output, I see that both counters don't get registers and accessing them costs a lot of cycles. Using the "register" keyword doesn't convince the compiler to actually put them in registers. Is there something that can be done to optimize the counters' access time?

Here's the assembly output. The C source is just a simple nested loop with i/j counters.

  2738  0.2479  2459  0.1707   :    1e6c:   jne    1dd1 <process_hooks+0x121>
  1041  0.0942  1120  0.0778   :    1e72:   addl   $0x1,0xffffffd4(%ebp)
  2130  0.1928  2102  0.1459   :    1e76:   cmpl   $0x8,0xffffffd4(%ebp)
  2654  0.2403  2337  0.1622   :    1e7a:   jne    1da0 <process_hooks+0xf0>
  809   0.0732   814  0.0565   :    1e80:   jmp    1ce2 <process_hooks+0x32>

As requested, here's the C code as well. Compiler is gcc btw:

for (byte_index=0; byte_index < MASK_SIZE / NBBY; byte_index++)
{
    if (check_byte(mask,byte_index))
    {
        for (bit_index=0; bit_index < NBBY; bit_index++)
        {
            condition_index = byte_index*NBBY + bit_index;
            if (check_bit(condition_mask,condition_index))
            {
                .
                .
                .
            }
        }
    }
}

Thanks

+7  A: 

When a loop counter becomes a performance bottleneck, you should consider unrolling the loop.

EDIT: As always, when optimizing, make sure you benchmark and convince yourself that you get the desired result.

Laserallan
Unrolling loops too far is actually bad for performance, since it might cause the whole loop (or other bits) to no longer fit in the cache.
Toad
The compiler will normally try to unroll the loop if it thinks it will improve performance.
Geerad
If it *might* cause the loop to not fit into the cache, then it *might* be bad for performance. "Is actually bad" is an overstatement. Measure before you optimize.
Tadeusz A. Kadłubowski
C compilers sometimes can't unroll as aggressively as they should because of pointer aliasing. This means that some manual work can make a difference.
Laserallan
The OP says above in the comments that he's only using -O2, so the loop just isn't getting unrolled. Use -O3 to get unrolling.
tgamblin
A: 

You can try unrolling the loop. The compiler might do this for you, but if not, and you really need the performance, do it yourself. I assume you're doing something like calling function(.., i, j, ..) each iteration, so just replace the loops with:

function(.., 0, 0, ..)
...
function(.., 0, 7, ..)
function(.., 1, 0, ..)
...
function(.., 7, 7, ..)

With some more context (C source), there might be more helpful things to do. Frankly though, I would be shocked if 2 stack-allocated counters (many modern processors have special accelerator hardware to access the top of the stack nearly as quickly as registers) caused a noticeable problem in a non-toy program.

Matt J
Calling a function inside a loop is bad for performance right there. If you can, put the contents of that function inside the loop. It will save an enormous amount of time (more than the loop unroll, which, like I said before, might really be dreadful for performance due to cache misses).
Toad
True, inlining and unrolling+inlining are viable solutions as well. In my experience, the compiler is more likely to do inlining for you than unrolling, so unrolling in your C code is moderately likely to yield unrolling+inlining in the compiled code, assuming the function is small enough not to cause ghastly numbers of cache misses.
Matt J
+2  A: 

This page suggests that "the register keyword is a somewhat antiquated mechanism, since for quite a long time the optimizers in modern compilers have been smart enough to detect when storing a variable in a register will be advantageous, and will do so during optimization. Therefore, suggesting that the compiler store a variable in a register can only make things slower if used incorrectly".

I'm guessing that this is largely dependent on your compiler and optimization level. As others have said, this may be a good candidate for -funroll-all-loops (gcc).

anthony
All modern C++ compilers that I know ignore `register` outright. Given the tricks that they pull with register allocation, it is entirely understandable.
Pavel Minaev
+5  A: 

I get the best results (speed-wise) when using the Intel compiler.

You are right in saying that the 'register' keyword is just meant as a hint for the compiler (just like inline).

If you really think this loop is a major bottleneck, just write raw assembly. I know it's hardly portable, but then again, usually this doesn't matter much, and if it does have to stay portable... the non-portable code is confined to one specific place.

You can even #ifdef the whole section against the original C code to maintain portability.

Toad
+4  A: 

You need to be sure this is a bottleneck. On modern processors, where instructions are broken apart and parts of them are executed out of order, and with caches and lookaside buffers in play, it's entirely possible that this isn't any slower.

John Burton
+1  A: 

I hope these two functions (check_bit and check_byte) are inlined, since calling them costs far more than any register variable could save in your loop.

If the compiler doesn't inline them, inline them yourself into the loop.

Toad
They are inlined. They are simple bitwise operations and they don't use local variables. I do call a function inside the loop which I can't inline because it's too heavy. I checked the overhead and it was negligible compared to the loop counters.
Nir
A function call is much heavier than any loop counter. Why? It has to save variables on the stack, (possibly) break the pipeline, (possibly) go out of cache, and restore the variables again. The counter issue is mere peanuts compared to this.
Toad
Furthermore, try to see if the multiplication is translated into a bit shift by the compiler. Since NBBY is probably 8, 16, or 32, this is possible. The same goes for the division.
Toad
Last but not least... instead of constantly recalculating 'condition_index', just increment it every iteration.
Toad
@reiner: you might want to edit your answer instead of commenting on it, makes it a nicer read.. :)
roe
A: 

If the compiler tried to put the counters into registers, the registers would have to be saved and restored for every function call inside your loop (probably depending on where those functions are defined). Inlining the functions should speed up things considerably (if this is really your bottleneck).

groovingandi
+12  A: 

There are two possible reasons it doesn't get put in a register:

The variable needs to be kept in memory

If you are taking the address of the variable or declaring it volatile, it won't be kept in a register. It doesn't look like you are doing that, but it might happen in the ... section.

gcc is doing a bad job of register allocation.

This is quite likely. gcc seems to have a poor allocator (based on the comments of its developers). In addition, register allocation is fickle and hard to reason about. You will probably be able to tweak it to get some benefits using register allocator optimizations. If you like, you can set them for that function only.

gcc 4.4 has a new register allocator, which is supposed to be better, but also allows you to choose the allocation algorithm. That will provide extra things to tweak.

You can also try telling gcc to try harder, with the hot attribute.

Finally, you can also tweak things using gcc's --param flags. They expose internal compiler settings, so this should probably not be embarked upon lightly.

Paul Biggar
>> gcc is doing a bad job of register allocation. Or possibly it's doing a good job of register allocation and is using the registers for other things instead.
John Burton
Could be - hard to say. But since gcc has a very poor register allocator, that wouldn't be my first guess.
Paul Biggar
+5  A: 
    for (bit_index=0; bit_index < NBBY; bit_index++)
    {
        condition_index = byte_index*NBBY + bit_index;
        if (check_bit(condition_mask,condition_index))
        {
            .
            .
            .
        }
    }

could just as easily be:

    condition_index = byte_index * NBBY;
    for (bit_index=0; bit_index < NBBY; bit_index++, condition_index++)
    {
        if (check_bit(condition_mask,condition_index))
        {
            .
            .
            .
        }
    }

I'm a fan of keeping calculations in the correct scope. You have all the information for this one in the outer loop, yet you choose to compute it in the inner loop. The new loop is slightly messier, but this can be avoided, and now it's more likely that your compiler will do things correctly. (It probably did before, but one can't be sure without checking the assembly.)

Speaking of the correct scope, there's no reason to declare your loop counters outside of the loop. That C style has been outdated for years, and while it probably isn't specifically a performance disadvantage, restricting things to the smallest logical scope leads to cleaner and more maintainable code.

For 8 bits you could likely unroll, but depending on your hardware it might not work very well. There are a lot of other ways you could do this too; I probably missed a few looking over it. On hardware I've worked with, conditionals within loops are generally poison to performance, but I don't see any obvious way to avoid them here. I'd certainly consider iterating over bits rather than bytes in the outer loop to avoid a multiply in the common case. Just suggesting this... I'm thinking in this case there wouldn't be a clear advantage.

Dan Olson
Could you explain how this will help optimize his code? As far as I can tell, it will actually take an *extra* register, unless the compiler moves it back in (which defeats the purpose).
Paul Biggar
Assuming the compiler cannot do it for you (which it should), the second code sample avoids recalculating byte_index*NBBY each loop iteration.
Matt J
@Matt J: But it's been determined that that is not the bottleneck.
Paul Biggar
Why would taking an extra register be a bad thing? Perhaps it is not a bottleneck, but it's hard to tell what the compiler is doing. And putting the calculation in the more logical scope makes the loop easier to unroll. I'm just shooting in the dark here, throwing out advice and it can all be taken individually for what it's worth.
Dan Olson
He said it in the question. Quote: "Watching the assembly output I see that both counters don't get registers and accessing them costs a lot of cycles. Using the "register" keyword doesn't convince the compiler to actually put them in registers. is there something which can be done to to optimize the counters access time?"
Paul Biggar
A: 

Assuming the profiling information is correct and it indeed is the increment operations that cause a bottleneck, you could possibly abuse some out-of-order execution:

for (byte_index = 0; byte_index < MASK_SIZE / NBBY; )
{
    if (check_byte(mask,byte_index++))
    {
        condition_index = byte_index*NBBY;
        for (bit_index=0; bit_index < NBBY; )
        {
            if (check_bit(condition_mask,condition_index + bit_index++))
            {
                ...
            }
        }
    }
}

(the above snippet won't work for obvious reasons, but you should get the idea :)

Also, from the function/macro names in your C snippet, I assume you're working with bit masks to do stuff. One thing that has helped me improve performance earlier is iterating over an array of masks rather than doing dynamic calculations on the input, i.e. something like

for (byte_index = 0; byte_index < MASK_SIZE / NBBY; byte_index++)
{
    if (check_byte(mask,byte_index))
    {
        const char masks[] = { 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 };
        for (mask_index=0; mask_index < sizeof(masks) / sizeof(masks[0]); mask_index++)
        {
            if (check_bit(masks[mask_index], byte_index))
            {
                ...
            }
        }
    }
}

...which the compiler might have a better chance of optimizing/unrolling properly.

Christoffer
Bit-shifting the initial mask is much faster than a costly array lookup every time.
Toad
Well, of course. But I've seen more than once that "simple" code gives the compiler a better basis for optimization, especially when talking about loops -- you don't do vectorization by hand in C code.
Christoffer
+1  A: 

You should change your design: the inner loop shouldn't exist in the first place. Avoid working with bits; transform your bit checks into a single byte check. I can't tell you exactly how, as it depends on the type of check you do, but I assume a lookup table will be involved.

EDIT: Another thing to consider: if you really want to make a part of the code faster, you may use special CPU instructions that your compiler will probably be clueless about when to use. For example, on Intel there are many instructions available, up to SSE4 and beyond. This is really where you can perform better than your compiler, as it has no way of knowing what you want to achieve at the algorithm level. Check out the Intel(R) 64 and IA-32 Architectures Software Developer's Manual for details. At this level you may also benefit from better control over the pipeline.

If you don't want to write assembly, there are sometimes wrapper functions for these instructions that can be used from C.

Regarding checking whether a bit is on or not: I'm not sure what you want to do when a bit is on, but (assuming your bits are byte-aligned):

Let's assume you'll get the byte 0110 0110 in X. You will want to do some stuff, maybe print a message like "Bits 1,2,5,6 are on". You can create 256 functions, each doing something like displaying this kind of message. How would you know which one to activate? The function number will be exactly the value of the byte received, so you can simply use the [] operator to go there. It will be a table of pointers to functions, however. It should look something like this:

//define the functions
void func0()
{
   printf("No Bits are on.");
}

void func1()
{
   printf("Bit 0 is on.");
}
.
.
.

//create the table
void  (*table[256])();
table[0] = &func0;
table[1] = &func1;
.
.
.

//the for loop
void  (*pointer_to_func)();
for...
{
   X = getByte();
   pointer_to_func = table[X]; //table holds 256 function pointers.
   pointer_to_func(); //call the function
}

This should look up the function at position X and execute it. I assume the function at location X == 102 (the decimal value of 0110 0110) will be something like:

printf("Bits 1,2,5,6 are on");

See The Function Pointer Tutorials, specifically this.

Liran Orevi
Why will that help? It won't change the number of checks I'll have. Also, it will use up valuable stack space.
Nir
It should reduce 8 bit checks, which are really expensive bit-manipulating byte checks, to a single byte check. The problem here arises directly from the fact that most modern CPUs don't work well at the bit level, so you should always aspire to work at least at the byte level, if not with bigger data objects. I would be happy to assist further if you describe "Bit-Check". (Your CPU doesn't support direct bit manipulation, right? Because if it does, my suggestion is much less useful.)
Liran Orevi
Nir
@Nir, edited my answer, any feedbacks are welcome.
Liran Orevi
A: 

Without seeing what is in the inner loop, there's no point in trying to optimize the loops. It looks like the code is generated for 32-bit x86. If the calculation in the loop needs several registers, there is no point for the compiler to hold the loop counters in registers, as it would have to spill them to the stack anyway. Then, depending on the instructions used in the inner loop, there can be quite some problems with register allocation: shifts use only the ECX register for the count, multiplication and division have restrictions on which registers they use, and string instructions use ESI and EDI, reducing the compiler's opportunity to hold values in them. And as others have already stated, the call in the middle of the loop does not help either, as the registers have to be saved anyway.

tristopia
+1  A: 

You could try refactoring it down to one index and see if that changes the compiler's mind:

for (condition_index = 0; condition_index < MASK_SIZE;)
{
    if (check_byte(mask, condition_index / NBBY))
    {
        for (bound = condition_index + NBBY; condition_index < bound; condition_index++)
        {
            if (check_bit(condition_mask, condition_index))
            {
                /* stuff */
            }
        }
    }
    else
    {
        condition_index += NBBY;
    }
}

(Hopefully NBBY is a power of 2 so the divide will be implemented as a shift)

caf
Tried it. The counter is still using the base pointer rather than a register, but it actually makes things a bit better (5% better). I'm not sure why, but I can't argue with results... So thanks!
Nir
Hmm, it must be the function call within /* stuff */. If that's not happening very often, you could try wrapping check_bit() up as __builtin_expect(check_bit(...), 0) to indicate to the compiler that it's usually false.
caf
+1  A: 

If it is true that the /* stuff */ code within the inner if() is only rarely executed (and there is an a priori known bound on the number of times it can happen or at least a reasonable limit), it could well be a performance improvement to change to a two-pass solution. This will eliminate the register pressure in the nested loops. The following is based on my previous single-index answer:

for (n = 0, condition_index = 0; condition_index < MASK_SIZE;)
{
    if (check_byte(mask, condition_index / NBBY))
    {
        for (bound = condition_index + NBBY; condition_index < bound; condition_index++)
        {
            if (check_bit(condition_mask, condition_index))
            {
                condition_true[n++] = condition_index;
            }
        }
    }
    else
    {
        condition_index += NBBY;
    }
}

while (n > 0) {
    condition_index = condition_true[--n];
    /* Stuff */
}
caf
A: 

I would try to look at the problem differently.

What exactly is the code doing? Perhaps by explaining what it does, a different, more efficient algorithm can be found.

For example, I often see code that iterates over large lists of items that can be made much faster by splitting the list into two linked lists, one of "active" items and one of "non-active" items, and then having code move items from one list to the other as items are allocated and freed. I think that would give you the best results.

KPexEA
A: 

This may seem like a minor point, but instead of using the form index++, use ++index.

The rationale is that index++ requires caching the current rvalue before incrementing, whereas ++index returns the newly calculated rvalue, which should already be cached, thus saving a reference.

Of course, a good compiler will optimize this out, so that it's probably not an issue.

dar7yl