views:

960

answers:

5

I have a simple function with an inner loop - it scales the input value, looks up an output value in a lookup table, and copies it to the destination. (ftol_ambient is a trick I copied from the web for fast conversion of float to int).

for (i = 0;  i < iCount;  ++i)
{
    iScaled = ftol_ambient(*pSource * PRECISION3);
    if (iScaled <= 0)
        *pDestination = 0;
    else if (iScaled >= PRECISION3)
        *pDestination = 255;
    else
    {
        iSRGB = FloatToSRGBTable3[iScaled];
        *pDestination = iSRGB;
    }
    pSource++;
    pDestination++;
}

Now my lookup table is finite, and floats are infinite, so there's a possibility of off-by-one errors. I created a copy of the function with some code to handle that case. Notice that the only difference is the added 2 lines of code - please ignore the ugly pointer casting.

for (i = 0;  i < iCount;  ++i)
{
    iScaled = ftol_ambient(*pSource * PRECISION3);
    if (iScaled <= 0)
        *pDestination = 0;
    else if (iScaled >= PRECISION3)
        *pDestination = 255;
    else
    {
        iSRGB = FloatToSRGBTable3[iScaled];
        if (((int *)SRGBCeiling)[iSRGB] <= *((int *)pSource))
            ++iSRGB;
        *pDestination = (unsigned char) iSRGB;
    }
    pSource++;
    pDestination++;
}

Here's the strange part. I'm testing both versions with identical input of 100000 elements, repeated 100 times. On my Athlon 64 1.8 GHz (32 bit mode), the first function takes 0.231 seconds, and the second (longer) function takes 0.185 seconds. Both functions are adjacent in the same source file, so there's no possibility of different compiler settings. I've run the tests many times, reversing the order they're run in, and the timings are roughly the same every time.

I know there's a lot of mystery in modern processors, but how is this possible?

Here for comparison are the relevant assembler outputs from the Microsoft VC++6 compiler.


; 173  :  for (i = 0;  i < iCount;  ++i)

$L4455:

; 174  :  {
; 175  :   iScaled = ftol_ambient(*pSource * PRECISION3);

 fld DWORD PTR [esi]
 fmul DWORD PTR __real@4@400b8000000000000000
 fstp QWORD PTR $T5011[ebp]

; 170  :  int i;
; 171  :  int iScaled;
; 172  :  unsigned int iSRGB;

 fld QWORD PTR $T5011[ebp]

; 173  :  for (i = 0;  i < iCount;  ++i)

 fistp DWORD PTR _i$5009[ebp]

; 176  :   if (iScaled <= 0)

 mov edx, DWORD PTR _i$5009[ebp]
 test edx, edx
 jg SHORT $L4458

; 177  :    *pDestination = 0;

 mov BYTE PTR [ecx], 0

; 178  :   else if (iScaled >= PRECISION3)

 jmp SHORT $L4461
$L4458:
 cmp edx, 4096    ; 00001000H
 jl SHORT $L4460

; 179  :    *pDestination = 255;

 mov BYTE PTR [ecx], 255   ; 000000ffH

; 180  :   else

 jmp SHORT $L4461
$L4460:

; 181  :   {
; 182  :    iSRGB = FloatToSRGBTable3[iScaled];
; 183  :    *pDestination = (unsigned char) iSRGB;

 mov dl, BYTE PTR _FloatToSRGBTable3[edx]
 mov BYTE PTR [ecx], dl
$L4461:

; 184  :   }
; 185  :   pSource++;

 add esi, 4

; 186  :   pDestination++;

 inc ecx
 dec edi
 jne SHORT $L4455


$L4472:

; 199  :  {
; 200  :   iScaled = ftol_ambient(*pSource * PRECISION3);

 fld DWORD PTR [esi]
 fmul DWORD PTR __real@4@400b8000000000000000
 fstp QWORD PTR $T4865[ebp]

; 195  :  int i;
; 196  :  int iScaled;
; 197  :  unsigned int iSRGB;

 fld QWORD PTR $T4865[ebp]

; 198  :  for (i = 0;  i < iCount;  ++i)

 fistp DWORD PTR _i$4863[ebp]

; 201  :   if (iScaled <= 0)

 mov edx, DWORD PTR _i$4863[ebp]
 test edx, edx
 jg SHORT $L4475

; 202  :    *pDestination = 0;

 mov BYTE PTR [edi], 0

; 203  :   else if (iScaled >= PRECISION3)

 jmp SHORT $L4478
$L4475:
 cmp edx, 4096    ; 00001000H
 jl SHORT $L4477

; 204  :    *pDestination = 255;

 mov BYTE PTR [edi], 255   ; 000000ffH

; 205  :   else

 jmp SHORT $L4478
$L4477:

; 206  :   {
; 207  :    iSRGB = FloatToSRGBTable3[iScaled];

 xor ecx, ecx
 mov cl, BYTE PTR _FloatToSRGBTable3[edx]

; 208  :    if (((int *)SRGBCeiling)[iSRGB] <= *((int *)pSource))

 mov edx, DWORD PTR _SRGBCeiling[ecx*4]
 cmp edx, DWORD PTR [esi]
 jg SHORT $L4481

; 209  :     ++iSRGB;

 inc ecx
$L4481:

; 210  :    *pDestination = (unsigned char) iSRGB;

 mov BYTE PTR [edi], cl
$L4478:

; 211  :   }
; 212  :   pSource++;

 add esi, 4

; 213  :   pDestination++;

 inc edi
 dec eax
 jne SHORT $L4472


Edit: Trying to test Nils Pipenbrinck's hypothesis, I added a couple of lines before and inside of the loop of the first function:

int one = 1;
int two = 2;

        if (one == two)
            ++iSRGB;

The run time of the first function is now down to 0.152 seconds. Interesting.


Edit 2: Nils pointed out that the comparison would be optimized out of a release build, and indeed it is. The changes in the assembly code are very subtle, I will post it here to see if it provides any clues. At this point I'm wondering if it's code alignment?

; 175  :  for (i = 0;  i < iCount;  ++i)

$L4457:

; 176  :  {
; 177  :   iScaled = ftol_ambient(*pSource * PRECISION3);

 fld DWORD PTR [edi]
 fmul DWORD PTR __real@4@400b8000000000000000
 fstp QWORD PTR $T5014[ebp]

; 170  :  int i;
; 171  :  int iScaled;
; 172  :  int one = 1;

 fld QWORD PTR $T5014[ebp]

; 173  :  int two = 2;

 fistp DWORD PTR _i$5012[ebp]

; 178  :   if (iScaled <= 0)

 mov esi, DWORD PTR _i$5012[ebp]
 test esi, esi
 jg SHORT $L4460

; 179  :    *pDestination = 0;

 mov BYTE PTR [edx], 0

; 180  :   else if (iScaled >= PRECISION3)

 jmp SHORT $L4463
$L4460:
 cmp esi, 4096    ; 00001000H
 jl SHORT $L4462

; 181  :    *pDestination = 255;

 mov BYTE PTR [edx], 255   ; 000000ffH

; 182  :   else

 jmp SHORT $L4463
$L4462:

; 183  :   {
; 184  :    iSRGB = FloatToSRGBTable3[iScaled];

 xor ecx, ecx
 mov cl, BYTE PTR _FloatToSRGBTable3[esi]

; 185  :    if (one == two)
; 186  :     ++iSRGB;
; 187  :    *pDestination = (unsigned char) iSRGB;

 mov BYTE PTR [edx], cl
$L4463:

; 188  :   }
; 189  :   pSource++;

 add edi, 4

; 190  :   pDestination++;

 inc edx
 dec eax
 jne SHORT $L4457
+4  A: 

My first guess is that the branch is being predicted better in the second case. Possibly because the nested if gives whatever algorithm the processor's using more information to guess from. Just out of curiousity, what happens when you remove the line

if (((int *)SRGBCeiling)[iSRGB] <= *((int *)pSource))

?

Dan O
Commenting out that line makes it roughly the same speed as the other code, within the variation I see in the timing results.
Mark Ransom
How do you think a well predicted branch can be faster than no branch at all?
Mark Ransom
/Two/ well predicted branches faster than /one/ badly predicted one. (Or some other numbers; there are 3 branches in the first listing and 4 in the second)
aib
A: 

Are you just testing this inner loop, or are you testing your undisclosed outer loop as well? If so, look at these three lines:

if (((int *)SRGBCeiling)[iSRGB] <= *((int *)pSource))  
    ++iSRGB;
*pDestination = (unsigned char) iSRGB;

Now, it looks like *pDestination is the counter for the outer loop. So by sometimes doing an extra increment of the iSRGB value you get to skip some of the iterations in the outer loop, thereby reducing the total amount of work the code needs to do.

Joel Coehoorn
I included the loop statement in the samples; i is the loop counter. Although it's assigned to different registers in the assembly output, I don't see how it would make a difference. The ++iSRGB condition is only hit 3% of the time.
Mark Ransom
I see you clarified your answer. No, my outer loop is just as simple as the inner one: for (int i = 0; i < repeats; ++i)
Mark Ransom
+10  A: 

My guess is, that in the first case two different branches end up in the same branch-prediction slot on the CPU. If these two branches predict different each time the code will slow down.

In the second loop, the added code may just be enough to move one of the branches to a different branch prediction slot.

To be sure you can give the Intel VTune analyzer or the AMD CodeAnalyst tool a try. These tools will show you what's exactly going on in your code.

However, keep in mind that it's most probably not worth to optimize this code further. If you tune your code to be faster on your CPU it may at the same time become slower on a different brand.


EDIT:

If you want to read on the branch-prediction give Agner Fog's excellent web-site a try: http://www.agner.org/optimize/

This pdf explains the branch-prediction slot allocation in detail: http://www.agner.org/optimize/microarchitecture.pdf

Nils Pipenbrinck
Of course it's not worth it, why should I complain that the better code is faster? But this is my hobby code, and I hate not understanding what's going on. P.S. good thinking on the branch-prediction slot, I'll try to make up a test for that.
Mark Ransom
See my edit to the question. And thanks for the suggestion on AMD CodeAnalyst - I'll have to try it if I can find the time. Doesn't look like I can use the integrated version though.
Mark Ransom
+1  A: 

How are you timing these routines? I wonder if paging or caching is having an effect on the timings? It's possible that calling the first routine loads both into memory, crosses a page boundary or causes the stack to cross into an invalid page (causing a page-in), but only the first routine pays the price.

You may want to to run through both functions once before making the calls that take the measurements to reduce the effects that virtual memory and caching might have.

Michael Burr
I thought of that, but the loop is 100000, and it's being repeated 100 times. Plus I swapped the order of the two functions. I'm using QueryPerformanceCounter for timing.
Mark Ransom
A: 

I once had a similar situation. I hoisted some code out of a loop to make it faster, but it got slower. Confusing. Turns out, the average number of times though the loop was less than 1.

The lesson (which you don't need, obviously) is that a change doesn't make your code faster unless you measure it actually running faster.

Mark Gilbert