tags:

views: 79

answers: 2

Say you are writing a 3D renderer with support for multi-texturing, where the number of texture units is configured through a compile-time constant. As opposed to hard-coding a single texture unit, your code now has to access texture-related parameters through arrays and handle them in loops.

Assuming a modern C++ compiler, are there any practices you'd follow so that the compiler generates code equivalent to a hard-coded single texture unit when the number of texture units is set to one?

A: 

I don't know much about computer graphics, but this trivial test shows that with TEXTURE_COUNT=1 and -O1, g++ doesn't branch or loop. I suspect this will extend to many real-life programs, but why don't you try it for yourself? Use -S to see the generated assembly.

#include <stdio.h>

typedef struct fake_texture
{
    int r, g, b;
} texture;

int main()
{
    texture array[TEXTURE_COUNT] = {};
    for (int i = 0; i < TEXTURE_COUNT; i++)
    {
        array[i].r += 1;
        array[i].g += 2;
        array[i].b += 3;
    }

    for (int i = 0; i < TEXTURE_COUNT; i++)
    {
        printf("%d\n", array[i].r);
    }
}

x86 assembly excerpt:

main:
.LFB31:
        .cfi_startproc
        .cfi_personality 0x0,__gxx_personality_v0
        pushl   %ebp
        .cfi_def_cfa_offset 8
        movl    %esp, %ebp
        .cfi_offset 5, -8
        .cfi_def_cfa_register 5
        andl    $-16, %esp
        subl    $32, %esp
        movl    $1, 20(%esp)
        movl    $2, 24(%esp)
        movl    $3, 28(%esp)
        movl    $1, 8(%esp)
        movl    $.LC0, 4(%esp)
        movl    $1, (%esp)
        call    __printf_chk
        movl    $0, %eax
        leave
        ret
        .cfi_endproc
Matthew Flaschen
+1  A: 

What's wrong with loops and arrays?

Unrolling loops does have a disadvantage: it makes the code bigger. Bigger code means more memory accesses to fetch the code, and since memory access is slow, your code could end up being slower. Also, Intel CPUs preprocess the fetched instructions and turn them into uops (micro-ops), which are then scheduled and executed. The CPU has a cache of these uops, so it only decodes instructions that aren't already in the cache. An unrolled loop will fill up the cache and cause other code to be bumped out. Smaller code is generally better.

As for arrays, I'm not sure how you'd get rid of them.

So, if you had:

struct TextureUnit
{
  // some texture unit data
};

TextureUnit units [number_of_units];

for (int i = 0 ; i < number_of_units ; ++i)
{
  callfunction (units [i].someparams);
}

it might be better to do:

for (TextureUnit *i = units ; i < &units [number_of_units] ; ++i)
{
  callfunction (i->someparams);
}

but you'd need to see what the compiler generates in an optimised build to be sure it actually gives any advantage.

I think this might be classed as a 'micro-optimisation', so I wouldn't really be worried about it unless you can prove it really is a bottleneck. Remember - profile the code, don't just guess.

Skizz