Hello,

I ran into the following weird behavior. I have two functions that do almost the same thing: they measure the number of cycles it takes to do a certain operation. In one function I increment a variable inside the loop; in the other, nothing happens in the loop body. The variables are volatile so they won't be optimized away. These are the functions:

unsigned int _osm_iterations=5000;

double osm_operation_time(){
    // volatile is used so that j will not be optimized away and the
    // ++ operation will actually be performed on each iteration
    volatile unsigned int j=0;
    volatile unsigned int i;
    tsc_counter_t start_t, end_t;
    start_t = tsc_readCycles_C();
    for (i=0; i<_osm_iterations; i++){
       ++j;
    }
    end_t = tsc_readCycles_C();
    if (tsc_C2CI(start_t) ==0 || tsc_C2CI(end_t) ==0 || tsc_C2CI(start_t) >= tsc_C2CI(end_t))
         return -1;
    return (tsc_C2CI(end_t)-tsc_C2CI(start_t))/_osm_iterations;
}

double osm_empty_time(){
    volatile unsigned int i;
    volatile unsigned int j=0;
    tsc_counter_t start_t, end_t;
    start_t = tsc_readCycles_C();
    for (i=0; i<_osm_iterations; i++){
        ;
    }
    end_t = tsc_readCycles_C();
    if (tsc_C2CI(start_t) ==0 || tsc_C2CI(end_t) ==0 || tsc_C2CI(start_t) >= tsc_C2CI(end_t))
        return -1;
    return (tsc_C2CI(end_t)-tsc_C2CI(start_t))/_osm_iterations;
}

There are some non-standard functions there but I'm sure you'll manage.

The thing is, the first function returns 4 while the second one returns 6, even though the second one obviously does less work than the first.

Does that make any sense to anyone?

Actually, the point of having both functions is so I can subtract the loop overhead from my measurement of the operation itself. Do you have any idea how to do that properly (as this method doesn't really cut it)?

I'm on Ubuntu (64 bit I think).

Thanks a lot.

+4  A: 

I can see a couple of things here. First, the code for the two loops looks identical. Second, the compiler will probably realise that the variables i and j always have the same value and optimise one of them away. You should look at the generated assembly and see what is really going on.

Another theory is that the change to the inner body of the loop has affected the cacheability of the code: it could have moved it across cache lines or something similar.

Since the code is so trivial, you may find it difficult to get an accurate timing value. Even with 5000 iterations, the total time may be inside the margin of error of the timing code you are using: a modern computer can probably run that in far less than a millisecond. Perhaps you should increase the number of iterations?

To see the generated assembly in gcc, specify the -S compiler option:

Q: How can I peek at the assembly code generated by GCC?

Q: How can I create a file where I can see the C code and its assembly translation together?

A: Use the -S (note: capital S) switch to GCC, and it will emit the assembly code to a file with a .s extension. For example, the following command:

gcc -O2 -S -c foo.c

will leave the generated assembly code in the file foo.s.

If you want to see the C code together with the assembly it was converted to, use a command line like this:

gcc -c -g -Wa,-a,-ad [other GCC options] foo.c > foo.lst

which will output the combined C/assembly listing to the file foo.lst.

1800 INFORMATION
The first thing is just a copy mistake. About the second one - I don't really know how to do it... any tips?
R S
Also, why would it optimize them away anyway if they are declared volatile?
R S
I don't really know whether it has or not, but since they are stack-allocated local variables and you never take their address or do anything else unusual with them, maybe the compiler can tell that they can never be touched by an external source.
1800 INFORMATION
Generally your compiler has options to view the generated assembly - I don't know what yours would be
1800 INFORMATION
It's gcc. Thanks :)
R S
More iterations don't fix it :)
R S
Then in that case, you should definitely look at the generated compiler output. You could be left scratching your head though, if your real code is as simple as this.
1800 INFORMATION
The volatile qualifiers mean that the optimizer cannot tell what is in i or j; there could apparently be something else that changes them that the compiler cannot detect.
Jonathan Leffler
A: 

It's sometimes difficult to guess at this sort of thing, especially due to the small number of iterations. One thing that might be happening, though, is that the increment could be executing on a free integer execution unit, gaining some slight degree of parallelism, since it has no dependency on the value of i.

Since you mentioned this is a 64-bit OS, it's almost certain all these values are in registers, since there are more registers in the x86_64 architecture. Other than that, I'd say perform many more iterations and see how stable the results are.

Brian Mitchell
A: 

If you are truly trying to test the operation of a piece of code ("j++;" in this case), you're actually better off doing the following:

1/ Do it in two separate executables since there is a possibility that position within the executable may affect the code.

2/ Make sure you use CPU time rather than elapsed time (I'm not sure what "tsc_readCycles_C()" gives you). This is to avoid errant results from a CPU loaded up with other tasks.

3/ Turn off compiler optimization (e.g., "gcc -O0") to ensure gcc doesn't put in any fancy stuff that's likely to skew the results.

4/ You don't need to worry about volatile if you use the actual result, such as placing:

printf ("%u\n", j);

after the loop, or:

FILE *fx = fopen ("/dev/null","w");
fprintf (fx, "%u\n", j);
fclose (fx);

if you don't want any output at all. I can't remember whether volatile was a suggestion to the compiler or enforced.

5/ 5,000 iterations seems a little on the low side; "noise" could affect the readings, so a higher value would probably be better. This may not be an issue if you're timing a larger piece of code and have just included "j++;" as a place-holder. (A sketch pulling some of these points together follows this list.)
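
Purely as an illustration of how points 2, 4 and 5 might fit together (the iteration count and the use of clock() from <time.h> are my assumptions, not part of your tsc_* code):

#include <stdio.h>
#include <time.h>

#define ITERATIONS 100000000UL    /* much larger than 5,000 (point 5) */

int main(void){
    unsigned long i, j = 0;
    clock_t start, end;

    start = clock();              /* CPU time rather than elapsed time (point 2) */
    for (i = 0; i < ITERATIONS; i++){
        ++j;
    }
    end = clock();

    /* Use the result (point 4); together with -O0 (point 3) this keeps
       the compiler from removing the loop. */
    printf("j=%lu, CPU time per iteration: %g s\n",
           j, ((double)(end - start) / CLOCKS_PER_SEC) / ITERATIONS);
    return 0;
}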

paxdiablo
A: 

When I'm running tests similar to this, I normally:

  1. Ensure that the times are measured in at least seconds, preferably (small) tens of seconds.
  2. Have a single run of the program call the first function, then the second, then the first again, then the second again, and so on, just to see if there are weird cache warm-up issues (see the sketch after this list).
  3. Run the program multiple times to see how stable the timing is across runs.
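
A small, hypothetical driver for point 2 (it assumes it is compiled and linked together with the osm_* functions from your question):

#include <stdio.h>

/* Declarations for the functions from the question; this driver assumes
   they are compiled and linked in separately. */
double osm_operation_time(void);
double osm_empty_time(void);

int main(void){
    int pass;
    /* Alternate the two measurements several times so any warm-up effect
       shows up as a difference between the first pass and later ones. */
    for (pass = 0; pass < 5; pass++){
        printf("pass %d: operation=%.2f empty=%.2f\n",
               pass, osm_operation_time(), osm_empty_time());
    }
    return 0;
}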

I'm still at a loss to explain your observed results, but if you're sure you've got your functions identified properly (not self-evident, given the copy'n'paste errors earlier, for example), then looking at the assembler output is the main option left.

Jonathan Leffler