ansaurus

Question

Effectiveness of GCC optmization on bit operations

Answer 1

A:

I think you're asking a lot of your optimizer.

You might be able to help it out a little by doing a `register long z = 1L << bit;", then or-ing that with your array.

However, I assume that by 90% more time, you're meaning that the C version takes 10 cycles and the asm version takes 5 cycles, right? How does the performance compare at -O2 or -O1?

Seth 2010-01-11 02:35:48

"register" will be ignored by the compiler.

Laurynas Biveinis 2010-01-11 07:20:57

@Laurynas Biveinis: "register" may or may not be ignored by the compiler, it is a hint, after all

Hasturkun 2010-01-11 07:29:30

The idea was that doing `long |= register long` might encourage the compiler to optimize :)

Seth 2010-01-11 08:12:11

@Hasturkun: most (GCC, MSVC, ...) if not all current compilers will ignore it for sure.

Laurynas Biveinis 2010-01-11 08:31:54

Answer 2

+2 A:

Can you post the code that you are using to do the timing? This sort of operation can be tricky to time accurately.

In theory the two code sequences should be equally fast, so the most likely explanation (to my mind) is that something is causing your timing code to give bogus results.

Stephen Canon 2010-01-11 03:15:40

Yes. On all x86 CPUs I'm aware of, a basic ALU operation like `OR` is at least as fast as "exotic" ops like `BTS`. One possibility is that the increment+test in your timing loop uses the same CPU execution unit as the `OR`, leading to contention, while a `BTS` uses a different unit.

j_random_hacker 2010-01-11 04:27:59

On recent x86 hardware, `or` can be dispatched onto more than one execution unit, so there shouldn't be any contention there.

Stephen Canon 2010-01-11 04:51:55

Posted - thanks for your help.

Dumb Guy 2010-01-11 06:03:49

Answer 3

+9 A:

Why does GCC optimize so poorly for such a common operation?

Prelude: Since the late 1980s, focus on compiler optimization has moved away from microbenchmarks which focus on individual operations and toward macrobenchmarks which focus on applications whose speed people care about. These days most compiler writers are focused on macrobenchmarks, and developing good benchmark suites is something that is taken seriously.

Answer: Nobody on the gcc is using a benchmark where the difference between or and bts matters to the execution time of a real program. If you can produce such a program, you might be able to get the attention of people in gcc-land.

Am I doing something wrong with the C version?

No, this is perfectly good standard C. Very readable and idiomatic, in fact.

Norman Ramsey 2010-01-11 03:58:30

Answer 4

A:

For such code:

#include <stdio.h>
#include <time.h>

int main() {
  volatile long long i = 0;
  time_t start = time (NULL);
  for (long long n = 0; n < (1L << 32); n++) {
    i |= 1 << 10;
  }
  time_t end = time (NULL);
  printf("C took %ds\n", (int)(end - start));
  start = time (NULL);
  for (long long n = 0; n < (1L << 32); n++) {
    __asm__ ("bts %[bit], %[i]"
                  : [i] "=r"(i)
                  : "[i]"(i), [bit] "i" (10));
  }
  end = time (NULL);
  printf("ASM took %ds\n", (int)(end - start));
}

the result was:

C took 12s
ASM took 10s

My flag was (-std=gnu99 -O2 -march=core2). Without the volatile the loop was optimized out. gcc 4.4.2.

No difference was with:

__asm__ ("bts %[bit], %[i]"
              : [i] "+m"(i)
              : [bit] "r" (10));

So probably the answer was - noone cares. In microbenchmark the only difference is the one between those two methods but in real life I belive such code does not take much CPU.

Additionally for such code:

#include <stdio.h>
#include <time.h>

int main() {
  volatile long long i = 0;
  time_t start = time (NULL);
  for (long long n = 0; n < (1L << 32); n++) {
    i |= 1 << (n % 32);
  }
  time_t end = time (NULL);
  printf("C took %ds\n", (int)(end - start));
  start = time (NULL);
  for (long long n = 0; n < (1L << 32); n++) {
    __asm__ ("bts %[bit], %[i]"
                  : [i] "+m"(i)
                  : [bit] "r" (n % 32));
  }
  end = time (NULL);
  printf("ASM took %ds\n", (int)(end - start));
}

The result was:

C took 9s
ASM took 10s

Both results were 'stable'. Testing CPU 'Intel(R) Core(TM)2 Duo CPU T9600 @ 2.80GHz'.

Maciej Piechotka 2010-03-06 12:18:55

ansaurus

tags:

views:

answers:

Effectiveness of GCC optmization on bit operations

related questions