views:

137

answers:

2

I have a container similar to this one.

template <typename Nat, typename Elt>
class NatMap {
 public:
  Elt& operator[] (Nat nat) { 
    return tab [nat.GetRaw()];
  }
 private:
  Elt tab [Nat::kBound];
};

I wanted to drop the requirement for Elt to have a default constructor:

template <typename Nat, typename Elt>
class NatMap {
 public:
  Elt& operator[] (Nat nat) { 
    return ((Elt*)tab) [nat.GetRaw()];
  }
 private:
  char tab [Nat::kBound * sizeof(Elt)];
};

I use g++-4.3 and this code works 25% slower in my application than the previous one. Unfortunately the slowdown does not manifest in a synthetic benchmark. I guess it is something about compiler optimizations, aliasing, aligning, or similar stuff.

What should I do to get my performance back? (while not needing the default constructor)

Update:

Just now I tried new g++-4.4 and it gave me a following warning for the latter code:

dereferencing pointer '<anonymous>' does break strict-aliasing rules
A: 

Small suggestion: rather than trying to make educated guesses, like if the compiler optimizations are different, you could either single-step it, or find out with this unorthodox method.

Mike Dunlavey
Mike, I'm not searching for a slow part of the algorithm. Array access is essential and I want to know why in the case above I get 25% slower binary.
Łukasz Lew
In that case, single stepping each program at the instruction level is a simple way to find out. They are doing different things. Sampling will also tell you. Maybe I misunderstood you. Maybe you know what they are doing different, and you only want to know why.
Mike Dunlavey
... In fact, sometimes I step two programs side-by-side, in two instances of the IDE/debugger, so I can see just how they differ. It may be a little slow, but what the heck, so am I :-)
Mike Dunlavey
Look at the source, the programs are identical at the source level (in terms of step-by-step debugger walking) - even side by side execution will be identical. Only assembler would help, but this is impossible, as changing one line in code affects all the important parts of assembly output in chaotic manner due to whole program optimizations. So it won't help either. Running assembly side by side ... maybe, but how? even gdb can't tell which source line is responsible for which assembly line for 70% of the source. I know what I'm saying, I've been there.
Łukasz Lew
That's what I'm saying. Write two small apps, one one way, and one the other, so as to remove all other variables. The source is not identical. Then step at the assembly level. If gdb can't show the source line as well as the disassembly, get a mixed source-assembly listing and follow along in that. It's not chaotic. If the compiler scrambles the code too much, turn down the optimization. You will be able to understand what the compiler did. I've been there too.
Mike Dunlavey
This kind of question can only really be answered by doing an experiment. If anyone else can do it, you can too.
Mike Dunlavey
As I mentioned, this drop doesn't show up in small custom benchmark application. And reducing from -O3 to -O2 drops 80% of performance (mostly due to inlineing)
Łukasz Lew
Well, it may be a tough one, but I've never seen a question like this that one couldn't get to the bottom of. As I mentioned, if instruction stepping is a problem, I just take stackshots. 20 should be enough. The slow one should show about 4 samples doing something that the fast one is not doing. Then I just make sure I understand exactly what that is. Good luck.
Mike Dunlavey
+1  A: 

You may be running into alignment problems. If Elt is some size other than the native alignment type, then allocating it via placement into a character array may involve a lot of unaligned reads that you don't see when the compiler aligns it for you. Or you may be running into a problem called a load-hit-store, which some processors manifest when they write a value to memory and then read it back immediately; in those processors, it can be a stall as long as a pipeline.

Or it may be something else entirely, some kind of pathological code generation by GCC.

Unfortunately stack traces don't help track down either of these issues, as they'd just look like a load operation (lw, lb, etc) that took forty cycles instead of one. The stall is in the microcode inside the CPU, not the x86 code you've written. But looking at the assembly with the -S commandline option can help you figure out what the compiler is really emitting, and how it differs between your two implementations. Maybe there's some bad operation cropping up in one version.

Crashworks
I will look into that, but Elt is almost always int-sized. __atributte__ ((aligin)) doesn't help either.
Łukasz Lew
attribute(align) would be foiled by placement news into a char array; after all, you could tell it to place upon any arbitrary address.
Crashworks