Have a look at this code:

#include <iostream>
using namespace std;

int main()
{
    const char* str0 = "Watchmen";
    const char* str1 = "Watchmen";
    char* str2 = "Watchmen";
    char* str3 = "Watchmen";

    cerr << static_cast<void*>( const_cast<char*>( str0 ) ) << endl;
    cerr << static_cast<void*>( const_cast<char*>( str1 ) ) << endl;
    cerr << static_cast<void*>( str2 ) << endl;
    cerr << static_cast<void*>( str3 ) << endl;

    return 0;
}

Which produces an output like this:

0x443000
0x443000
0x443000
0x443000

This was on the g++ compiler running under Cygwin. The pointers all point to the same location even with no optimization turned on (-O0).

Does the compiler always optimize so much that it searches all the string constants to see if they are equal? Can this behaviour be relied on?

+1  A: 

No, it can't be relied on, but storing read-only string constants in a pool is a pretty easy and effective optimization. It's just a matter of storing an alphabetical list of strings, and then outputting them into the object file at the end. Think of how many "\n" or "" constants are in an average code base.

If a compiler wanted to get extra fancy, it could re-use suffixes too: "\n" can be represented by pointing to the last character of "Hello\n". But that likely comes with very little benefit for a significant increase in complexity.

Anyway, I don't believe the standard says anything about where anything is stored really. This is going to be a very implementation-specific thing. If you put two of those declarations in a separate .cpp file, then things will likely change too (unless your compiler does significant linking work.)
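
To make the pooling idea concrete, here is a minimal sketch of a string pool done by hand at run time. The function name intern and the choice of std::set are my own assumptions for illustration, not how any particular compiler implements it:

```cpp
#include <set>
#include <string>

// Hypothetical helper: returns one shared address per distinct string content.
const char* intern(const std::string& text)
{
    // std::set keeps its elements sorted and never relocates them after
    // insertion, so the returned pointer stays valid until the pool is
    // destroyed at program exit.
    static std::set<std::string> pool;

    // insert() leaves an existing equal element in place and returns an
    // iterator to it, so equal text always maps to the same address.
    return pool.insert(text).first->c_str();
}
```

With something like this, intern("Watchmen") == intern("Watchmen") holds by construction, instead of depending on what the compiler happens to do with the literals.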

Eclipse
+10  A: 

You surely should not rely on that behavior, but most compilers will do this. A literal such as "Hello" will typically be stored once, and any pointers to it will naturally resolve to that single copy.

If you find that you need to rely on that, then be safe and recode as follows:

const char *watchmen = "Watchmen";
const char *foo = watchmen;
const char *bar = watchmen;
dwc
+8  A: 

It's an extremely easy optimization, probably so much so that most compiler writers don't even consider it much of an optimization at all. Setting the optimization flag to the lowest level doesn't mean "Be completely naive," after all.

Compilers will vary in how aggressive they are at merging duplicate string literals. They might limit themselves to a single subroutine — put those four declarations in different functions instead of a single function, and you might see different results. Others might do an entire compilation unit. Others might rely on the linker to do further merging among multiple compilation units.

You can't rely on this behavior, unless your particular compiler's documentation says you can. The language itself makes no demands in this regard. I'd be wary about relying on it in my own code, even if portability weren't a concern, because behavior is liable to change even between different versions of a single vendor's compiler.
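
If what you actually need is "do these two strings hold the same text?", compare contents rather than addresses. A minimal sketch, where same_text is a hypothetical helper name of mine:

```cpp
#include <cstring>

// Hypothetical helper: compares the characters, not the addresses, so the
// result does not depend on whether the compiler merged the literals.
bool same_text(const char* a, const char* b)
{
    return std::strcmp(a, b) == 0;
}
```

same_text(str0, str2) gives the same answer on every compiler, whereas str0 == str2 only tells you where this particular build happened to place the literals.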

Rob Kennedy
I like how you put that. For me "no optimization" just means "don't do anything that might make it tougher to debug".
T.E.D.
+5  A: 

I would not rely on the behavior, because I doubt the C or C++ standards guarantee it, but it makes sense that the compiler does it. It also makes sense that it does so even when no optimization is requested; there is no trade-off in it.

All string literals in C or C++ (e.g. "string literal") are read-only, and thus constant. When you say:

char *s = "literal";

You are in a sense casting away the constness of the string. Nevertheless, you can't do away with the read-only attribute of the string: if you try to modify it, the behavior is undefined, and you will typically be caught at run-time rather than at compile-time. (Which is actually a good reason to use const char * when assigning string literals to a variable of yours.)
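
If you do need to change the text, work on a copy instead of the literal. A small sketch under that assumption; the helper name capitalized is made up for illustration:

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: takes the literal through a const pointer and
// modifies only a private copy, so the literal itself is never written to.
std::string capitalized(const char* text)
{
    std::string copy(text);  // owns its own writable characters
    if (!copy.empty()) {
        // toupper expects a value representable as unsigned char
        copy[0] = static_cast<char>(
            std::toupper(static_cast<unsigned char>(copy[0])));
    }
    return copy;
}
```

Calling capitalized("literal") leaves the literal untouched and hands back a separate, modifiable string.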

Peter
+21  A: 

It can't be relied on; it is an optimization that is not part of any standard.

I changed the corresponding lines of your code to:

const char* str0 = "Watchmen";
const char* str1 = "atchmen";
char* str2 = "tchmen";
char* str3 = "chmen";

The output for the -O0 optimization level is:

0x8048830
0x8048839
0x8048841
0x8048848

But for the -O1 it's:

0x80487c0
0x80487c1
0x80487c2
0x80487c3

As you can see, GCC (v4.1.2) reused the first string for all the subsequent suffixes. It's the compiler's choice how to arrange string constants in memory.
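
If you want to check for this kind of overlap in code rather than by reading addresses, something along these lines should work. It is only a sketch: points_into is a hypothetical name, and it assumes the usual flat address space where std::less orders pointers by address:

```cpp
#include <cstring>
#include <functional>

// Hypothetical helper: true if `part` points somewhere inside the storage
// of the NUL-terminated string `whole` (terminator included).
bool points_into(const char* whole, const char* part)
{
    const char* end = whole + std::strlen(whole);  // address of the '\0'

    // std::less gives a total order even for pointers into unrelated
    // objects, where the built-in < is not guaranteed to be meaningful.
    std::less<const char*> before;
    return !before(part, whole) && !before(end, part);
}
```

On a build where the literals were merged as in the -O1 output above, points_into(str0, str1) would come out true; on the -O0 build it would come out false.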

+1 Nice experiment.
Thomas L Holaday
I tried with g++ v4.3.2 on Cygwin, but am not seeing the behaviour where all the pointers are offset into the same string constant.
Ashwin
@Ash: I tried once again on GCC 3.4.6, and the behaviour occurs. How do you compile the code? The behaviour occurs when optimization >=O1 is enabled.
It seems to be a problem with the platform. I tried it this time on Linux (instead of Cygwin) and saw the optimized behaviour. Thanks again for sharing this info :-)
Ashwin
+4  A: 

You shouldn't count on that of course. An optimizer might do something tricky on you, and it should be allowed to do so.

It is however very common. I remember back in '87 a classmate was using the DEC C compiler and had this weird bug where all his literal 3's got turned into 11's (numbers may have changed to protect the innocent). He even did a printf ("%d\n", 3) and it printed 11.

He called me over because it was so weird (why does that make people think of me?), and after about 30 minutes of head scratching we found the cause. It was a line roughly like this:

if (3 = x) then break;

Note the single "=" character. Yes, that was a typo. The compiler had a wee bug and allowed this. The effect was to turn all his literal 3's in the entire program into whatever happened to be in x at the time.

Anyway, it's clear the C compiler was putting all literal 3's in the same place. If a C compiler back in the 80's was capable of doing this, it can't be too tough to do. I'd expect it to be very common.

T.E.D.
Wow--a DEC compiler in the mid-late '80s allowed assignments to literals?
Peter
I saw it happen. I wouldn't be surprised if there was a patch the school hadn't applied or something.
T.E.D.
Thanks for sharing that tidbit from the past Ted. That was pretty cool! :-)
Ashwin