ansaurus

Question

Implementation of string literal concatenation in C and C++

Answer 1

+3 A:

Unless the preprocessor is specified to handle this, it's safe to assume it's the compiler's job.

Edit:

Your "I.e." link at the beginning of the post answers the question:

Adjacent string literals are concatenated at compile time; this allows long strings to be split over multiple lines, and also allows string literals resulting from C preprocessor defines and macros to be appended to strings at compile time...

Cogwheel - Matthew Orlando 2010-06-29 16:22:43
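
To make the quoted description concrete, here is a minimal sketch of both uses: a long string split over several lines, and a macro-supplied literal appended at compile time. The macro `APP_NAME`, the array `usage`, and the helper `usage_is_joined` are names made up for this illustration and are not from the question.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical macro, used only for this illustration. */
#define APP_NAME "demo"

/* One logical string written as several adjacent literals.
   The translator joins them into a single array at compile time. */
static const char usage[] =
    "usage: " APP_NAME " [options]\n"
    "  -h  show help\n";

/* Returns 1 if the split spelling equals the single-literal spelling. */
static int usage_is_joined(void)
{
    return strcmp(usage, "usage: demo [options]\n  -h  show help\n") == 0;
}
```

Calling `usage_is_joined()` should return 1, since no concatenation work is left for run time.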

However in any case it happens before converting preprocessor tokens to real tokens. For example the following yields `5` instead of a parse-error: `sizeof "12" "34"`

Johannes Schaub - litb 2010-06-29 16:29:57

@Johannes: What does `sizeof "12" "34"` have to do with preprocessor tokens?

David Thornley 2010-06-29 16:35:37

@David: the *sizeof* operator takes only a single operand. Give it two and it will complain.

Karmastan 2010-06-29 17:03:24

@Karmastan: Strings are concatenated in phase 6 of translation, and `sizeof` is evaluated in phase 7. Phase 4 is when preprocessor tokens are dealt with. `sizeof` is not a preprocessor token by the time it is evaluated.

David Thornley 2010-06-29 17:16:48

@David: What I am saying is that string literals are concatenated before preprocessing tokens are converted to real tokens, because the abstract syntax is `sizeof literal` and not `sizeof literal literal`. It's *one* non-pp token when the token stream has been converted and is being analyzed by phase 7 in C++ and C. I'm not saying that `sizeof` is evaluated at preprocessing time.

Johannes Schaub - litb 2010-06-29 17:29:29

@Johannes: My apologies, I got confused by the "preprocessor" in "preprocessor token". Yes, the concatenation is phase 6, and the conversion of preprocessing tokens is at the start of phase 7.

David Thornley 2010-06-29 17:54:33
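
The `sizeof` example from the comments above can be checked with a small sketch like the following. The function name `joined_size` is invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* The two adjacent literals are joined in phase 6, so by phase 7
   sizeof has exactly one operand: the array "1234" plus its
   terminating null character. */
static size_t joined_size(void)
{
    return sizeof "12" "34";
}
```

If the literals were still two separate tokens during syntax analysis, `sizeof "12" "34"` would not parse at all. Instead `joined_size()` returns 5.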

Answer 2

+6 A:

The standard doesn't distinguish a preprocessor from a compiler; it just specifies the phases of translation you already noted. Traditionally, phases 1 through 4 were in the preprocessor, phases 5 through 7 in the compiler, and phase 8 in the linker, but none of that is required by the standard.

Jerry Coffin 2010-06-29 16:27:07

@Jerry, does this mean that gcc's cpp doesn't conform to this tradition of handling 1-6 in cpp?

Eli Bendersky 2010-06-29 16:31:59

@Eli: See the edited/corrected answer. I think it sticks pretty close to the (real) tradition.

Jerry Coffin 2010-06-29 16:40:18

Answer 3

+1 A:

In the ANSI C standard, this detail is covered in section 5.1.1.2, item (6):

5.1.1.2 Translation phases
...

4. Preprocessing directives are executed and macro invocations are expanded. ...

5. Each source character set member and escape sequence in character constants and string literals is converted to a member of the execution character set.

6. Adjacent character string literal tokens are concatenated and adjacent wide string literal tokens are concatenated.

The standard does not require that the implementation be split into a preprocessor and a compiler, per se.

Step 4 is clearly a preprocessor responsibility.

Step 5 requires that the "execution character set" be known. This information is also required by the compiler. It is easier to port the compiler to a new platform if the preprocessor does not contain platform dependencies, so the tendency is to implement step 5, and thus step 6, in the compiler.

Heath Hunnicutt 2010-06-29 16:29:57
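
As a rough illustration of the ordering between step 4 and step 6: a literal produced by macro expansion already exists by the time adjacent literals are joined. The macros `STR_`, `STR`, and `VERSION_MAJOR` and the array `banner` are hypothetical names chosen for this sketch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical macros, for illustration only. */
#define STR_(x) #x
#define STR(x) STR_(x)   /* two levels so the argument is expanded first */
#define VERSION_MAJOR 2

/* Phase 4 turns STR(VERSION_MAJOR) into the literal "2".
   Phase 6 then joins "v", "2" and ".0" into one string. */
static const char banner[] = "v" STR(VERSION_MAJOR) ".0";
```

Under these assumptions `banner` holds `v2.0`, regardless of whether one program or two separate programs carry out the phases.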

Answer 4

A:

I would handle it in the scanner (the tokenizing part of the parser), and therefore in the compiler. That seems more logical. The preprocessor does not need to know the "structure" of the language, and in fact it usually ignores it, which is why macros can generate uncompilable code. It handles nothing more than the directives specifically addressed to it (# ...) and their "consequences" (such as those of a #define x h, which makes the preprocessor replace occurrences of x with h).

ShinTakezou 2010-07-05 20:37:15

i.e. an opening ", some content, a closing ", followed by "blanks", then another opening " and closing " (and so on) would not produce two string tokens to be merged later: the scanner would produce a single string token directly

ShinTakezou 2010-07-05 20:39:48
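
A very rough sketch of that idea, not taken from any real compiler: a scanner routine that keeps reading while it sees another opening quote after blanks, and emits the joined contents as one token. The function `scan_joined_literal` is hypothetical, and it deliberately ignores escape sequences and prefixes, which a real scanner would have to convert per literal before joining.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: scans one or more adjacent "..." literals starting
   at src and writes their joined contents to out (capacity cap, including
   the terminating null). Escape sequences and prefixes are NOT handled.
   Returns the number of literals merged, or 0 on error. */
static int scan_joined_literal(const char *src, char *out, size_t cap)
{
    size_t n = 0;
    int count = 0;

    if (cap == 0)
        return 0;
    for (;;) {
        while (isspace((unsigned char)*src))  /* skip blanks between literals */
            src++;
        if (*src != '"')
            break;                            /* no further literal follows */
        src++;                                /* consume the opening quote */
        while (*src != '\0' && *src != '"') {
            if (n + 1 >= cap)
                return 0;                     /* output buffer too small */
            out[n++] = *src++;
        }
        if (*src != '"')
            return 0;                         /* unterminated literal */
        src++;                                /* consume the closing quote */
        count++;
    }
    out[n] = '\0';
    return count;
}
```

For the input `"12"  "34"` this returns 2 and leaves `1234` in the output buffer, so the parser would only ever see a single string token.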

Answer 5

+1 A:

There are tricky rules for how string literal concatenation interacts with escape sequences. Suppose you have

const char x1[] = "a\15" "4";
const char y1[] = "a\154";
const char x2[] = "a\r4";
const char y2[] = "al";

then x1 and x2 must wind up equal according to strcmp, and the same for y1 and y2. In ASCII, `\15` is the octal escape for carriage return and `\154` is the octal escape for `l`, so the `4` in x1 must not be absorbed into the preceding escape. (This is what Heath is getting at in quoting the translation steps: escape conversion happens before string constant concatenation.) There's also a requirement that if any of the string constants in a concatenation group has an L or U prefix, you get a wide or Unicode string. Put it all together and it winds up being significantly more convenient to do this work as part of the "compiler" rather than the "preprocessor."

Zack 2010-07-16 18:00:00
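
The four declarations above can be turned into a self-checking sketch. It assumes an ASCII-compatible execution character set, and the helper name `escapes_before_concat` is made up for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Assumes an ASCII-compatible execution character set. */
static const char x1[] = "a\15" "4";  /* 'a', CR, '4': escape ends at the quote */
static const char y1[] = "a\154";     /* 'a', 'l': octal 154 is 'l' */
static const char x2[] = "a\r4";
static const char y2[] = "al";

/* Returns 1 if escapes were converted before the literals were joined. */
static int escapes_before_concat(void)
{
    return strcmp(x1, x2) == 0
        && strcmp(y1, y2) == 0
        && strcmp(x1, y1) != 0;
}
```

A translator that pasted the source text of the two literals together first would read `\154` in x1 and make it equal to y1, which is exactly what the required ordering rules out.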
