views:

833

answers:

6

In C, strings are terminated with a null character ( \0 ), which causes problems when you want to put a null in a string. Why not have a special escaped character such as \$ or something?

I am fully aware of how dumb this question is, but I was curious.

+12  A: 

You need to have some actual byte value to terminate a string - how you represent it in code isn't really relevant.

If you used \$ to terminate strings, what byte value would it have in memory? How would you include that byte value in a string?

You're going to hit this problem whatever you do, if you use a special character to terminate strings. The alternative is to use counted strings, whereby the representation of a string includes its length (eg. BSTR).

RichieHindle
Okay, so \$ would point to some value that is currently unused.
akway
But there are no "unused" byte values. Any byte can occur in a C string - you might as well say that \0 was chosen because it was unused.
RichieHindle
Like what? If you are using UTF-8, then the entire range is used.
Michael Aaron Safyan
C strings do not support UTF-8, traditionally. UTF-8 did not exist when C was invented, and did not appear until a couple of decades afterwards.
Daniel
Except for U+0000, UTF-8 never encodes to a \0. You're probably thinking of UCS-2/UTF-16.
staticsan
C libraries NEED a terminator value; Pascal-style strings use a length parameter. It was merely a design choice, not the only way to do it. And most people would argue that C's string handling sucks because of it.
NoMoreZealots
I don't think "most people" would agree to that. Handling strings in C may not be as elegant as with some other solutions, but from a speed standpoint it's no contest, and C was designed for speed. See Daniel's answer for why.
Gerald
A C string is just an array of bytes, so its encoding depends on how you interpret it. There is no reason you can't interpret a const char* sequence as a UTF-8 encoded string, and a lot of UNIX implementations now interpret const char* parameters as UTF-8 encoded sequences. Java's non-standard "UTF-8" (actually a variant of CESU-8) encodes embedded nulls as something other than '\0'; in standard UTF-8, NUL is '\0' and will terminate the string.
Michael Aaron Safyan
+2  A: 

I guess because it's faster to check, and totally improbable to occur in a reasonable string. Also, remember that C has no concept of strings. A string in C is not something by itself. It's just an array of characters. The fact that it's called and used as a string is purely incidental and conventional.

Stefano Borini
+1  A: 

It causes problems but you can embed a \0 ...

const char* hello = "Hello\0World\0\0";

It causes a problem if you pass this to standard library functions like strlen, but not otherwise.

A better solution than any string-terminating character might be to prepend the length of the string like ...

const char* hello = "\x0BHello World";

... which is the way some other languages do it.

ChrisW
Nice examples, but you may want the prefixed string-length in your example to actually reflect the length of the string? (I think you forgot to count the space)
jerryjvl
Thanks for noting that. I counted to C, re-counted and decided that C was one too many, and then erroneously wrote down A as if C minus 1 was A. I've corrected it now.
ChrisW
Reminds me of the old days with Hollerith constants in FORTRAN, so you'd have a string like 16HTHIS IS A STRING. Woe be unto you if you miscounted! The newfangled quoted strings that showed up later were much nicer.
David Thornley
+32  A: 

Terminating with a 0 has many performance niceties, which were very much relevant back in the late 60s.

CPUs have instructions for conditional jump on test for 0. In fact, some CPUs even have instructions which will iterate/copy a sequence of bytes up to the 0.

If you used an escaped character instead, you would have to test TWO different bytes to assert the end of the string. Not only is that slower, but you lose the ability to iterate one byte at a time, as you need look-ahead or the ability to backtrack.

Now, other languages (cough, Pascal, cough) use strings in a count/value style. For them, any character is valid, but they always keep a counter with the size of the string. The advantage is clear, but there are disadvantages to this technique too.

For one thing, the string size is limited by the number of bytes the count takes. One byte gives you 255 characters, two bytes give you 65535, and so on. That might seem almost irrelevant today, but adding two bytes to every string was once quite expensive.

Edit:

I do not think the question is dumb. In these days of high level languages with memory management, incredible CPU power and obscene amounts of memory, such decisions from the past can well seem senseless. And, indeed, they MIGHT be senseless nowadays, so it's a fine thing to question them.

Daniel
+1 for mentioning the CPU. Your "some CPUs" includes Intel's x86 instruction set (though maybe those instructions aren't used much anymore).
ChrisW
If you define your own string structure, you can make the value 255 in the size byte indicate that another size byte follows.
Liran Orevi
The performance characteristics are still relevant today in many situations. It's important in embedded systems, and in kernel/driver development where you still want to scrape and save every CPU cycle you can. Which is why C is still king in these areas.
Gerald
It's senseless not to know Wirth's Law. Especially now, when the hardware trend is to push the envelope in how SMALL a computer can be.
NoMoreZealots
@Pete Eddy: that has nothing to do with the issue. I'm talking about decisions made when the total available RAM was smaller than the memory used by today's CPU registers. The hardware trends are going nowhere close to that.
Daniel
A: 

There is no reason for a nul character to be part of a string except as a terminator; it has no graphical representation, so you wouldn't see it, nor does it act as a control character. As far as text is concerned, it's as out-of-band a value as you can get without using a different representation (e.g., a multibyte value like 0xFFFF).

To slightly rephrase Michael's question, how would you expect "Hello\0World\0" to be handled?

John Bode
A: 

If standard library functions like strlen or printf could (via some option) look for an end-of-string marker \777 (as an alternative to \000), you could have a constant character string containing \0s:

const char* hello = "Hello\0World\0\0\777"; 
printf("%s\n", hello);

By the way, if you want to send a \0 to stdout (as with find's -print0), you may use:

putchar(0);