In C, strings are terminated with a null byte ( \0 ), which causes problems when you want to put a null in a string. Why not have a special escaped character such as \$ or something?
I am fully aware of how dumb this question is, but I was curious.
You need to have some actual byte value to terminate a string - how you represent it in code isn't really relevant.
If you used \$ to terminate strings, what byte value would it have in memory? How would you include that byte value in a string?
You're going to hit this problem whatever you do, if you use a special character to terminate strings. The alternative is to use counted strings, whereby the representation of a string includes its length (eg. BSTR).
I guess it's because it's faster to check, and totally improbable to occur in a reasonable string. Also, remember that C has no concept of strings. A string in C is not a thing by itself; it's just an array of characters. The fact that it's called and used as a string is purely incidental and conventional.
It causes problems but you can embed a \0 ...
const char* hello = "Hello\0World\0\0";
It causes a problem if you pass this to a standard library function like strlen, but not otherwise.
A better solution than any string-terminating character might be to prepend the length of the string like ...
const char* hello = "\x0BHello World";
... which is the way some other languages do it.
Terminating with a 0 has many performance niceties, which were very much relevant back in the late 60s.
CPUs have instructions for conditional jump on test for 0. In fact, some CPUs even have instructions which will iterate/copy a sequence of bytes up to the 0.
If you used an escaped character instead, you would have to test TWO different bytes to assert the end of the string. Not only is that slower, but you lose the ability to iterate one byte at a time, as you need a look-ahead or the ability to backtrack.
Now, other languages (cough, Pascal, cough) use strings in a count/value style. For them, any character is valid, but they always keep a counter with the size of the string. The advantage is clear, but there are disadvantages to this technique too.
For one thing, the string size is limited by the number of bytes the count takes. One byte gives you 255 characters, two bytes give you 65535, etc. It might be almost irrelevant today, but adding two bytes to every string was once quite expensive.
Edit:
I do not think the question is dumb. In these days of high level languages with memory management, incredible CPU power and obscene amounts of memory, such decisions from the past can well seem senseless. And, indeed, they MIGHT be senseless nowadays, so it's a fine thing to question them.
There is no reason for a nul character to be part of a string except as a terminator; it has no graphical representation, so you wouldn't see it, nor does it act as a control character. As far as text is concerned, it's as out-of-band a value as you can get without using a different representation (e.g., a multibyte value like 0xFFFF).
To slightly rephrase Michael's question, how would you expect "Hello\0World\0" to be handled?
If standard library functions like strlen or printf could (option-wise) look for an end-of-string marker \777 (as an alternative to \000), you could have a constant character string containing \0s. (Note this is purely hypothetical: \777 is octal 511, which doesn't fit in an 8-bit char; being out of band is exactly what would make it usable as a marker, but it would require a wider character type.)
const char* hello = "Hello\0World\0\0\777";
printf("%s\n", hello);
By the way, if you want to send a \0 to stdout (aka -print0) you may use:
putchar(0);