views:

1260

answers:

10

I'm writing a language interpreter in C, and my string type contains a length attribute, like so:

struct String
{
    char* characters;
    size_t length;
};

Because of this, I have to spend a lot of time in my interpreter handling this kind of string manually since C doesn't include built-in support for it. I've considered switching to simple null-terminated strings just to comply with the underlying C, but there seem to be a lot of reasons not to:

Bounds-checking is built-in if you use "length" instead of looking for a null.

You have to traverse the entire string to find its length.

You have to do extra stuff to handle a null character in the middle of a null-terminated string.

Null-terminated strings deal poorly with Unicode.

Non-null-terminated strings can intern more, i.e. the characters for "Hello, world" and "Hello" can be stored in the same place, just with different lengths. This can't be done with null-terminated strings.

String slice (note: strings are immutable in my language). Obviously the second is slower (and more error-prone: think about adding error-checking of begin and end to both functions).

struct String slice(struct String in, size_t begin, size_t end)
{
    struct String out;
    out.characters = in.characters + begin;
    out.length = end - begin;

    return out;
}

char* slice(char* in, size_t begin, size_t end)
{
    char* out = malloc(end - begin + 1);

    for(int i = 0; i < end - begin; i++)
        out[i] = in[i + begin];

    out[end - begin] = '\0';

    return out;
}

After all this, my thinking is no longer about whether I should use null-terminated strings: I'm thinking about why C uses them!

So my question is: are there any benefits to null-termination that I'm missing?

A: 

I think main reason is that standard says nothing concrete about size of any type other than char. But sizeof(char) = 1 and that is definitely not enough for string size.

Kirill V. Lyadvinsky
It's enough for 2^CHAR_BIT characters per string.
Daniel Earwicker
It is only 255 characters. Too little.
Kirill V. Lyadvinsky
No, it's enough for 2^CHAR_BIT - 1 characters per string, and if you've never had to deal with strings longer than 255 characters then you've only had to program for a very limited problem domain. However C *does* say concrete things about other integral types - for example int has to have a range of at least -32767 to +32767. In particular C says that size_t must be able to hold the size of any object, so that would be fine as a standardised string length.
caf
@caf, you aren't thinking creatively enough if you can't see how to represent that extra unit of character length. :)
Daniel Earwicker
@Jla3ep Did you look at my code sample? I used size_t to store the length for a reason.
Imagist
@caf I think Earwicker means that you can have the `characters` pointer point to `NULL`.
Imagist
...for a zero-length string (I hate that you can't edit comments!).
Imagist
@Imagist, you asked why C uses null-terminate strings. I believe that size of `size_t` was equal to 1 in the beginning.
Kirill V. Lyadvinsky
+3  A: 

One benefit is that with null-termination any tail of a null-terminated string is also a null-terminated string. If you need to pass a substring starting with Nth character (provided there's no buffer overrun) into some string-handling function - no problem, just pass the offseeted address there. When storing size in some other way you would need to construct a new string.

sharptooth
Can you give an example of a string of that you might want to print the end of the string?
That can be used when concatenating strings - you could want to append not the entire string, but only the tail of it. Then you call strcat( target, source + offset); - and it's done.
sharptooth
Take a front trim of white space. You can determine the first non-white space char and instead of actually changing the string, you can just pass the starting offset in, saving you either allocating new memory, or copying data.
Dan McGrath
That's not all that different for what I do with my struct: `struct String new; new.characters = old.characters + offset; new.length = old.length - offset;` It's a bit of bookkeeping but comes out to what, 5 instructions? This seems trivial compared to the difference if you needed to do something to the beginning of the string instead of the end.
Imagist
It makes it really easy to do things like recursive string matching, spelling correction, etc., if you can treat the string like you would a list in Lisp.
Mike Dunlavey
+7  A: 

The usual solution is to do both - keep the length and maintain the null terminator. It's not much extra work and means that you are always ready to pass the string to any function.

Null-terminated strings are often a drain on performance, for the obvious reason that the time taken to discover the length depends on the length. On the plus side, they are the standard way of representing strings in C, so you have little choice but to support them if you want to use most C libraries.

Daniel Earwicker
This is what Lua does. It makes interfacing to C for normal use cases very clean, and still supports arbitrary length binary buffers.
RBerteig
It's what most things do! You don't even have to maintain the null terminator all the time - just do `str[len] = '\0'` whenever you need to. This is what `std::string::c_str`` usually does in C++.
Daniel Earwicker
By most things, I mean most string classes, and most interpreter string representations. One widely-used example on Windows is the BSTR type.
Daniel Earwicker
This is exactly why I asked this question; I thought I might be missing some solution. It seems obvious now but I didn't think of this!
Imagist
Cool! Well, you see that green check icon next to my answer...?
Daniel Earwicker
+1 After all it is the c++ designers solution :)
neuro
@Earwicker, lol on the green check!
Mark Harrison
@Earwicker I was planning on checking that, I just like to wait a few days. There are few things more annoying to me than people who accept an answer within minutes of asking the question.
Imagist
:) It means more if you have to wait for something, I think!
Daniel Earwicker
+1  A: 

Although I prefer the array + len method in most cases, there are valid reasons for using null-terminated.

Take a 32-bit system.

To store a 7 byte string
char * + size_t + 8 bytes = 19 bytes

To store a 7 byte null-term string
char * + 8 = 16 bytes.

null-term arrays don't need to be immutable like your strings do. I can happily truncate the c-string by simply places a null char. If you code, you would need to create a new string, which involves allocating memory.

Depending on the usage of the strings, your strings will never be able to match the performance possible with c-strings as opposed to your strings.

Dan McGrath
You can truncate a string-with-length by just reducing the length. Usually this means you have two lengths - the current length of the string plus the amount of memory you have currently allocated for the string. This allows it to be dynamically resized without having to realloc on every modification.
Jason Williams
Indeed, that is the way to go; I had based my answer on the op's string struct, which would enable you to reduce the length, but never to utilise that space again. Ropes are another interesting way to handle strings. http://en.wikipedia.org/wiki/Rope_(computer_science)
Dan McGrath
A few questions: assuming a byte is 8 bits, shouldn't a 32-bit system have `sizeof(size_t) == 4` and `sizeof(char*) == 4`? And with my method you don't have to use 8 characters for the first method. So I'm getting: 4 + 4 + 7 = 15 for my method, and 4 + 8 = 12 for the null-terminating method. I'm not contesting your point, just the math that leads to your point.
Imagist
@Dan McG (comment) My strings will be stored in a garbage-collected system, which will allow me to reclaim the space. And my interpreter uses ropes internally; the garbage collector will use idle cycles to flatten ropes into strings.
Imagist
+4  A: 

Lengths have their problems too.

  • The length takes extra storage (not such an issue now, but a big factor 30 years ago)
  • Every time you alter a string you have to update the length, so you get reduced performance across the board
  • With a nul-terinated string you can still use a length or store a pointer to the last character, so if you are doing lots of string manips, you can still equal the performance of string-with-length.
  • zero-terminated strings are much simpler - The nul terminator is just a convention used by methods like strcat to determine the end of the string. So you can store them in a regular char array rather than having to use a struct.
Jason Williams
Extra storage can still be a big issue for embedded systems, which is one reason to stress keeping the language light-weight.
Jimmy
@Jimmy My issue there is: on such embedded systems, why are you using strings? I don't think I even ever used a `char` back when I was doing robotics programming. The only example I can think of is if you're programming for an LED display (like those scrolling text things or the ones on soft drink machines) but the functionality there is so simple that I still have a difficult time imagining the extra 3 bytes being a problem (4 byte int - 1 byte since you don't have to store the null character).
Imagist
What you suggest is not that trivial btw. Who will take ownership of the buffer? Will the newly created temporary string do that? I doubt you want this and then you need a way to have such non-owning strings to avoid copying.
sharptooth
+3  A: 

Just throwing out some hypotheticals:

  • there's no way to get a "wrong" implementation of null terminated strings. A standardized struct however could have vendor-specific implementations.
  • no structs are required. Null terminated strings are "built-in" so to speak, by virtue of being a special case of a char*.
Jimmy
+14  A: 

From Joel's Back to Basics:

Why do C strings work this way? It's because the PDP-7 microprocessor, on which UNIX and the C programming language were invented, had an ASCIZ string type. ASCIZ meant "ASCII with a Z (zero) at the end."

Is this the only way to store strings? No, in fact, it's one of the worst ways to store strings. For non-trivial programs, APIs, operating systems, class libraries, you should avoid ASCIZ strings like the plague.

+1 Back to basics is an excellent post.
Brian Rasmussen
Denis Ritchie opinion is somewhat different. BCPL had a length+content representation, with the length contained in one byte. B switched to terminated string "partially to avoid the limitation on the length of a string caused by holding the count in an 8 or 9 bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator." (The Development of the C Language, http://cm.bell-labs.com/cm/cs/who/dmr/chist.pdf)
AProgrammer
+2  A: 

One advantage of nul-terminated strings is that if you are walking through a string character-by-character, you only need to keep a single pointer to address the string:

while (*s)
{
    *s = toupper(*s);
    s++;
}

whereas for strings without sentinels, you need to keep two bits of state around: either a pointer and index:

while (i < s.length)
{
    s.data[i] = toupper(s.data[i]);
    i++;
}

...or a current pointer and a limit:

s_end = s + length;
while (s < s_end)
{
    *s = toupper(*s);
    s++;
}

When CPU registers were a scarce resource (and compilers were worse at allocating them), this was important. Now, not so much.

caf
"When CPU registers were a scarce resource" - registers are still a scarce resource on x86 and x64.
Jimmy
I don't get it; if I'm storing string in the `struct` example I gave, why can't I use that as the limit?
Imagist
The point is that during a processing loop like the above, length-based strings like yours end up using two registers for string book-keeping whereas sentinel-based strings like idiomatic C strings only use one (the other one is obtained "for free" because the character values are being loaded in order to process them anyway).
caf
Ah, I see; the point is registers not memory.
Imagist
+1  A: 

You're absolutely right that 0-termination is a method which is poor with respect to type checking and performance for part of the operations. The answers on this page already summarize the origins and uses for it.

I liked the way Delphi stored strings. I believe it maintains a length/maxlength in before the (variable length) string. This way the strings can be null-terminated for compatibility.

My concerns with your mechanism: - additional pointer - immutability si in the core parts of your language; normally string types are not immutable so if you ever reconsider than it'll be tough. You'd need to implement a 'create copy on change' mechanism - use of malloc (hardly efficient, but may be included here just for ease?)

Good luck; writing your own interpreter can be very educational in understanding mainly the grammar and syntax of programming languages! (at least, it ws for me)

Adriaan
_Most_ high level languages have immutable strings.
Nick Johnson
+1  A: 

Slightly offtopic, but there's a more efficient way to do length-prefixed strings than the way you describe. Create a struct like this (valid in C99 and up):

struct String 
{
  size_t length;
  char characters[0];
}

This creates a struct that has the length at the start, with the 'characters' element usable as a char* just as you would with your current struct. The difference, however, is that you can allocate only a single item on the heap for each string, instead of two. Allocate your strings like this:

mystr = malloc(sizeof(String) + strlen(cstring))

Eg - the length of the struct (which is just the size_t) plus enough space to put the actual string after it.

If you don't want to use C99, you can also do this with "char characters[1]" and subtract 1 from the length of the string to allocate.

Nick Johnson