ansaurus

Question

Why null-terminated strings? Or: null-terminated vs. characters + length storage.

Answer 1

A:

I think main reason is that standard says nothing concrete about size of any type other than char. But sizeof(char) = 1 and that is definitely not enough for string size.

Kirill V. Lyadvinsky 2009-08-10 06:07:18

It's enough for 2^CHAR_BIT characters per string.

Daniel Earwicker 2009-08-10 06:17:05

It is only 255 characters. Too little.

Kirill V. Lyadvinsky 2009-08-10 06:31:36

No, it's enough for 2^CHAR_BIT - 1 characters per string, and if you've never had to deal with strings longer than 255 characters then you've only had to program for a very limited problem domain. However C *does* say concrete things about other integral types - for example int has to have a range of at least -32767 to +32767. In particular C says that size_t must be able to hold the size of any object, so that would be fine as a standardised string length.

caf 2009-08-10 06:35:36

@caf, you aren't thinking creatively enough if you can't see how to represent that extra unit of character length. :)

Daniel Earwicker 2009-08-10 07:32:23

@Jla3ep Did you look at my code sample? I used size_t to store the length for a reason.

Imagist 2009-08-10 07:37:13

@caf I think Earwicker means that you can have the `characters` pointer point to `NULL`.

Imagist 2009-08-10 07:39:09

...for a zero-length string (I hate that you can't edit comments!).

Imagist 2009-08-10 07:39:54

@Imagist, you asked why C uses null-terminate strings. I believe that size of `size_t` was equal to 1 in the beginning.

Kirill V. Lyadvinsky 2009-08-10 08:04:51

Answer 2

+3 A:

One benefit is that with null-termination any tail of a null-terminated string is also a null-terminated string. If you need to pass a substring starting with Nth character (provided there's no buffer overrun) into some string-handling function - no problem, just pass the offseeted address there. When storing size in some other way you would need to construct a new string.

sharptooth 2009-08-10 06:09:01

Can you give an example of a string of that you might want to print the end of the string?

2009-08-10 06:12:14

That can be used when concatenating strings - you could want to append not the entire string, but only the tail of it. Then you call strcat( target, source + offset); - and it's done.

sharptooth 2009-08-10 06:16:35

Take a front trim of white space. You can determine the first non-white space char and instead of actually changing the string, you can just pass the starting offset in, saving you either allocating new memory, or copying data.

Dan McGrath 2009-08-10 06:17:01

That's not all that different for what I do with my struct: `struct String new; new.characters = old.characters + offset; new.length = old.length - offset;` It's a bit of bookkeeping but comes out to what, 5 instructions? This seems trivial compared to the difference if you needed to do something to the beginning of the string instead of the end.

Imagist 2009-08-10 08:02:16

It makes it really easy to do things like recursive string matching, spelling correction, etc., if you can treat the string like you would a list in Lisp.

Mike Dunlavey 2009-08-13 00:57:24

Answer 3

+7 A:

The usual solution is to do both - keep the length and maintain the null terminator. It's not much extra work and means that you are always ready to pass the string to any function.

Null-terminated strings are often a drain on performance, for the obvious reason that the time taken to discover the length depends on the length. On the plus side, they are the standard way of representing strings in C, so you have little choice but to support them if you want to use most C libraries.

Daniel Earwicker 2009-08-10 06:12:14

This is what Lua does. It makes interfacing to C for normal use cases very clean, and still supports arbitrary length binary buffers.

RBerteig 2009-08-10 06:36:20

It's what most things do! You don't even have to maintain the null terminator all the time - just do `str[len] = '\0'` whenever you need to. This is what `std::string::c_str`` usually does in C++.

Daniel Earwicker 2009-08-10 06:40:12

By most things, I mean most string classes, and most interpreter string representations. One widely-used example on Windows is the BSTR type.

Daniel Earwicker 2009-08-10 06:41:57

This is exactly why I asked this question; I thought I might be missing some solution. It seems obvious now but I didn't think of this!

Imagist 2009-08-10 07:29:23

Cool! Well, you see that green check icon next to my answer...?

Daniel Earwicker 2009-08-10 07:34:31

+1 After all it is the c++ designers solution :)

neuro 2009-08-10 08:31:28

@Earwicker, lol on the green check!

Mark Harrison 2009-08-11 09:13:12

@Earwicker I was planning on checking that, I just like to wait a few days. There are few things more annoying to me than people who accept an answer within minutes of asking the question.

Imagist 2009-08-11 12:31:38

:) It means more if you have to wait for something, I think!

Daniel Earwicker 2009-08-11 14:32:27

Answer 4

+1 A:

Although I prefer the array + len method in most cases, there are valid reasons for using null-terminated.

Take a 32-bit system.

To store a 7 byte string
char * + size_t + 8 bytes = 19 bytes

To store a 7 byte null-term string
char * + 8 = 16 bytes.

null-term arrays don't need to be immutable like your strings do. I can happily truncate the c-string by simply places a null char. If you code, you would need to create a new string, which involves allocating memory.

Depending on the usage of the strings, your strings will never be able to match the performance possible with c-strings as opposed to your strings.

Dan McGrath 2009-08-10 06:14:04

You can truncate a string-with-length by just reducing the length. Usually this means you have two lengths - the current length of the string plus the amount of memory you have currently allocated for the string. This allows it to be dynamically resized without having to realloc on every modification.

Jason Williams 2009-08-10 06:17:39

Indeed, that is the way to go; I had based my answer on the op's string struct, which would enable you to reduce the length, but never to utilise that space again. Ropes are another interesting way to handle strings. http://en.wikipedia.org/wiki/Rope_(computer_science)

Dan McGrath 2009-08-10 06:52:12

A few questions: assuming a byte is 8 bits, shouldn't a 32-bit system have `sizeof(size_t) == 4` and `sizeof(char*) == 4`? And with my method you don't have to use 8 characters for the first method. So I'm getting: 4 + 4 + 7 = 15 for my method, and 4 + 8 = 12 for the null-terminating method. I'm not contesting your point, just the math that leads to your point.

Imagist 2009-08-10 07:50:29

@Dan McG (comment) My strings will be stored in a garbage-collected system, which will allow me to reclaim the space. And my interpreter uses ropes internally; the garbage collector will use idle cycles to flatten ropes into strings.

Imagist 2009-08-10 07:54:12

Answer 5

+4 A:

Lengths have their problems too.

The length takes extra storage (not such an issue now, but a big factor 30 years ago)
Every time you alter a string you have to update the length, so you get reduced performance across the board
With a nul-terinated string you can still use a length or store a pointer to the last character, so if you are doing lots of string manips, you can still equal the performance of string-with-length.
zero-terminated strings are much simpler - The nul terminator is just a convention used by methods like strcat to determine the end of the string. So you can store them in a regular char array rather than having to use a struct.

Jason Williams 2009-08-10 06:14:20

Extra storage can still be a big issue for embedded systems, which is one reason to stress keeping the language light-weight.

Jimmy 2009-08-10 06:45:29

@Jimmy My issue there is: on such embedded systems, why are you using strings? I don't think I even ever used a `char` back when I was doing robotics programming. The only example I can think of is if you're programming for an LED display (like those scrolling text things or the ones on soft drink machines) but the functionality there is so simple that I still have a difficult time imagining the extra 3 bytes being a problem (4 byte int - 1 byte since you don't have to store the null character).

Imagist 2009-08-10 07:34:37

What you suggest is not that trivial btw. Who will take ownership of the buffer? Will the newly created temporary string do that? I doubt you want this and then you need a way to have such non-owning strings to avoid copying.

sharptooth 2009-08-10 08:17:28

Answer 6

+3 A:

Just throwing out some hypotheticals:

there's no way to get a "wrong" implementation of null terminated strings. A standardized struct however could have vendor-specific implementations.
no structs are required. Null terminated strings are "built-in" so to speak, by virtue of being a special case of a char*.

Jimmy 2009-08-10 06:14:38

Answer 7

+14 A:

From Joel's Back to Basics:

Why do C strings work this way? It's because the PDP-7 microprocessor, on which UNIX and the C programming language were invented, had an ASCIZ string type. ASCIZ meant "ASCII with a Z (zero) at the end."

Is this the only way to store strings? No, in fact, it's one of the worst ways to store strings. For non-trivial programs, APIs, operating systems, class libraries, you should avoid ASCIZ strings like the plague.

2009-08-10 06:15:08

+1 Back to basics is an excellent post.

Brian Rasmussen 2009-08-10 06:31:36

Denis Ritchie opinion is somewhat different. BCPL had a length+content representation, with the length contained in one byte. B switched to terminated string "partially to avoid the limitation on the length of a string caused by holding the count in an 8 or 9 bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator." (The Development of the C Language, http://cm.bell-labs.com/cm/cs/who/dmr/chist.pdf)

AProgrammer 2009-08-10 06:40:55

Answer 8

+2 A:

One advantage of nul-terminated strings is that if you are walking through a string character-by-character, you only need to keep a single pointer to address the string:

while (*s)
{
    *s = toupper(*s);
    s++;
}

whereas for strings without sentinels, you need to keep two bits of state around: either a pointer and index:

while (i < s.length)
{
    s.data[i] = toupper(s.data[i]);
    i++;
}

...or a current pointer and a limit:

s_end = s + length;
while (s < s_end)
{
    *s = toupper(*s);
    s++;
}

When CPU registers were a scarce resource (and compilers were worse at allocating them), this was important. Now, not so much.

caf 2009-08-10 06:45:34

"When CPU registers were a scarce resource" - registers are still a scarce resource on x86 and x64.

Jimmy 2009-08-10 06:49:12

I don't get it; if I'm storing string in the `struct` example I gave, why can't I use that as the limit?

Imagist 2009-08-10 07:56:44

The point is that during a processing loop like the above, length-based strings like yours end up using two registers for string book-keeping whereas sentinel-based strings like idiomatic C strings only use one (the other one is obtained "for free" because the character values are being loaded in order to process them anyway).

caf 2009-08-10 09:03:53

Ah, I see; the point is registers not memory.

Imagist 2009-08-10 09:49:42

Answer 9

+1 A:

You're absolutely right that 0-termination is a method which is poor with respect to type checking and performance for part of the operations. The answers on this page already summarize the origins and uses for it.

I liked the way Delphi stored strings. I believe it maintains a length/maxlength in before the (variable length) string. This way the strings can be null-terminated for compatibility.

My concerns with your mechanism: - additional pointer - immutability si in the core parts of your language; normally string types are not immutable so if you ever reconsider than it'll be tough. You'd need to implement a 'create copy on change' mechanism - use of malloc (hardly efficient, but may be included here just for ease?)

Good luck; writing your own interpreter can be very educational in understanding mainly the grammar and syntax of programming languages! (at least, it ws for me)

Adriaan 2009-08-10 09:53:19

_Most_ high level languages have immutable strings.

Nick Johnson 2009-08-20 20:14:52

Answer 10

+1 A:

Slightly offtopic, but there's a more efficient way to do length-prefixed strings than the way you describe. Create a struct like this (valid in C99 and up):

struct String 
{
  size_t length;
  char characters[0];
}

This creates a struct that has the length at the start, with the 'characters' element usable as a char* just as you would with your current struct. The difference, however, is that you can allocate only a single item on the heap for each string, instead of two. Allocate your strings like this:

mystr = malloc(sizeof(String) + strlen(cstring))

Eg - the length of the struct (which is just the size_t) plus enough space to put the actual string after it.

If you don't want to use C99, you can also do this with "char characters[1]" and subtract 1 from the length of the string to allocate.

Nick Johnson 2009-08-20 20:13:46

ansaurus

tags:

views:

answers:

Why null-terminated strings? Or: null-terminated vs. characters + length storage.

related questions