Is it possible to globally disable NUL-terminated strings in GCC?

I am using my own string library, and I have absolutely no need for the final NUL characters as it already stores the proper length internally in a struct.

However, if I wanted to append 10 strings, this would mean that 10 bytes are allocated unnecessarily on the stack. With wide strings it is even worse: on x86, 40 bytes are wasted; on x86_64, 80 bytes!

I defined a macro to add those stack-allocated strings to my struct:

#define AppendString(ppDest, pSource) \
  AppendSubString(ppDest, (*ppDest)->len + 1, pSource, 0, sizeof(pSource) - 1)

Using sizeof(...) - 1 works quite well but I am wondering whether I could get rid of NUL termination in order to save a few bytes?
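
To illustrate the sizeof(...) - 1 idiom from the macro above, here is a minimal sketch. The names FixedStr, LITERAL_LEN, fixed_append and FIXED_APPEND_LITERAL are hypothetical and are not part of the asker's library; the fixed 32-byte buffer is an assumption made only to keep the example short.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical length-prefixed string; not the asker's real struct. */
typedef struct {
    size_t len;
    char   buf[32];
} FixedStr;

/* Length of a string literal without its terminating NUL.
   Evaluated at compile time; sizeof counts the NUL, so subtract 1. */
#define LITERAL_LEN(lit) (sizeof(lit) - 1)

/* Append len bytes from src. Returns 0 on success, -1 if they do not fit.
   No terminator is read or written. */
static int fixed_append(FixedStr *dest, const char *src, size_t len)
{
    if (len > sizeof dest->buf - dest->len)
        return -1;
    memcpy(dest->buf + dest->len, src, len);
    dest->len += len;
    return 0;
}

/* Only valid for literals or true arrays; with a pointer argument,
   sizeof would yield the pointer size instead of the string size. */
#define FIXED_APPEND_LITERAL(dest, lit) \
    fixed_append((dest), (lit), LITERAL_LEN(lit))
```

With this, FIXED_APPEND_LITERAL(&s, "foo") copies exactly three bytes. The NUL still exists in the literal itself, but it is never copied into the destination.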

+3  A: 

This is pretty awful, but you can explicitly specify the length of every character array constant:

char my_constant[6] = "foobar";
assert(sizeof my_constant == 6);

wchar_t wide_constant[6] = L"foobar";
assert(sizeof wide_constant == 6*sizeof(wchar_t));
intgr
How about the empty string?
Sinan Ünür
You could make it a macro: `#define NEW_STRING(var, val) char var[sizeof(val)-1] = val`
Matt B.
If your own string type is defined as a struct of `{length; pointer}` then you might as well use length=0 and pointer=NULL -- accessing pointer would be invalid anyway, because there are no characters to read.
intgr
@intgr: With your code I am getting: "error: wide character array initialized from non-wide string"
@Sinan Ünür: Even empty strings are NUL-terminated. That means that *sizeof() - 1* results in 0, as expected. In my case AppendSubString() would stop right away and not add anything.
@Matt B.: I am aware of it, but it does not actually solve my problem. The AppendString() macro is used everywhere in my code like this: *AppendString(* As you can see, I am not using any variables for this purpose.
@timn: Oops, I forgot that you need to prefix `L""` to wide-character literals. I updated my example.
intgr
Matt B.
Matt B.: Ah, that's interesting. Now I can continue with the current approach without having any bad feelings concerning the memory usage. :)
+1  A: 

I understand you're only dealing with strings declared in your program:

 ....
 char str1[10];
 char str2[12];
 ....

and not with text buffers you allocate with malloc() and friends; otherwise sizeof is not going to help you.

Anyway, I would think twice about removing the \0 at the end: you would lose compatibility with the C standard library functions.

Unless you are going to rewrite every single string function for your library (sprintf, for example), are you sure you want to do it?

Remo.D
I am not using heap-allocated strings because that is exactly what my string library is for. Manually allocating memory is too dangerous, as there is always the risk of buffer overflows. As stated above, rewriting the string functions was not really hard. Thanks to C99, I might use a simple hack to keep compatibility with glibc functions: char tmp[string->len + 1]; tmp[string->len] = '\0'; printf("%s", tmp);
I'm sure you have good reasons for doing this, but if I understand how buffer overrun exploits work, having the strings on the stack is more dangerous than having them allocated on the heap. About adding the '\0' at the end: I don't see how the content of string is copied into tmp, and I think that you're using a gcc extension rather than C99.
Remo.D
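
For reference, here is a minimal sketch of two ways to hand a counted string to the standard formatting functions. The CountedStr type and both helper names are hypothetical. The first helper relies on the standard "%.*s" conversion, whose precision caps the number of bytes read, so no terminator and no temporary copy are needed. The second builds a terminated copy, which is what the tmp hack in the comments is aiming for; a variable-length array such as tmp is part of C99 itself (optional since C11).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical counted string: len bytes at buf, no terminating NUL. */
typedef struct {
    size_t      len;
    const char *buf;
} CountedStr;

/* Format a counted string into out without needing a terminator.
   The precision given through '*' limits how many bytes are read.
   Returns what snprintf returns. */
static int counted_format(char *out, size_t out_size, const CountedStr *s)
{
    return snprintf(out, out_size, "[%.*s]", (int)s->len, s->buf);
}

/* Build a NUL-terminated copy; dst must have room for s->len + 1 bytes. */
static void counted_to_cstr(char *dst, const CountedStr *s)
{
    memcpy(dst, s->buf, s->len);   /* copy the characters first */
    dst[s->len] = '\0';            /* last valid index is len, not len + 1 */
}
```

Note that "%.*s" still stops early at an embedded NUL byte, so it only suits strings that do not contain one.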
A: 

I can't remember the details, but when I do

char my_constant[5]

it is possible that it will reserve 8 bytes anyway, because some machines can't address the middle of a word.

It's nearly always best to leave this sort of thing to the compiler and let it handle the optimisation for you, unless there is a really, really good reason not to.

MatthieuF
It's called "alignment" (also "internal fragmentation"). It's true that throwing away the NUL byte does not reduce most strings, but when it does, it reduces by the alignment size. So, on average, each string would still consume 1 byte less memory.
intgr
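
The arithmetic behind that comment can be sketched as follows. This is only a model: it assumes every string is padded up to a multiple of some alignment, which a given compiler and target may or may not do for char arrays. Both function names are hypothetical.

```c
#include <stddef.h>

/* Round size up to the next multiple of align.
   align must be a power of two. */
static size_t round_up(size_t size, size_t align)
{
    return (size + align - 1) & ~(align - 1);
}

/* Bytes saved by dropping the NUL from a string of len characters,
   under the assumption that storage is padded to a multiple of align. */
static size_t bytes_saved(size_t len, size_t align)
{
    return round_up(len + 1, align) - round_up(len, align);
}
```

With align = 4, lengths 0, 1, 2 and 3 save 4, 0, 0 and 0 bytes respectively, which averages out to 1 byte per string, in line with the comment above.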
A: 

Indeed, this is only worthwhile if you are really low on memory. Otherwise I don't recommend doing it.

The most proper way to do what you are describing seems to be:

  • Prepare a minimal 'listing' file of the form:
    string1_constant_name "str1"
    string2_constant_name "str2"
    ...
  • Write a utility which processes that file and generates declarations such as
    const char string1_constant_name[4] = "str1";

Of course, I would not recommend doing this by hand, because you can get into trouble after any string change.

You now have non-terminated strings, thanks to the fixed-size auto-generated arrays, and sizeof() still works for every variable. This solution seems acceptable.

The benefits are easy localization, the possibility of adding checks that lower the risk of this approach, and savings in the read-only data segment.

The drawback is that all of these string constants must be included in every module (as an include, so that sizeof() is known). So this only makes sense if your linker merges such symbols (some don't).
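
The core of such a utility could look roughly like the sketch below. It converts one listing line into one declaration. The function name convert_entry and the 63-character limits are assumptions, and the sketch does not handle empty values or characters that need escaping.

```c
#include <stdio.h>
#include <string.h>

/* Convert one listing line of the form
       name "value"
   into a declaration whose array size omits the NUL terminator.
   Returns 0 on success, -1 if the line does not match.
   Assumes value is non-empty and contains no quotes or escapes. */
static int convert_entry(const char *entry, char *out, size_t out_size)
{
    char name[64], value[64];

    if (sscanf(entry, "%63s \"%63[^\"]\"", name, value) != 2)
        return -1;

    snprintf(out, out_size, "const char %s[%zu] = \"%s\";",
             name, strlen(value), value);
    return 0;
}
```

A real generator would call this for every line of the listing and write the results into a header file as part of the build.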

Roman Nikitchenko
Nice idea! Unfortunately it requires me to always pull the whole code through a "preprocessor" before compiling. If there is really no such option in GCC to turn off NUL termination, I will stick with my current approach.
A: 

If you're not using any of the Standard Library functions that deal with strings, you can forget about the NUL terminating byte.

No strlen(), no fgets(), no atoi(), no strtoul(), no fopen(), no printf() with the %s conversion specifier ...

Declare your "not quite C strings" with just the needed space:

struct NotQuiteCString { /* ... */ };

struct NotQuiteCString variable;
variable.data = malloc(5); /* room for 5 characters, no terminator */
variable.data[0] = 'h'; /* ... */ variable.data[4] = 'o'; /* "hello" */
pmg
Basically, this is more or less what I am currently doing. It might indeed be better to keep NUL termination turned on in general, but perhaps there is something like a pragma that would let me enable or disable the NUL-terminating byte for a given part of the code, i.e. my string library functions?
Just don't use it. You use arrays of `int` every day, many times a day ... and you have never once used a terminating `int`. Do the same with `char` arrays (when I do that I specifically "sign" my `char`s, using `signed char` or `unsigned char`: to me, only plain `char` can be `C` strings).
pmg
A: 

Aren't these similar to Pascal-style strings, or Hollerith strings? I think this is only useful if you actually want the string data to preserve NULs, in which case you're really pushing around arbitrary memory, not "strings" per se.

Fletch
Yes, I am using them in a similar way to Hollerith strings. In my code it is defined as follows:

typedef struct {
    unsigned int len;
    unsigned int maxLen;
    char buf[0];
}

There are other advantages as well. Wikipedia (taken from "String literal") says:

  • eliminates text searching (for the delimiter character) and therefore requires significantly less overhead
  • avoids the (100% programmer induced) problem of delimiter collision
  • enables the inclusion of metacharacters that might otherwise be mistaken as commands
  • can be used for quite effective data compression of plain text strings
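
As a sketch of how a struct of this shape is typically allocated, the example below uses the standard C99 flexible array member (buf[]) in place of the GNU zero-length array (buf[0]). The type name String and the helper string_new are assumptions, not the commenter's actual code.

```c
#include <stdlib.h>
#include <string.h>

/* Same fields as the struct in the comment, but with a C99 flexible
   array member. The name String is hypothetical. */
typedef struct {
    unsigned int len;
    unsigned int maxLen;
    char buf[];
} String;

/* Allocate the header and the characters in one block.
   Exactly len bytes are stored; no NUL terminator is added.
   Returns NULL if the allocation fails. */
static String *string_new(const char *src, unsigned int len)
{
    String *s = malloc(sizeof *s + len);
    if (s == NULL)
        return NULL;
    s->len = len;
    s->maxLen = len;
    memcpy(s->buf, src, len);
    return s;
}
```

Because the length is stored explicitly, the data may contain NUL bytes: string_new("a\0b", 3) keeps all three bytes. The caller releases the block with free().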