views:

329

answers:

4

I’ve googled quite a bit, but I can’t find information on how variable-length strings are generally implemented in higher-level languages. I’m creating my own such language, and am not sure where to start with strings.

I have a struct describing the string type, and then a create function that allocates such a ‘string’:

/* A safer `strcpy()`, using `strncpy()` and `sizeof()` */
#define STRCPY(TO, FROM) \
  strncpy(TO, FROM, sizeof(TO)); TO[sizeof(TO) - 1] = '\0'

struct string {
  // …
  char  native[1024];
};

string String__create(char native[]) {
  string this = malloc(sizeof(struct string));

  // …
  STRCPY(this->native, native);

  return this;
}

However, that would only allow 1kb-long strings. That’s sort of silly, and a huge waste of memory in most cases.

Given that I have to declare the memory to be used somehow… how do I go about implementing a string that can (efficiently) store an (effectively) unbounded number of characters?

+2  A: 

realloc will relocate your memory to a bigger buffer -- simply using this command will allow you to resize your string. Use the following struct for your string:

struct mystring
{
    char * string;
    size_t size;
};

The important part being keeping a track of the size of your string, and having every string manipulation function testing if the size makes sense.

You could pre-allocate a large buffer and keep adding to it, and only realloc when said buffer is full -- you have to implement all the functions for this. It is preferable (far less error prone, and more secure) to mutate string by moving from one immutable string to another (using the semantics of realoc).

Hassan Syed
Yes, but how can I declare this in the struct? Could you include some code, based on the example I provided? I’d be much appreciative.
elliottcable
Ah, I spoke too soon. Thanks for the edit, and useful information. (-:
elliottcable
+3  A: 

The common approach is to have a field for length and a pointer to a dynamically allocated region of memory to hold the characters.

typedef struct string {
    size_t length;
    unsigned char *native;
} string_t;

string_t String__create(char native[]) {
    string_t this;
    this.length = strlen(native);
    this.native = malloc(this.length+1);
    if (this.native) {
        strncpy(this.native, native, this.length+1);
    } else {
        this.length = 0;
    }
    return this;
}

If you want to dynamically allocate the string_t:

string_t* String__create(char native[]) {
    string_t* this;
    if (this = malloc(sizeof(*this))) {
        this->length = strlen(native);
        this->native = malloc(this->length+1);
        if (this->native) {
            strncpy(this->native, native, this->length+1);
        } else {
            free(this);
            this=NULL;
        }
    }
    return this;
}
void String__delete(string_t** str) {
    free((*str)->native);
    free((*str));
    *str=NULL;
}
outis
+6  A: 

Many C++ std::string implementations now use a "Small String Optimization". In pseudo-code:

struct string {
    Int32 length
    union {
        char[12] shortString
        struct {
           char* longerString
           Int32 heapReservedSpace
        }
    }
}

The idea is that string up to 12 characters are stored in the shortString array. The entire string will be contiguous and use only a single cache line. Longer strings are stored on the heap. This leaves you with 12 spare bytes in the string object. The pointer doesn't take all of that, so you can also remember how much memory you've allocated on the heap (>=length). That helps to support scenario's in which you grow a string in small increments.

MSalters
I don’t quite understand how that’s legal; how do you access the struct portion of the union, if it has no name?
elliottcable
Also, this solution is really neat. I’m *definitely* going to use this—+1, and thanks for responding ^_^
elliottcable
Hey, you're designing the language - the language itself doesn't need to access things by name. Furthermore, your language could support the "anonymous struct" rule: names of an anonymous struct are visible in the enclosing scope instead. Same thing as the anonymous union I used, really.
MSalters
Oh, you misunderstood me. I’m not writing a C-like language; I’m writing a language interpreter *in* C. I was asking how, in C, I access the anonymous struct/unions you used there.**Edit:** Ah, scratch that. I just did a bit of googling, found out it’s a C++ feature. Thanks!
elliottcable
@elliottcable: This answer talks about how this is typically done in C++, which also allows anonymous unions. In C, just add an inner name to the union member of the struct and you'll be fine.
unwind
+3  A: 

In addition to what others have told you, I'd also include the concept of "capacity": It is not possible to know the size of the allocated vector in memory, you must store it. If you do a sizeof of the String struct, it will return you 4 bytes * 3 numeric fields = 12 bytes (probably bigger due to padding used in structures). Also, you cannot get the length of allocated memory through sizeof.

typedef struct _mystring {
        char * native;
        size_t size;
        size_t capacity;
} String;

This way, capacity always bears the actual size of the chunk in which your string is. Say that your string goes shorter: you don't have to realloc in order to get an exact match between the capacity and the size of your string. Also, you can alloc from the beginning the characters you expect the string to have, and not the characters the initial string has. Finally, you can mimic the C++ string char dynamic vector and double capacity each time the string grows beyond the capacity limit. All of these will keep memory operations to a minimum, which will translate in better performance (you will also waste some memory, but never as much as 1Kb).

String String__create(char native[], size_t capacity) {
  String this;

  this.size = strlen( native );
  if ( capacity < ( this.size + 1 ) )
        this.capacity = this.size + 1;
  else  this.capacity = capacity;

  this.native = (char *) malloc( capacity * sizeof( char ) );
  strcpy( this.native, native );

  return this;
}

String * String__set(String *this, char native[]) {
    this->size = strlen( native );

    if ( this->size >= this->capacity ) {
        do {
            this->capacity <<= 1;
        } while( this->size > this->capacity );

        this->native = realloc( this->native, this->capacity );
    }

    strcpy( this->native, native );

    return this;
}

void String__delete(String *this)
{
    free( this->native );
}
Baltasarq
Why is this necessary/useful? Cannot I simply ‘sizeof’ the string to get the current capacity?
elliottcable
Actually not. If you do a sizeof of the String struct, it will return you 4 bytes * 3 numeric fields = 12 bytes (probably bigger due to padding used in structures). Also, you cannot get the length of allocated memory through sizeof.
Baltasarq
Ah, forgive me. I didn’t know that. I can’t upvote your answer again unless you edit it; if you do so, I’ll gladly add a +1 (-:
elliottcable