How to implement a variable-length ‘string’-y in C

ELLIOTTCABLE picture ELLIOTTCABLE · Feb 11, 2010 · Viewed 10.5k times · Source

I’ve googled quite a bit, but I can’t find information on how variable-length strings are generally implemented in higher-level languages. I’m creating my own such language, and am not sure where to start with strings.

I have a struct describing the string type, and then a create function that allocates such a ‘string’:

/* A safer `strcpy()`, using `strncpy()` and `sizeof()` */
#define STRCPY(TO, FROM) \
  strncpy(TO, FROM, sizeof(TO)); TO[sizeof(TO) - 1] = '\0'

struct string {
  // …
  char  native[1024];
};

string String__create(char native[]) {
  string this = malloc(sizeof(struct string));

  // …
  STRCPY(this->native, native);

  return this;
}

However, that would only allow 1kb-long strings. That’s sort of silly, and a huge waste of memory in most cases.

Given that I have to declare the memory to be used somehow… how do I go about implementing a string that can (efficiently) store an (effectively) unbounded number of characters?

Answer

MSalters picture MSalters · Feb 11, 2010

Many C++ std::string implementations now use a "Small String Optimization". In pseudo-code:

struct string {
    Int32 length
    union {
        char[12] shortString
        struct {
           char* longerString
           Int32 heapReservedSpace
        }
    }
}

The idea is that string up to 12 characters are stored in the shortString array. The entire string will be contiguous and use only a single cache line. Longer strings are stored on the heap. This leaves you with 12 spare bytes in the string object. The pointer doesn't take all of that, so you can also remember how much memory you've allocated on the heap (>=length). That helps to support scenario's in which you grow a string in small increments.