I’ve googled quite a bit, but I can’t find information on how variable-length strings are generally implemented in higher-level languages. I’m creating my own such language, and am not sure where to start with strings.
I have a struct describing the string type, and then a create function that allocates such a ‘string’:
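#include <stdlib.h>
#include <string.h>

/* A safer `strcpy()`, using `strncpy()` and `sizeof()`.
   Only valid when TO is an array, not a pointer. */
#define STRCPY(TO, FROM) \
  do { strncpy(TO, FROM, sizeof(TO)); TO[sizeof(TO) - 1] = '\0'; } while (0)

struct string {
  // …
  char native[1024];
};
typedef struct string *string; /* handle type returned by String__create() */

string String__create(char native[]) {
  string this = malloc(sizeof(struct string));
  if (this == NULL) return NULL; /* allocation failed */
  // …
  STRCPY(this->native, native);
  return this;
}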
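However, that only allows strings of up to 1 KB (1023 characters plus the terminating NUL), and every string takes the full 1 KB no matter how short it is. That’s sort of silly, and a huge waste of memory in most cases.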
Given that I have to declare the memory to be used somehow… how do I go about implementing a string that can (efficiently) store an (effectively) unbounded number of characters?
Many C++ std::string implementations now use a "Small String Optimization". In pseudo-code:
struct string {
    Int32 length
    union {
        char[12] shortString
        struct {
            char* longerString
            Int32 heapReservedSpace
        }
    }
}
The idea is that strings of up to 12 characters are stored in the shortString array. The entire string will be contiguous and use only a single cache line. Longer strings are stored on the heap. In that case the 12 bytes of the union are spare in the string object. The pointer doesn't take all of that, so you can also remember how much memory you've allocated on the heap (>= length). That helps to support scenarios in which you grow a string in small increments.
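To make the pseudo-code concrete, here is a minimal C sketch of the same layout with an append operation. It is an illustration under my own assumptions, not how any particular std::string is implemented: the names sso_string, sso_init, sso_data, sso_append and sso_free are hypothetical, the string is length-counted rather than NUL-terminated, the capacity-doubling growth policy is just one common choice, and length overflow and appending a string to itself are not handled. Note also that on a typical 64-bit ABI, alignment padding will likely make this struct larger than the 16 bytes the pseudo-code suggests; real implementations pack the fields more carefully.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SSO_CAP 12 /* inline capacity, matching shortString in the pseudo-code */

typedef struct {
    int32_t length;                 /* number of characters stored */
    union {
        char short_string[SSO_CAP]; /* active while length <= SSO_CAP */
        struct {
            char   *longer_string;  /* heap buffer, active once length > SSO_CAP */
            int32_t heap_reserved;  /* bytes allocated on the heap (>= length) */
        } heap;
    } u;
} sso_string;

static void sso_init(sso_string *s) { s->length = 0; }

/* Pointer to the characters, wherever they currently live. */
static const char *sso_data(const sso_string *s) {
    return s->length <= SSO_CAP ? s->u.short_string : s->u.heap.longer_string;
}

/* Append n bytes from src. Returns 0 on success, -1 if allocation fails. */
static int sso_append(sso_string *s, const char *src, int32_t n) {
    assert(n >= 0);
    int32_t new_len = s->length + n;
    if (new_len <= SSO_CAP) {
        /* Still fits inline: no heap traffic at all. */
        memcpy(s->u.short_string + s->length, src, (size_t)n);
    } else if (s->length <= SSO_CAP) {
        /* First spill to the heap: copy the inline bytes out BEFORE
           overwriting the union with the pointer and capacity. */
        int32_t cap = new_len * 2;
        char *buf = malloc((size_t)cap);
        if (buf == NULL) return -1;
        memcpy(buf, s->u.short_string, (size_t)s->length);
        memcpy(buf + s->length, src, (size_t)n);
        s->u.heap.longer_string = buf;
        s->u.heap.heap_reserved = cap;
    } else {
        /* Already on the heap: reallocate only when the reserve runs out,
           so many small appends need few reallocations. */
        if (new_len > s->u.heap.heap_reserved) {
            int32_t cap = new_len * 2;
            char *buf = realloc(s->u.heap.longer_string, (size_t)cap);
            if (buf == NULL) return -1;
            s->u.heap.longer_string = buf;
            s->u.heap.heap_reserved = cap;
        }
        memcpy(s->u.heap.longer_string + s->length, src, (size_t)n);
    }
    s->length = new_len;
    return 0;
}

/* Release the heap buffer, if any, and reset to the empty string. */
static void sso_free(sso_string *s) {
    if (s->length > SSO_CAP) free(s->u.heap.longer_string);
    s->length = 0;
}
```

Typical use would be to call sso_init on a stack-allocated sso_string, call sso_append as often as needed, read the bytes through sso_data together with the length field, and finish with sso_free. Because the length alone decides which union member is active, a string must never be shrunk back to SSO_CAP or fewer characters without first moving its bytes back inline and freeing the heap buffer.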