I'm writing a language interpreter in C, and my string type contains a length attribute, like so:
    struct String
    {
        char* characters;
        size_t length;
    };
Because of this, I have to spend a lot of time in my interpreter handling this kind of string manually since C doesn't include built-in support for it. I've considered switching to simple null-terminated strings just to comply with the underlying C, but there seem to be a lot of reasons not to:
Bounds-checking is built-in if you use "length" instead of looking for a null.
With null termination, you have to traverse the entire string to find its length.
You have to do extra stuff to handle a null character in the middle of a null-terminated string.
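Null-terminated strings deal poorly with some Unicode encodings: UTF-16 and UTF-32 text contains zero bytes that a byte-wise scan mistakes for the terminator.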
Non-null-terminated strings can intern more, i.e. the characters for "Hello, world" and "Hello" can be stored in the same place, just with different lengths. This can't be done with null-terminated strings.
String slicing is cheaper (note: strings are immutable in my language). Of the two versions below, the second (null-terminated) one is obviously slower, and more error-prone: think about adding error-checking of begin and end to both functions.
    struct String slice(struct String in, size_t begin, size_t end)
    {
        struct String out;
        out.characters = in.characters + begin;
        out.length = end - begin;
        return out;
    }
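    #include <stdlib.h>  /* malloc */

    char* slice(char* in, size_t begin, size_t end)
    {
        /* The copy is heap-allocated; the caller must free() it. */
        char* out = malloc(end - begin + 1);
        if (out == NULL)
            return NULL;
        /* size_t index avoids a signed/unsigned comparison with end - begin */
        for (size_t i = 0; i < end - begin; i++)
            out[i] = in[i + begin];
        out[end - begin] = '\0';
        return out;
    }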
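To make a few of the points above concrete, here is a minimal sketch of what the length-carrying version might look like with error-checking of begin and end added. The helper names (`string_from`, `slice_checked`, `string_equal`) are hypothetical, not taken from my interpreter, and I have assumed a `const char*` field since the strings are immutable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct String
{
    const char* characters;  /* assumption: const, because strings are immutable */
    size_t length;           /* number of bytes; no terminator is required */
};

/* Hypothetical helper: wraps an existing buffer without copying it. */
struct String string_from(const char* data, size_t length)
{
    assert(data != NULL || length == 0);
    struct String s = { data, length };
    return s;
}

/* Bounds-checked slice: returns false instead of producing an
   out-of-range view. On success the result shares the input's storage. */
bool slice_checked(struct String in, size_t begin, size_t end,
                   struct String* out)
{
    if (begin > end || end > in.length)
        return false;
    out->characters = in.characters + begin;
    out->length = end - begin;
    return true;
}

/* Compares length and bytes, so an embedded '\0' is ordinary data. */
bool string_equal(struct String a, struct String b)
{
    return a.length == b.length
        && (a.length == 0 || memcmp(a.characters, b.characters, a.length) == 0);
}
```

With this, slicing "Hello, world" from 0 to 5 gives a "Hello" that points at the same characters as the original, which is the storage sharing described in the interning point. The range check is two comparisons against a known length, whereas the null-terminated version would first have to scan for the terminator to learn where the string ends.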
After all this, my thinking is no longer about whether I should use null-terminated strings: I'm thinking about why C uses them!
So my question is: are there any benefits to null-termination that I'm missing?
From Joel's Back to Basics:
Why do C strings work this way? It's because the PDP-7 microprocessor, on which UNIX and the C programming language were invented, had an ASCIZ string type. ASCIZ meant "ASCII with a Z (zero) at the end."
Is this the only way to store strings? No, in fact, it's one of the worst ways to store strings. For non-trivial programs, APIs, operating systems, class libraries, you should avoid ASCIZ strings like the plague.