How can a file contain null bytes?

RK. · Jan 5, 2016

How is it possible that files can contain null bytes in operating systems written in a language with null-terminated strings (namely, C)?

For example, if I run this shell code:

$ printf "Hello\00, World!" > test.txt
$ xxd test.txt
0000000: 4865 6c6c 6f00 2c20 576f 726c 6421       Hello., World!

I see a null byte in test.txt (at least on OS X). If C uses null-terminated strings, and OS X is written in C, then why isn't the file terminated at the null byte, leaving it with Hello instead of Hello\00, World!? Is there a fundamental difference between files and strings?

Answer

dbush · Jan 5, 2016

Null-terminated strings are a C construct used to determine the end of a sequence of characters intended to be used as a string. String manipulation functions such as strcmp, strcpy, strchr, and others use this construct to perform their duties.

But you can still read and write binary data that contains null bytes within your program as well as to and from files. You just can't treat them as strings.
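To make the distinction concrete, here is a minimal sketch of the two views of the same buffer. The names sample, length_as_string, and copy_as_bytes are made up for this illustration; only strlen and memcpy are standard library calls.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical sample data: 12 bytes in total, with a null byte after "Hello"
// and the implicit terminator at the very end.
static const char sample[] = "Hello\0World";

// String view: strlen stops counting at the first null byte it meets.
size_t length_as_string(const char *s)
{
    return strlen(s);
}

// Binary view: memcpy copies exactly len bytes and never inspects their values,
// so embedded null bytes are copied like any other byte.
size_t copy_as_bytes(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    return len;
}
```

Calling length_as_string(sample) gives the string length up to the first null byte, while copy_as_bytes(dst, sample, sizeof(sample)) moves the whole array. The difference is that the binary view carries its length separately instead of deriving it from the contents.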

Here's an example of how this works:

#include <stdio.h>
#include <stdlib.h>

int main()
{
    FILE *fp = fopen("out1","wb");  // binary mode, so no newline translation on any platform
    if (fp == NULL) {
        perror("fopen failed");
        exit(1);
    }

    int a1[] = { 0x12345678, 0x33220011, 0x0, 0x445566 };
    char a2[] =  { 0x22, 0x33, 0x0, 0x66 };
    char a3[] = "Hello\x0World";

    // this writes the whole array
    fwrite(a1, sizeof(a1[0]), 4, fp);
    // so does this
    fwrite(a2, sizeof(a2[0]), 4, fp);
    // this does not write the whole array -- only "Hello" is written
    fprintf(fp, "%s\n", a3);
    // but this does
    fwrite(a3, sizeof(a3[0]), 12, fp);
    fclose(fp);
    return 0;
}

Contents of out1:

[dbush@db-centos tmp]$ xxd out1
0000000: 7856 3412 1100 2233 0000 0000 6655 4400  xV4..."3....fUD.
0000010: 2233 0066 4865 6c6c 6f0a 4865 6c6c 6f00  "3.fHello.Hello.
0000020: 576f 726c 6400                           World.

For the first array, because we use the fwrite function and tell it to write 4 elements, each the size of an int, all the values in the array appear in the file. You can see from the output that the values are 32-bit on this system and that each one is written in little-endian byte order. We can also see that the second and fourth elements of the array each contain one null byte, while the third value, 0, consists of 4 null bytes, and all of them appear in the file.
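The byte layout described above can be checked without going through a file. This is only a sketch: the helper names bytes_of_u32, count_null_bytes, and is_little_endian are invented here, and uint32_t is used instead of int so that the width is fixed at 32 bits regardless of platform.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Copy the in-memory representation of a 32-bit value into out[0..3].
void bytes_of_u32(uint32_t value, unsigned char out[4])
{
    memcpy(out, &value, sizeof(value));
}

// Count the zero bytes in a buffer whose length is known up front.
size_t count_null_bytes(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0) {
            count++;
        }
    }
    return count;
}

// Returns 1 if the least significant byte is stored first (little-endian).
int is_little_endian(void)
{
    unsigned char b[4];
    bytes_of_u32(1u, b);
    return b[0] == 1;
}
```

On a little-endian machine, bytes_of_u32(0x33220011, b) should fill b with 11 00 22 33, matching the second group in the xxd output. The null-byte counts do not depend on byte order.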

We also use fwrite on the second array, which contains elements of type char, and we again see that all array elements appear in the file. In particular, the third value in the array is 0, which consists of a single null byte that also appears in the file.

The third array is first written with the fprintf function using a %s format specifier which expects a string. It writes the first 5 bytes of this array to the file before encountering the null byte, after which it stops reading the array. It then prints a newline character (0x0a) as per the format.

The third array is written to the file again, this time using fwrite. The string constant "Hello\x0World" contains 12 bytes: 5 for "Hello", one for the explicit null byte, 5 for "World", and one for the null byte that implicitly ends the string constant. Since fwrite is given the full size of the array (12), it writes all of those bytes. Indeed, looking at the file contents, we see each of those bytes.
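The same holds in the other direction: reading is driven by a byte count, not by a terminator. Below is a rough sketch of a round trip, assuming two made-up helpers, write_bytes and read_bytes, built on the standard fopen, fwrite, fread, and fclose. The path is whatever the caller passes in.

```c
#include <stdio.h>

// Write exactly len bytes from buf to path.
// Returns the number of bytes written, or 0 if the file could not be
// opened or closed cleanly.
size_t write_bytes(const char *path, const void *buf, size_t len)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL) {
        return 0;
    }
    size_t written = fwrite(buf, 1, len, fp);
    if (fclose(fp) != 0) {
        return 0;
    }
    return written;
}

// Read up to cap bytes from path into buf.
// Returns the number of bytes actually read; fread reports a count,
// so null bytes in the file do not end the read early.
size_t read_bytes(const char *path, void *buf, size_t cap)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL) {
        return 0;
    }
    size_t got = fread(buf, 1, cap, fp);
    fclose(fp);
    return got;
}
```

Writing the 12-byte array with write_bytes and reading it back with read_bytes should return 12 both times. Calling strlen on the buffer that was read back would still stop after "Hello", because that is the string view again.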

As a side note, in each of the fwrite calls, I've hardcoded the element count for the third parameter instead of using a more dynamic expression such as sizeof(a1)/sizeof(a1[0]), to make it clearer exactly how many elements are being written in each case.
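For comparison, here is roughly what the dynamic form could look like. ARRAY_LEN and write_sample_arrays are names chosen for this sketch, not part of the program above, and the macro only works on true arrays: applied to a pointer it would silently give the wrong count.

```c
#include <stdio.h>

// Number of elements in a true array (not a pointer).
#define ARRAY_LEN(arr) (sizeof(arr) / sizeof((arr)[0]))

// Hypothetical helper: writes two of the sample arrays to an already open
// stream and returns the total number of elements fwrite reported.
size_t write_sample_arrays(FILE *fp)
{
    int a1[] = { 0x12345678, 0x33220011, 0x0, 0x445566 };
    char a3[] = "Hello\x0World";
    size_t total = 0;

    // ARRAY_LEN(a1) evaluates to 4, the number of int elements
    total += fwrite(a1, sizeof(a1[0]), ARRAY_LEN(a1), fp);
    // ARRAY_LEN(a3) evaluates to 12, including both null bytes
    total += fwrite(a3, sizeof(a3[0]), ARRAY_LEN(a3), fp);
    return total;
}
```

The element counts now follow the array definitions, so adding a value to a1 would not require touching the fwrite call.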