C/C++ Why to use unsigned char for binary data?

Question 1

c++ c character-encoding bytebuffer rawbytestring

nightlytrails · Nov 30, 2012 · Viewed 27.7k times · Source

Answer

In C, unsigned char is the only data type that has all three of the following properties simultaneously:

it has no padding bits, that is, all storage bits contribute to the value of the data
no bitwise operation starting from a value of that type, when converted back into that type, can produce overflow, trap representations or undefined behavior
it may alias other data types without violating the "aliasing rules", that is, access to the same data through a pointer of a different type is guaranteed to see all modifications

If these are the properties of a "binary" data type you are looking for, you definitely should use unsigned char.
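As a minimal sketch of the aliasing property, the helper below reads the bytes of an arbitrary object through a pointer to unsigned char. The names xor_bytes and checksum_u32 are hypothetical and only serve the illustration; the XOR checksum was chosen because its result does not depend on byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: XOR together all bytes of an arbitrary object. */
static unsigned char xor_bytes(const void *obj, size_t size)
{
    /* A pointer to unsigned char may inspect the bytes of any object. */
    const unsigned char *p = obj;
    unsigned char acc = 0;

    for (size_t i = 0; i < size; ++i)
        acc ^= p[i];
    return acc;
}

/* Hypothetical wrapper: checksum of the bytes of a 32-bit value. */
static unsigned char checksum_u32(uint32_t value)
{
    return xor_bytes(&value, sizeof value);
}
```

Calling checksum_u32(0x01020304u) XORs the four bytes 01, 02, 03 and 04 in whatever order the platform stores them, which gives 0x04 either way.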

For the second property we need a type that is unsigned. For such types, all conversions are defined with modulo arithmetic, here modulo UCHAR_MAX+1, which is 256 on the vast majority of architectures. Converting a wider value to unsigned char therefore just corresponds to truncation to the least significant byte.
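The modulo rule can be sketched as follows. The helper names low_byte and from_int are hypothetical, and the values in the usage note assume an 8-bit byte (UCHAR_MAX equal to 255).

```c
#include <limits.h>

/* Hypothetical helper: keep only the least significant byte of a wider value. */
static unsigned char low_byte(unsigned long wide)
{
    /* Conversion to an unsigned type reduces the value modulo UCHAR_MAX + 1. */
    return (unsigned char)wide;
}

/* Hypothetical helper: the same rule applies to negative inputs. */
static unsigned char from_int(int value)
{
    /* -1 becomes UCHAR_MAX, -2 becomes UCHAR_MAX - 1, and so on. */
    return (unsigned char)value;
}
```

With 8-bit bytes, low_byte(0x1234UL) yields 0x34, and from_int(-1) yields UCHAR_MAX.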

The two other character types generally don't work the same way. signed char is signed, so converting a value that doesn't fit into it gives an implementation-defined result (or raises an implementation-defined signal). char is not fixed to be signed or unsigned, so on a platform to which your code is ported it might be signed even if it is unsigned on yours.
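To illustrate the difference, the sketch below stores the same 8 bits in an unsigned char and in a signed char and returns the value each type reports. The function names are hypothetical, and the signed result in the usage note assumes an 8-bit, two's complement signed char, which is what mainstream platforms provide.

```c
#include <string.h>

/* Hypothetical helper: the byte's value when held in an unsigned char. */
static int value_as_unsigned_char(unsigned char byte)
{
    return byte;            /* always in the range 0..UCHAR_MAX */
}

/* Hypothetical helper: the same bits reinterpreted as a signed char. */
static int value_as_signed_char(unsigned char byte)
{
    signed char s;

    /* memcpy copies the bit pattern without any value conversion. */
    memcpy(&s, &byte, 1);
    return s;               /* may be negative when the top bit is set */
}
```

Under the stated assumption, the byte 0xF0 reads as 240 through unsigned char but as -16 through signed char, so a comparison such as c[0] == 0xF0 only behaves as expected for the unsigned type.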

Question 2

Is it really necessary to use unsigned char to hold binary data as in some libraries which work on character encoding or binary buffers? To make sense of my question, have a look at the code below -

char c[5], d[5];
c[0] = 0xF0;
c[1] = 0xA4;
c[2] = 0xAD;
c[3] = 0xA2;
c[4] = '\0';

printf("%s\n", c);
memcpy(d, c, 5);
printf("%s\n", d);

Both printf calls output 𤭢 correctly, where f0 a4 ad a2 is the UTF-8 encoding, in hex, of the Unicode code point U+24B62 (𤭢).

memcpy also correctly copied the bits held by a char.

What reasoning could possibly advocate the use of unsigned char instead of a plain char?

In other related questions unsigned char is highlighted because it is the only (byte/smallest) data type which is guaranteed to have no padding by the C-specification. But as the above example showed, the output doesn't seem to be affected by any padding as such.

I have used VC++ Express 2010 and MinGW to compile the above. Although VC gave the warning

warning C4309: '=' : truncation of constant value

the output doesn't seem to reflect that.

P.S. This could be marked as a possible duplicate of "Should a buffer of bytes be signed or unsigned char buffer?", but my intent is different. I am asking why something that seems to work just as well with char should be typed unsigned char.

Update: To quote from N3337,

Section 3.9 Types

2 For any object (other than a base-class subobject) of trivially copyable type T, whether or not the object holds a valid value of type T, the underlying bytes (1.7) making up the object can be copied into an array of char or unsigned char. If the content of the array of char or unsigned char is copied back into the object, the object shall subsequently hold its original value.
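The round trip described in the quoted paragraph might look like the sketch below. It is written in C rather than C++, on the assumption that the same memcpy idiom applies, and the name round_trip is hypothetical.

```c
#include <string.h>

/* Hypothetical helper: copy a double's bytes into a buffer and back again. */
static double round_trip(double in)
{
    unsigned char bytes[sizeof in];  /* byte buffer the same size as the object */
    double out;

    memcpy(bytes, &in, sizeof in);   /* object -> array of unsigned char */
    memcpy(&out, bytes, sizeof out); /* array of unsigned char -> object */
    return out;                      /* holds the original value again */
}
```

This only moves bytes around. It never interprets an individual byte as a number, which is the step where signedness starts to matter.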

In view of the above, and given that my original example was on an Intel machine where char defaults to signed char, I am still not convinced that unsigned char should be preferred over char.

Anything else?
