How to get byte size of multibyte string

flacs · Jul 29, 2010 · Viewed 10.4k times

How do I get the byte size of a multibyte-character string in Visual C? Is there a function or do I have to count the characters myself?

Or, more general, how do I get the right byte size of a TCHAR string?

Solution:

_tcslen(_T("TCHAR string")) * sizeof(TCHAR)

EDIT:
I was talking about null-terminated strings only.

Answer

Thanatos · Jul 29, 2010

Let's see if I can clear this up:

"Multi-byte character string" is a vague term to begin with, but in the world of Microsoft it typically means "not ASCII, and not UTF-16". Thus, you could be using a character encoding that takes 1 byte per character, or 2 bytes, or possibly more. As soon as any character takes more than one byte, the number of characters in the string != the number of bytes in the string.

Let's take UTF-8 as an example, even though it isn't commonly used on MS platforms. The character é is encoded as "c3 a9" in memory: two bytes, but 1 character. If I have the string "thé", it's:

text: t  h  é     \0
mem:  74 68 c3 a9 00

This is a "null terminated" string, in that it ends with a null. If we wanted to allow our string to have nulls in it, we'd need to store the size in some other fashion, such as:

struct my_string
{
    size_t length;
    char *data;
};

... and a slew of functions to help deal with that. (This is sort of how std::string works, quite roughly.)
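To make the idea concrete, here is a minimal sketch of such a length-carrying string. The helper names my_string_init and my_string_free are hypothetical, invented for this illustration; they are not part of any library, and this is not how std::string is actually implemented.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct my_string
{
    size_t length;  /* number of bytes in data, embedded nulls included */
    char *data;     /* raw bytes; not necessarily null-terminated */
};

/* Hypothetical constructor: copies len bytes from bytes into a new buffer.
   Returns 0 on success, -1 if the allocation fails. */
int my_string_init(struct my_string *out, const char *bytes, size_t len)
{
    assert(out != NULL && (bytes != NULL || len == 0));
    out->data = malloc(len ? len : 1);  /* avoid a zero-size allocation */
    if (out->data == NULL)
        return -1;
    if (len)
        memcpy(out->data, bytes, len);
    out->length = len;
    return 0;
}

/* Hypothetical destructor: releases the buffer and resets the fields. */
void my_string_free(struct my_string *s)
{
    assert(s != NULL);
    free(s->data);
    s->data = NULL;
    s->length = 0;
}
```

Initialising one of these from the three bytes "a\0b" with len 3 keeps all three bytes and records length 3, whereas strlen("a\0b") stops at the embedded null and reports 1.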

For null-terminated strings, however, strlen() will compute their size in bytes, not characters. (There are other functions for counting characters.) strlen just counts the number of bytes before it sees a 0 byte, so the terminator itself is not included -- nothing fancy.
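The difference between bytes and characters can be sketched in portable C. The helper utf8_char_count below is hypothetical (written for this example, not a standard function), and it assumes the input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "thé" spelled out byte by byte, so the encoding of the source file
   does not matter: 74 68 c3 a9 00 */
static const char the_utf8[] = "th\xc3\xa9";

/* Hypothetical helper: counts UTF-8 code points in a null-terminated
   string by skipping continuation bytes (bit pattern 10xxxxxx).
   Assumes valid UTF-8; no validation is performed. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

With this buffer, strlen(the_utf8) gives 4 (bytes before the terminator), sizeof(the_utf8) gives 5 (terminator included), and utf8_char_count(the_utf8) gives 3.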

Now, "wide" or "unicode" strings in the world of MS refer to UTF-16 strings. They have similar problems in that the number of bytes != the number of characters. (Also: the number of bytes / 2 != the number of characters, because characters outside the Basic Multilingual Plane take two 16-bit units.) Let's look at thé again:

text:   t      h      é      \0
shorts: 0x0074 0x0068 0x00e9 0x0000
mem:    74 00  68 00  e9 00  00 00

That's "thé" in UTF-16, stored in little endian (which is what your typical desktop is). Notice all the 00 bytes -- these trip up strlen. Thus, we call wcslen, which looks at it as 2-byte shorts, not single bytes.
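A rough, portable sketch of why strlen fails here and what a 16-bit-unit count looks like. The helper utf16le_unit_count is hypothetical; it stands in for what wcslen does on Windows, where wchar_t is 16 bits (on many other platforms wchar_t is 32 bits, which is why the bytes are written out explicitly instead of using wchar_t).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "thé" in little-endian UTF-16, including the 2-byte terminator. */
static const unsigned char the_utf16le[] = {
    0x74, 0x00, 0x68, 0x00, 0xe9, 0x00, 0x00, 0x00
};

/* U+1F600, a character outside the BMP: one character, but two
   16-bit units (the surrogate pair D83D DE00), then the terminator. */
static const unsigned char emoji_utf16le[] = {
    0x3d, 0xd8, 0x00, 0xde, 0x00, 0x00
};

/* Hypothetical helper: counts 16-bit units before the 0x0000
   terminator in a little-endian UTF-16 buffer. */
size_t utf16le_unit_count(const unsigned char *mem)
{
    size_t units = 0;
    assert(mem != NULL);
    while (mem[2 * units] != 0 || mem[2 * units + 1] != 0)
        ++units;
    return units;
}
```

On the first buffer, strlen((const char *)the_utf16le) stops at the very first 00 byte and returns 1, while utf16le_unit_count returns 3, giving a byte size of 3 * 2 = 6 without the terminator. On the second buffer the unit count is 2 even though there is only one character.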

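Lastly, you have TCHARs, which are one of the above two cases, depending on whether UNICODE is defined (for the tchar.h names such as _tcslen the macro is _UNICODE; the two are normally defined together). _tcslen will be the appropriate function (either strlen or wcslen), and TCHAR will be either char or wchar_t. Either way the result counts TCHAR units, not bytes, so multiply by sizeof(TCHAR) to get the byte size. TCHAR was created to ease the move to UTF-16 in the Windows world.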
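The mapping can be imitated in portable C to show how the byte-size formula works in both modes. Everything prefixed demo_ or DEMO_ below is a hypothetical stand-in invented for this sketch; the real names (TCHAR, _T, _tcslen) come from tchar.h and are only available with Microsoft's toolchain.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Stand-ins for the tchar.h mappings. Define DEMO_UNICODE to get the
   wide variant, analogous to defining _UNICODE for the real header. */
#ifdef DEMO_UNICODE
typedef wchar_t demo_tchar;
#define DEMO_T(s) L##s
#define demo_tcslen wcslen
#else
typedef char demo_tchar;
#define DEMO_T(s) s
#define demo_tcslen strlen
#endif

/* Byte size of the string contents, terminator excluded. */
size_t demo_byte_size(const demo_tchar *s)
{
    assert(s != NULL);
    return demo_tcslen(s) * sizeof(demo_tchar);
}

/* Byte size including the terminator, e.g. for sizing a copy buffer. */
size_t demo_byte_size_with_nul(const demo_tchar *s)
{
    assert(s != NULL);
    return (demo_tcslen(s) + 1) * sizeof(demo_tchar);
}
```

For DEMO_T("TCHAR string"), which has 12 units, demo_byte_size returns 12 * sizeof(demo_tchar) and demo_byte_size_with_nul returns 13 * sizeof(demo_tchar), whichever mode is compiled. With the real header the equivalent expression is the one from the question: _tcslen(s) * sizeof(TCHAR).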