Unicode string literals

rubenvb · Oct 3, 2011

C++11 introduces a new set of string literal prefixes (and even allows user-defined suffixes). On top of this, you can directly use Unicode escape sequences to code a certain symbol without having to worry about encoding.

const char16_t* s16 = u"\u00DA";
const char32_t* s32 = U"\u00DA";

But can I use Unicode escape sequences in wchar_t string literals as well? It would seem to be a defect if this weren't possible.

const wchar_t* sw = L"\u00DA";

The integer value of sw[0] would of course depend on what wchar_t is on a particular platform, but in all other respects this should be portable, no?

Answer

Kerrek SB · Oct 3, 2011

It would work, but it may not have the desired semantics. \u00DA will expand into as many target characters as necessary for UTF-8/16/32 encoding, depending on the size of wchar_t. Bear in mind, though, that wide strings do not have any documented, guaranteed encoding semantics: they're simply in "the system's encoding", with no attempt made to say what that is or to require the user to know what that is.

So it's best not to mix and match. Use one of the following two, but not both:

  1. System-specific: char*/"", wchar_t*/L"", \x escapes, mbstowcs/wcstombs.

  2. Unicode: char*/u8"", char16_t*/u"", char32_t*/U"", \u/\U escapes.
