UTF-8 Compatibility in C++

Qman · Aug 20, 2012

I am writing a program that needs to be able to work with text in all languages. My understanding is that UTF-8 will do the job, but I am experiencing a few problems with it.

Am I right to say that UTF-8 can be stored in a simple char in C++? If so, why do I get the following warning when I compile a program that uses char, string and stringstream: warning C4566: character represented by universal-character-name '\uFFFD' cannot be represented in the current code page (1252). (I do not get that warning when I use wchar_t, wstring and wstringstream.)

Additionally, I know that UTF-8 is a variable-length encoding. When I use the at or substr string methods, would I get the wrong answer?

Answer

R. Martinho Fernandes · Aug 20, 2012

To use UTF-8 string literals you need to prefix them with u8; otherwise you get the implementation's execution character set (in your case, it seems to be Windows-1252). u8"\uFFFD" is a null-terminated sequence of bytes with the UTF-8 representation of the replacement character (U+FFFD). It has type char const[4] (char8_t const[4] since C++20).
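
As a minimal sketch of the point above, the following helper collects the bytes of u8"\uFFFD" so you can inspect them. The name replacement_char_bytes is made up for illustration, and the code is written to compile both before and after C++20, where the element type of u8 literals changed from char to char8_t.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the UTF-8 code units of U+FFFD,
// excluding the terminating null.
std::vector<unsigned char> replacement_char_bytes() {
    // Deduced as const char[4] before C++20, const char8_t[4] from C++20 on.
    const auto& lit = u8"\uFFFD";
    static_assert(sizeof(lit) == 4,
                  "three UTF-8 code units plus the null terminator");

    std::vector<unsigned char> bytes;
    for (std::size_t i = 0; i + 1 < sizeof(lit); ++i) {
        bytes.push_back(static_cast<unsigned char>(lit[i]));
    }
    return bytes;
}
```

The returned vector holds the three bytes 0xEF, 0xBF, 0xBD, which is the UTF-8 encoding of U+FFFD regardless of the compiler's code page.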

Since UTF-8 is a variable-length encoding, every kind of indexing (including at and substr) works on code units, not code points. Random access to code points is not possible in a UTF-8 sequence because of its variable-length nature. If you want random access you need to use a fixed-length encoding, like UTF-32. For that you can use the U prefix on string literals, which gives you char32_t code units.
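
To illustrate the difference between code units and code points, here is a rough sketch. The helper names (count_code_points, sample_utf8, sample_utf32) are invented for this example, and the counter assumes its input is already valid UTF-8; it does no validation. The sample text is "a", "é" (U+00E9) and "€" (U+20AC), with the UTF-8 version spelled as explicit bytes so it does not depend on the compiler's code page.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points by skipping continuation bytes
// (bytes of the form 10xxxxxx). Assumes the input is valid UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) {
            ++n;  // lead byte or ASCII byte starts a new code point
        }
    }
    return n;
}

// "a", U+00E9, U+20AC as UTF-8: 1 + 2 + 3 = 6 code units.
std::string sample_utf8() {
    return "a\xC3\xA9\xE2\x82\xAC";
}

// The same text as UTF-32: one code unit per code point.
std::u32string sample_utf32() {
    return U"a\u00E9\u20AC";
}
```

With these, sample_utf8().size() is 6 while count_code_points(sample_utf8()) is 3. sample_utf8().at(1) yields the byte 0xC3, which is only the first half of "é", and substr(0, 2) would cut that character in two and leave invalid UTF-8. In the UTF-32 version, sample_utf32()[2] is the whole code point U+20AC.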