I am writing a program that needs to be able to work with text in all languages. My understanding is that UTF-8 will do the job, but I am experiencing a few problems with it.
Am I right to say that UTF-8 can be stored in a simple `char` in C++? If so, why do I get the following warning when I use a program with `char`, `string` and `stringstream`: `warning C4566: character represented by universal-character-name '\uFFFD' cannot be represented in the current code page (1252)`. (I do not get that warning when I use `wchar_t`, `wstring` and `wstringstream`.)
Additionally, I know that UTF-8 is variable length. If I use the `at` or `substr` string methods, would I get the wrong answer?
To use UTF-8 string literals you need to prefix them with `u8`; otherwise you get the implementation's execution character set (in your case, it seems to be Windows-1252). `u8"\uFFFD"` is a null-terminated sequence of bytes holding the UTF-8 representation of the replacement character (U+FFFD). It has type `char const[4]` in C++11 through C++17 (since C++20 the type is `char8_t const[4]`).
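As a rough illustration of the literal described above, the following sketch inspects the bytes of `u8"\uFFFD"`. The function names `replacement_char_utf8` and `check_replacement_char` are made up for this example, and the `reinterpret_cast` is only there so the same code compiles both before and after C++20.

```cpp
#include <cassert>
#include <string>

// Returns the UTF-8 bytes of U+FFFD as a std::string (hypothetical helper).
// Before C++20 the literal is already an array of char, so the cast is a no-op;
// from C++20 on it converts from char8_t to char.
std::string replacement_char_utf8() {
    return std::string(reinterpret_cast<const char*>(u8"\uFFFD"));
}

// Three code units plus the terminating null.
static_assert(sizeof(u8"\uFFFD") == 4, "U+FFFD takes 3 bytes in UTF-8");

// Hypothetical self-check: U+FFFD is encoded as EF BF BD.
void check_replacement_char() {
    const std::string s = replacement_char_utf8();
    assert(s.size() == 3);
    assert(static_cast<unsigned char>(s[0]) == 0xEF);
    assert(static_cast<unsigned char>(s[1]) == 0xBF);
    assert(static_cast<unsigned char>(s[2]) == 0xBD);
}
```

Calling `check_replacement_char()` from `main` should pass silently regardless of the code page the compiler uses for unprefixed literals.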
Since UTF-8 is a variable-length encoding, all indexing operations, including `at` and `substr`, work in code units (bytes), not code points. Random access to code points in a UTF-8 sequence is not possible because of its variable-length nature. If you want random access you need to use a fixed-length encoding, like UTF-32. For that you can use the `U` prefix on string literals.
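To make the difference between code units and code points concrete, here is a minimal sketch. The names `count_code_points` and `demo_indexing` are invented for this example, the counting function assumes its input is valid UTF-8, and the UTF-8 bytes are written with explicit hex escapes so the result does not depend on the source or execution character set.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points by skipping continuation bytes (bit pattern 10xxxxxx).
// Assumes the input is valid UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Hypothetical demo of how indexing behaves in UTF-8 versus UTF-32.
void demo_indexing() {
    // "naïve" with explicit UTF-8 bytes: ï (U+00EF) is encoded as C3 AF.
    const std::string utf8 = "na\xC3\xAFve";
    assert(utf8.size() == 6);              // six code units (bytes)
    assert(count_code_points(utf8) == 5);  // five code points

    // at(2) yields only the lead byte of ï, not the whole character.
    assert(static_cast<unsigned char>(utf8.at(2)) == 0xC3);
    // substr(0, 3) cuts ï in half and leaves a truncated sequence.
    assert(utf8.substr(0, 3) == "na\xC3");

    // UTF-32: one code unit per code point, so indexing lands on ï directly.
    const std::u32string utf32 = U"na\u00EFve";
    assert(utf32.size() == 5);
    assert(utf32.at(2) == U'\u00EF');
}
```

So `at` and `substr` do not misbehave on a UTF-8 `std::string`; they simply operate on bytes, which means a position or length chosen without regard to character boundaries can split a multi-byte character.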