Honestly, I just don't get the following design decision in the C++ Standard Library. When writing wide characters to a file, wofstream converts wchar_t into char characters:
    #include <fstream>
    #include <string>

    int main()
    {
        using namespace std;
        wstring someString = L"Hello StackOverflow!";
        // Narrow file name: the wide-character overload is not available everywhere.
        wofstream file("Test.txt");
        file << someString; // the output file will consist of ASCII characters!
    }
I am aware that this has to do with the standard codecvt. There is a codecvt for UTF-8 in Boost. Also, there is a codecvt for UTF-16 by Martin York here on SO. The question is: why does the standard codecvt convert wide characters? Why not write the characters as they are?
Also, are we gonna get real unicode streams with C++0x or am I missing something here?
A very partial answer for the first question: a file is a sequence of bytes, so when dealing with wchar_t's, at least some conversion between wchar_t and char must occur. Making this conversion "intelligently" requires knowledge of the character encodings, which is why the conversion is allowed to be locale-dependent, by virtue of using a facet in the stream's locale.
Then, the question is how that conversion should be made in the only locale required by the standard: the "classic" one. There is no "right" answer for that, and the standard is thus very vague about it. I understand from your question that you assume that blindly casting (or memcpy()-ing) between wchar_t[] and char[] would have been a good way. This is not unreasonable, and is in fact what is (or at least was) done in some implementations.
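To make the "blind memcpy()" option concrete, here is a minimal sketch of what it amounts to: dumping the in-memory bytes of the wide string through a narrow binary stream, bypassing codecvt entirely. The helper names write_raw_wide and file_size are made up for this illustration, and the resulting file depends on the platform's sizeof(wchar_t) and byte order, so it is not portable.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: writes the raw in-memory bytes of a wide string,
// with no encoding conversion at all. The file layout therefore depends on
// sizeof(wchar_t) and on the machine's byte order.
bool write_raw_wide(const std::string& path, const std::wstring& text)
{
    std::ofstream file(path, std::ios::binary);
    if (!file)
        return false;
    file.write(reinterpret_cast<const char*>(text.data()),
               static_cast<std::streamsize>(text.size() * sizeof(wchar_t)));
    file.flush();
    return file.good();
}

// Hypothetical helper: returns the size of a file in bytes (0 if it cannot be opened).
std::size_t file_size(const std::string& path)
{
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    return file ? static_cast<std::size_t>(file.tellg()) : 0;
}
```

With this approach every character takes sizeof(wchar_t) bytes on disk, whatever its value.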
Another POV would be that, since a codecvt is a locale facet, it is reasonable to expect that the conversion is made using the "locale's encoding" (I'm handwavy here, as the concept is pretty fuzzy). For example, one would expect a Turkish locale to use ISO-8859-9, or a Japanese one to use Shift JIS. By analogy, the "classic" locale would convert to its own "locale's encoding". Apparently, Microsoft chose to simply truncate (which leads to ISO-8859-1 if we assume that wchar_t represents UTF-16 and that we stay in the Basic Multilingual Plane), while the Linux implementation I know about decided to stick to ASCII.
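If you want to see what your own implementation does, you can call the facet directly instead of going through a file. The following is only a sketch: narrow_with_locale is a name I made up, and what comes back for non-ASCII input (truncated bytes, an error result, something else) is exactly the implementation-dependent part discussed above.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper: converts a wide string with the
// codecvt<wchar_t, char, mbstate_t> facet of the given locale, i.e. the same
// kind of facet a wofstream imbued with that locale would use.
// The converted bytes are stored in `bytes`; the codecvt result code is returned.
std::codecvt_base::result narrow_with_locale(const std::locale& loc,
                                             const std::wstring& wide,
                                             std::string& bytes)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
    const Facet& facet = std::use_facet<Facet>(loc);

    bytes.clear();
    if (wide.empty())
        return std::codecvt_base::ok;

    // Worst case: max_length() external chars per wide character.
    std::string buffer(wide.size() * static_cast<std::size_t>(facet.max_length()), '\0');

    std::mbstate_t state = std::mbstate_t();
    const wchar_t* from = wide.data();
    const wchar_t* from_next = from;
    char* to = &buffer[0];
    char* to_next = to;

    const std::codecvt_base::result res =
        facet.out(state, from, from + wide.size(), from_next,
                  to, to + buffer.size(), to_next);

    // Keep only the bytes the facet actually produced.
    bytes.assign(to, to_next);
    return res;
}
```

Calling it with std::locale::classic() and a pure ASCII string should give back the same characters; feeding it something like L"\u00e9" is the interesting experiment, and I would not rely on any particular outcome there.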
For your second question:
Also, are we gonna get real unicode streams with C++0x or am I missing something here?
In the [locale.codecvt] section of n2857 (the latest C++0x draft I have at hand), one can read:
The specialization codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and UTF-8 encodings schemes, and the specialization codecvt <char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encodings schemes. codecvt<wchar_t,char,mbstate_t> converts between the native character sets for narrow and wide characters.
In the [locale.stdcvt] section, we find:
For the facet codecvt_utf8: The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program. [...]
For the facet codecvt_utf16: The facet shall convert between UTF-16 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program. [...]
For the facet codecvt_utf8_utf16: The facet shall convert between UTF-8 multibyte sequences and UTF-16 (one or two 16-bit codes) within the program.
So I guess that this means "yes", but you'd have to be more precise about what you mean by "real unicode streams" to be sure.
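For what it's worth, here is a sketch of how one of these facets could be plugged into a wofstream, assuming a compiler that ships the <codecvt> header as it was eventually standardized in C++11 (these facets were later deprecated in C++17, so expect warnings on recent compilers). The helper name write_utf8 is made up for the example.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: writes `text` to `path` as UTF-8 by imbuing the
// stream with std::codecvt_utf8 before the file is opened.
bool write_utf8(const std::string& path, const std::wstring& text)
{
    std::wofstream file;
    // The locale takes ownership of the facet and deletes it when no longer used.
    file.imbue(std::locale(std::locale::classic(),
                           new std::codecvt_utf8<wchar_t>));
    // Binary mode avoids newline translation on platforms that perform it.
    file.open(path, std::ios::binary);
    if (!file)
        return false;
    file << text;
    file.flush();
    return file.good();
}
```

The stream still takes wchar_t on the program side; only the bytes that reach the file change, because the conversion is done by whatever codecvt facet sits in the stream's locale.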