I'm trying to write a class that reads and writes files. For strings there are two variants: ANSI and Unicode. The ANSI functions work fine, but something is wrong with my Unicode ones.
It's a bit weird that I can read Unicode files just fine, directly I mean, without checking for or skipping the "0xFEFF" byte order mark. It works no matter what language the text is in (I tried English, Chinese and Japanese). Is there anything I should know about this?
Then the biggest problem showed up: writing Unicode strings to a file. First I tried plain English letters without the '\n' character, and that worked great. Then I added '\n' and things started going wrong: the output has spaces inserted everywhere, like "a b c d e f g \n h i j k l m n \n o p q r s t \n u v w x y z " (the '\n' works, but there are all those spaces), and the file is ANSI again. Don't even ask about characters in other languages; I can't read them at all.
So here is the question: what should I do to write Unicode strings to a file correctly, and how? Please don't suggest the "_wopen" function; the file is already opened with "fopen".
Answers and advice would be much appreciated.
I'm using Windows 7 and Visual Studio.
Edit: with the following code it works for non-English characters, but it is still wrong with '\n':
const char* cStart = "\xff\xfe";   // the UTF-16LE byte order mark
if (::ftell(m_pFile) == 0)
    ::fwrite(cStart, sizeof(wchar_t), 1, m_pFile);   // sizeof(wchar_t) is 2 on Windows, so this writes the two BOM bytes
But how does that work? I mean, I didn't see it while I was reading the file.
Edit: part of my code.
void File::ReadWText(wchar_t* pString, uint32 uLength)
{
    wchar_t cLetter = L'\0';
    uint32 uIndex = 0;
    do {
        cLetter = L'\0';
        ::fread(&cLetter, sizeof(wchar_t), 1, m_pFile);
        pString[uIndex] = cLetter;
    } while (cLetter != L'\0' && !::feof(m_pFile) && uIndex++ < uLength);
    pString[uIndex] = L'\0';
}
void File::WriteWText(const wchar_t* pString, uint32 uLength)
{
    const char* pStart = "\xff\xfe";   // UTF-16LE byte order mark
    if (::ftell(m_pFile) == 0)
        ::fwrite(pStart, sizeof(wchar_t), 1, m_pFile);
    m_uSize += sizeof(wchar_t) * ::fwrite(pString, sizeof(wchar_t), uLength, m_pFile);
}
int main()
{
    ::File* pFile = new File();
    const wchar_t* pWString = L"abcdefg\nhijklmn\nopqrst\nuvwxyz";
    pFile->Open("TextW.txt", File::Output);
    // fopen("TextW.txt", "w");
    pFile->WriteWText(pWString, ::wcslen(pWString));
    pFile->Close();
}
The output file's content is: "abcdefg栀椀樀欀氀洀渀ഀopqrst甀瘀眀砀礀稀", and the file is in Unicode.
I don't know whether "L'\n'" is the right way to express the newline; I've never worked with Unicode before. Thanks for helping me :)
I just noticed that this question is tagged both C and C++: the discussion below covers the situation in C++. It entirely ignores the use of FILE*, and I don't know how to deal with different encodings using FILE*.
When reading or writing a file you need to tell the system what the encoding of the file is, so that it can convert the bytes in the file into the characters used internally by the program when reading, and convert characters back to bytes when writing. In many cases this conversion goes entirely unnoticed because the conversion from bytes to characters is the identity: the bytes can be interpreted as characters and vice versa. This is true when the external encoding is ASCII (I assume this is what is referred to as "ANSI" in your question).
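As a minimal sketch of that default behavior (the file name is made up), a std::wofstream with the classic locale narrows each wchar_t to a single byte on output, i.e. the identity conversion for ASCII text:
#include <fstream>

int main()
{
    // With the default ("C") locale the stream's codecvt facet narrows
    // each wchar_t to one byte: the identity conversion for ASCII text.
    std::wofstream out("ascii.txt");
    out << L"abcdefg\n";   // the file contains plain one-byte characters
}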
Pretending that UTF-8 encoded files use the identity transformation from bytes to characters works to some extent. The original vision of the internal character representation in C++ was to have one unit per character, e.g. a char or a wchar_t. Although Unicode set out with goals that would have worked nicely with this (e.g. each character is represented by one unit, and the unit size is 16 bits), those goals ended up being sacrificed, and we now have a system where one character (well, I think they are actually called "code points", but I'm not a Unicode expert) can consist of multiple units (e.g. when combining characters are used).
In any case, as long as individual units don't get mutated without paying attention to the characters they form, it is generally possible to process UTF-8 as a sequence of char (e.g. as a std::string) and UTF-16 as a sequence of wchar_t (e.g. as a std::wstring). However, when reading something other than UTF-8 (or ASCII, which is a subset of UTF-8) you need to be careful to set up the stream so that it knows which encoding is used.
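To make the unit-versus-code-point distinction above concrete, here is a small sketch (the word is just an example): the same six-code-point text takes eight char units as UTF-8 but six wchar_t units as UTF-16 on Windows:
#include <string>

int main()
{
    // "grüßen" as UTF-8 in a std::string: 6 code points but 8 bytes,
    // because ü (U+00FC) and ß (U+00DF) take two bytes each.
    std::string utf8 = "gr\xC3\xBC\xC3\x9F" "en";    // utf8.size() == 8
    // The same text as UTF-16 in a std::wstring: one 16-bit unit per
    // code point here (characters outside the BMP would need two).
    std::wstring utf16 = L"gr\u00FC\u00DFen";        // utf16.size() == 6
}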
The standard way to set up a file stream for a specific encoding is to create a suitable std::locale containing a corresponding std::codecvt<...> facet that converts between the external bytes and the internal characters using that encoding. How to actually obtain such a std::locale is up to the individual implementation. The default conversion is meant to pretend that the program uses an extension of ASCII covering all values of char; when reading and writing UTF-8 this should just work.
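For your file, which is UTF-16 rather than UTF-8, here is a sketch of such a set-up, assuming C++11's <codecvt> header (available in Visual Studio; std::codecvt_utf16 was later deprecated in C++17). Note the std::ios::binary flag: without it the byte layer translates the 0x0A byte of L'\n' into 0x0D 0x0A, shifting every following 16-bit unit by one byte, which is exactly the kind of corruption your output shows:
#include <fstream>
#include <locale>
#include <codecvt>

int main()
{
    // Binary mode: stops '\n' from being expanded to "\r\n" at the byte
    // level, which would misalign the 16-bit units of the UTF-16 output.
    std::wofstream out("TextW.txt", std::ios::binary);
    // codecvt_utf16 converts wchar_t <-> UTF-16 bytes; generate_header
    // emits the 0xFF 0xFE byte order mark at the start of the file.
    out.imbue(std::locale(out.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff,
            std::codecvt_mode(std::little_endian | std::generate_header)>));
    out << L"abcdefg\nhijklmn\nopqrst\nuvwxyz";
}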
I'm not sure what you mean by "writing Unicode strings", but from the looks of it you are writing a std::wstring without setting up an encoding.
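For completeness, here is the reading direction under the same assumptions; std::consume_header makes the facet read and skip the byte order mark, which is also why a correctly set-up stream never shows you the "0xFEFF" you were wondering about:
#include <fstream>
#include <locale>
#include <codecvt>
#include <string>

int main()
{
    // Binary again, so no byte-level newline translation interferes.
    std::wifstream in("TextW.txt", std::ios::binary);
    // consume_header reads the BOM and skips it for us.
    in.imbue(std::locale(in.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff,
            std::codecvt_mode(std::little_endian | std::consume_header)>));
    std::wstring line;
    while (std::getline(in, line))
    {
        // each line of the UTF-16 file arrives as a std::wstring
    }
}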