I'm using FreeType2 in one of my projects. In order to render a letter, I need to provide a Unicode two-byte character code. The char codes a program reads are in ASCII one-byte format though. It poses no problem for char codes below 128 (the character codes are the same), but the other 128 do not match. For instance:
'a' in ASCII is 0x61, 'a' in Unicode is 0x0061 - that's fine
'ą' in ASCII is 0xB9, 'ą' in Unicode is 0x0105 - completely different
I was trying to use WinAPI functions there, but I must be doing something wrong. Here's a sample:
unsigned char szTest1[] = "ąółź"; //ASCII format
wchar_t* wszTest2;
int size = MultiByteToWideChar(CP_UTF8, 0, (char*)szTest1, 4, NULL, 0);
printf("size = %d\n", size);
wszTest2 = new wchar_t[size];
MultiByteToWideChar(CP_UTF8, 0, (char*)szTest1, 4, wszTest2, size);
printf("HEX: %x\n", wszTest2[0]);
delete[] wszTest2;
I'm expecting a new wide string to be created, with no NULL at the end. However, the size variable always equals 0. Any idea what I'm doing wrong? Or maybe there's an easier way to solve the problem?
The "pure" ASCII character set is restricted to the range 0-127 (7 bits). The 8-bit characters with the most significant bit set (i.e. those in the range 128-255) are not uniquely defined: their meaning depends on the code page.
So, your character ą (LATIN SMALL LETTER A WITH OGONEK) is represented by the value 0xB9 in a particular code page, which should be Windows-1250. In other code pages, the value 0xB9 is associated with a different character (for example, in the Windows-1252 code page, 0xB9 is the character ¹, i.e. a superscript digit 1).
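To make the code-page dependence concrete, here is a minimal sketch that decodes the same byte under two code pages. The names CodePage, DecodeByte and CheckExamples are made up for illustration (they are not part of any Windows API), and the tables cover only a handful of bytes, not the full code pages.

```cpp
#include <cassert>

// Hypothetical names for illustration only; not part of any Windows API.
enum class CodePage { Windows1250, Windows1252 };

// Maps one byte to a Unicode code point under the given code page.
// Only a few bytes are covered; any other byte >= 0x80 yields
// 0xFFFD (REPLACEMENT CHARACTER) in this sketch.
char32_t DecodeByte(unsigned char byte, CodePage codePage)
{
    if (byte < 0x80)
        return byte; // ASCII range: identical in both code pages

    if (codePage == CodePage::Windows1250)
    {
        switch (byte)
        {
        case 0xB9: return 0x0105; // ą
        case 0xF3: return 0x00F3; // ó
        case 0xB3: return 0x0142; // ł
        case 0x9F: return 0x017A; // ź
        default: break;
        }
    }
    else
    {
        switch (byte)
        {
        case 0xB9: return 0x00B9; // ¹
        case 0xF3: return 0x00F3; // ó
        case 0xB3: return 0x00B3; // ³
        case 0x9F: return 0x0178; // Ÿ
        default: break;
        }
    }
    return 0xFFFD;
}

// Quick self-check of the cases mentioned in the text.
void CheckExamples()
{
    assert(DecodeByte(0xB9, CodePage::Windows1250) == 0x0105);
    assert(DecodeByte(0xB9, CodePage::Windows1252) == 0x00B9);
    assert(DecodeByte('a', CodePage::Windows1250) == 0x0061);
}
```

The same byte 0xB9 comes out as 0x0105 or 0x00B9 depending only on the code page argument, which is why the conversion API has to be told which code page the input uses.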
To convert characters from a particular code page to Unicode UTF-16 using the Win32 API, you can use MultiByteToWideChar, specifying the correct code page. That is not CP_UTF8, as written in the code in your question: CP_UTF8 tells the function to interpret the input as Unicode UTF-8, and a lone byte such as 0xB9 is not a valid UTF-8 sequence. You may want to try specifying 1250 (ANSI Central European; Central European (Windows)) as the code page identifier.
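Applied to the snippet in the question, the change could look roughly like this. It is a sketch rather than a drop-in fix: the helper name Cp1250ToUtf16 is made up for illustration, it assumes the input bytes really are Windows-1250, and error handling is reduced to returning an empty string. The code is Windows-only, hence the _WIN32 guard.

```cpp
#ifdef _WIN32
#include <Windows.h>
#include <string>

// Hypothetical helper (not from the original post): converts byteCount
// bytes of Windows-1250 text to UTF-16. Returns an empty string on
// failure; call GetLastError() right after for the reason.
std::wstring Cp1250ToUtf16(const char* bytes, int byteCount)
{
    const UINT codePage = 1250; // ANSI Central European

    if (bytes == NULL || byteCount <= 0)
        return std::wstring();

    // First call: ask how many wchar_t's the result needs.
    const int wideCount = ::MultiByteToWideChar(
        codePage, 0, bytes, byteCount, NULL, 0);
    if (wideCount == 0)
        return std::wstring();

    // Second call: convert into a buffer of exactly that size.
    // No terminating NUL is written, because byteCount excludes it.
    std::wstring wide(wideCount, L'\0');
    if (::MultiByteToWideChar(
            codePage, 0, bytes, byteCount, &wide[0], wideCount) == 0)
        return std::wstring();

    return wide;
}
#endif
```

Writing the input as byte escapes, e.g. Cp1250ToUtf16("\xB9\xF3\xB3\x9F", 4), keeps the example independent of the encoding the source file is saved in; the first element of the result should then be 0x0105.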
If you have access to ATL in your code, you can use the ATL string conversion helper classes like CA2W, which wraps the MultiByteToWideChar() call and the memory allocation in a RAII class; e.g.:
#include <atlconv.h> // ATL String Conversion Helpers
// 'test' is a Unicode UTF-16 string.
// Conversion is done from code-page 1250
// (ANSI Central European; Central European (Windows))
CA2W test("ąółź", 1250);
Now you should be able to use the test string in your Unicode APIs.
If you don't have access to ATL or want a C++ STL-based solution, you may want to consider some code like this:
///////////////////////////////////////////////////////////////////////////////
//
// Modern STL-based C++ wrapper to Win32's MultiByteToWideChar() C API.
//
// (based on http://code.msdn.microsoft.com/windowsdesktop/C-UTF-8-Conversion-Helpers-22c0a664)
//
///////////////////////////////////////////////////////////////////////////////
#include <exception> // for std::exception
#include <iostream> // for std::cout
#include <ostream> // for std::endl
#include <stdexcept> // for std::runtime_error
#include <string> // for std::string and std::wstring
#include <Windows.h> // Win32 Platform SDK
//-----------------------------------------------------------------------------
// Define an exception class for string conversion error.
//-----------------------------------------------------------------------------
class StringConversionException
    : public std::runtime_error
{
public:
    // Creates exception with error message and error code.
    StringConversionException(const char* message, DWORD error)
        : std::runtime_error(message)
        , m_error(error)
    {}
    // Creates exception with error message and error code.
    StringConversionException(const std::string& message, DWORD error)
        : std::runtime_error(message)
        , m_error(error)
    {}
    // Windows error code.
    DWORD Error() const
    {
        return m_error;
    }
private:
    DWORD m_error;
};
//-----------------------------------------------------------------------------
// Converts an ANSI/MBCS string to Unicode UTF-16.
// Wraps MultiByteToWideChar() using modern C++ and STL.
// Throws a StringConversionException on error.
//-----------------------------------------------------------------------------
std::wstring ConvertToUTF16(const std::string & source, const UINT codePage)
{
    // An empty source needs no conversion
    // (and MultiByteToWideChar() fails on a zero-length input)
    if (source.empty())
    {
        return std::wstring();
    }
    // Fail if an invalid input character is encountered
    static const DWORD conversionFlags = MB_ERR_INVALID_CHARS;
    // MultiByteToWideChar() takes lengths as int
    // (this assumes the source is shorter than INT_MAX chars)
    const int sourceLength = static_cast<int>(source.length());
    // Request size for destination string
    const int utf16Length = ::MultiByteToWideChar(
        codePage,           // code page for the conversion
        conversionFlags,    // flags
        source.c_str(),     // source string
        sourceLength,       // length (in chars) of source string
        NULL,               // unused - no conversion done in this step
        0                   // request size of destination buffer, in wchar_t's
    );
    if (utf16Length == 0)
    {
        const DWORD error = ::GetLastError();
        throw StringConversionException(
            "MultiByteToWideChar() failed: Can't get length of destination UTF-16 string.",
            error);
    }
    // Allocate room for destination string
    std::wstring utf16Text;
    utf16Text.resize(utf16Length);
    // Convert to Unicode UTF-16
    if ( ! ::MultiByteToWideChar(
        codePage,           // code page for conversion
        0,                  // validation was done in previous call
        source.c_str(),     // source string
        sourceLength,       // length (in chars) of source string
        &utf16Text[0],      // destination buffer
        utf16Length         // size of destination buffer, in wchar_t's
    ))
    {
        const DWORD error = ::GetLastError();
        throw StringConversionException(
            "MultiByteToWideChar() failed: Can't convert to UTF-16 string.",
            error);
    }
    return utf16Text;
}
//-----------------------------------------------------------------------------
// Test.
//-----------------------------------------------------------------------------
int main()
{
    // Exit codes
    static const int exitOk = 0;
    static const int exitError = 1;
    try
    {
        // Test input string:
        //
        // ą - LATIN SMALL LETTER A WITH OGONEK
        std::string inText("x - LATIN SMALL LETTER A WITH OGONEK");
        // Replace the placeholder 'x' with byte 0xB9 (ą in code page 1250)
        inText[0] = static_cast<char>(0xB9);
        // ANSI Central European; Central European (Windows) code page
        static const UINT codePage = 1250;
        // Convert to Unicode UTF-16
        const std::wstring utf16Text = ConvertToUTF16(inText, codePage);
        // Verify conversion.
        // ą - LATIN SMALL LETTER A WITH OGONEK
        // --> Unicode UTF-16 0x0105
        // http://www.fileformat.info/info/unicode/char/105/index.htm
        if (utf16Text[0] != 0x0105)
        {
            throw std::runtime_error("Wrong conversion.");
        }
        std::cout << "All right." << std::endl;
    }
    catch (const StringConversionException& e)
    {
        std::cerr << "*** ERROR:\n";
        std::cerr << e.what() << "\n";
        std::cerr << "Error code = " << e.Error();
        std::cerr << std::endl;
        return exitError;
    }
    catch (const std::exception& e)
    {
        std::cerr << "*** ERROR:\n";
        std::cerr << e.what();
        std::cerr << std::endl;
        return exitError;
    }
    return exitOk;
}
///////////////////////////////////////////////////////////////////////////////