Cross-platform strings (and Unicode) in C++

user438380 picture user438380 · Nov 13, 2010 · Viewed 7k times · Source

So I've finally gotten back to my main task - porting a rather large C++ project from Windows to the Mac.

Straight away I've been hit by the problem where wchar_t is 16-bits on Windows but 32-bits on the Mac. This is a problem because all of the strings are represented by wchar_t and there will be string data going back and forth between Windows and Mac machines (in both on-disk data and network data forms). Because of the way in which it works it wouldn't be totally straightforward to convert the strings into some common format before sending and receiving the data.

We've also really started to support a lot more languages recently and so we're starting to deal with a lot of Unicode data (as well as dealing with right-to-left languages).

Now, I could be conflating multiple ideas here and causing more problems for myself than needed which is why I'm asking this question. We're thinking that storing all of our in-memory string data as UTF-8 makes a lot of sense. It solves the wchar_t being different sizes problem, it means we can easily support multiple languages and it also dramatically reduces our memory footprint (we have a LOT of - mostly English - strings loaded) - but it doesn't seem like many people are doing this. Is there something we're missing? There's the obvious problem you have to deal with where string length can be less than the memory size storing that string data.

Or is using UTF-16 a better idea? Or should we stick to wchar_t and write code to convert between wchar_t and, say, Unicode in places where we read/write to the disk or the network?

I realize this is dangerously close to asking for opinions - but we're nervous that we're overlooking something obvious because it doesn't seem like there are many Unicode string classes (for example) - but yet there's plenty of code for converting to/from Unicode like in boost::locale, iconv, utf-cpp and ICU.

Answer

aschepler picture aschepler · Nov 13, 2010

Always use a protocol defined to the byte when a file or network connection is involved. Do not rely on how a C++ compiler stores anything in memory. For Unicode text, this means choosing both an encoding and a byte order (okay, UTF-8 doesn't care about byte order). Even if the platforms you currently want to support have similar architectures, another popular platform with different behavior or even a new OS for one of your existing platforms will likely come along, and you'll be glad you wrote portable code.