QString to unicode std::string

Oleg Andriyanov · Apr 3, 2014

I know there is plenty of information about converting QString to char*, but I still need some clarification on this question.

Qt provides QTextCodec to convert a QString (which internally stores characters in Unicode) to a QByteArray, allowing me to retrieve a char* that represents the string in some non-Unicode encoding. But what should I do when I want to get a Unicode QByteArray?

QTextCodec* codec = QTextCodec::codecForName("UTF-8");
QString qstr = codec->toUnicode("Юникод");
std::string stdstr(reinterpret_cast<const char*>(qstr.constData()), qstr.size() * 2);  // * 2 because each QChar occupies two bytes
qDebug() << QString(reinterpret_cast<const QChar*>(stdstr.c_str()), stdstr.size() / 2); // / 2 for the same reason

The above code prints "Юникод" as I expected. But I'd like to know whether this is the right way to get at the Unicode char* of the QString. In particular, the reinterpret_casts and the size arithmetic in this technique look pretty ugly.

Answer

The following applies to Qt 5. Qt 4's behavior was different and, in practice, broken.

You need to choose:

  1. Whether you want the 8-bit wide std::string or 16-bit wide std::wstring, or some other type.

  2. What encoding is desired in your target string?

Internally, QString stores UTF-16 encoded data, so any Unicode code point may be represented in one or two QChars.
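To make the "one or two QChars" point concrete, here is a minimal sketch in standard C++ (no Qt) of how a single code point maps to UTF-16 code units. The helper name encodeUtf16 is hypothetical and is not a Qt API; it only illustrates the encoding rule.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of Qt): encode one Unicode code point
// as UTF-16. Code points below U+10000 take one 16-bit unit; the rest
// are split into a high and a low surrogate.
std::u16string encodeUtf16(char32_t cp)
{
    // Valid scalar values only: at most U+10FFFF and not a surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        const char32_t v = cp - 0x10000;  // 20 significant bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```

For example, encodeUtf16(0x042E) (the letter "Ю") yields a single code unit, while encodeUtf16(0x1F600) yields the surrogate pair 0xD83D 0xDE00. This is why QString::size() counts 16-bit code units rather than user-visible characters.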

Common cases:

  • Locally encoded 8-bit std::string (as in: system locale):

    std::string(str.toLocal8Bit().constData())
    
  • UTF-8 encoded 8-bit std::string:

    str.toStdString()
    

    This is equivalent to:

    std::string(str.toUtf8().constData())
    
  • UTF-16 or UCS-4 encoded std::wstring, 16 or 32 bits wide, respectively. Qt selects the 16- or 32-bit encoding to match the platform's width of wchar_t.

    str.toStdWString()
    
  • UTF-16 encoded std::u16string or UTF-32 encoded std::u32string (the C++11 string types), from Qt 5.5 onwards:

    str.toStdU16String()
    str.toStdU32String()
    
  • UTF-16 encoded 16-bit std::u16string - this hack is only needed up to Qt 5.4:

    std::u16string(reinterpret_cast<const char16_t*>(str.constData()))
    

    This encoding does not include byte order marks (BOMs).

It's easy to prepend BOMs to the QString itself before converting it:

QString src = ...;
src.prepend(QChar::ByteOrderMark);
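#if QT_VERSION < QT_VERSION_CHECK(5,5,0)
// Parentheses rather than braces: converting the int size to size_t
// inside braces would be a narrowing conversion.
auto dst = std::u16string(reinterpret_cast<const char16_t*>(src.constData()),
                          static_cast<std::size_t>(src.size()));
#else
auto dst = src.toStdU16String();
#endif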

If you expect the strings to be large, you can skip one copy:

const QString src = ...;
std::u16string dst;
dst.reserve(src.size() + 1); // room for the BOM
dst.push_back(char16_t(QChar::ByteOrderMark)); // append() has no single-character overload
dst.append(reinterpret_cast<const char16_t*>(src.constData()),
           src.size()); // std::u16string maintains its own terminator

In both cases, dst is now portable to systems with either endianness.
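To illustrate why the BOM makes the data portable, here is a rough sketch in standard C++ (no Qt) of how a receiver might decode the raw bytes of such a string. The function name decodeUtf16WithBom and the byte-vector input are assumptions made for illustration. The sketch handles only the two UTF-16 byte orders and does not validate surrogate pairs.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decode a BOM-prefixed UTF-16 byte sequence into a
// std::u16string, whatever byte order the sender used. The BOM itself is
// consumed and not included in the result.
std::u16string decodeUtf16WithBom(const std::vector<unsigned char>& bytes)
{
    if (bytes.size() < 2 || bytes.size() % 2 != 0)
        throw std::invalid_argument("expected an even byte count starting with a BOM");

    bool bigEndian;
    if (bytes[0] == 0xFE && bytes[1] == 0xFF)
        bigEndian = true;   // U+FEFF stored high byte first
    else if (bytes[0] == 0xFF && bytes[1] == 0xFE)
        bigEndian = false;  // U+FEFF stored low byte first
    else
        throw std::invalid_argument("missing byte order mark");

    std::u16string out;
    out.reserve(bytes.size() / 2 - 1);
    for (std::size_t i = 2; i < bytes.size(); i += 2) {
        const unsigned hi = bigEndian ? bytes[i] : bytes[i + 1];
        const unsigned lo = bigEndian ? bytes[i + 1] : bytes[i];
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    assert(out.size() == bytes.size() / 2 - 1);  // every unit except the BOM
    return out;
}
```

Fed the bytes FF FE 2E 04 (little-endian) or FE FF 04 2E (big-endian), the function returns the same one-character string U+042E. Without the leading BOM the receiver would have to guess the byte order.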