Convert QString into QByteArray with either UTF-8 or Latin1 encoding

Johan · Mar 13, 2011

I would like to convert a QString into either a UTF-8 or a Latin1 QByteArray, but currently I get everything as UTF-8.

I am testing this with a character from the upper half of Latin1 (above 0x7f); the German ü is a good example.

If I do this:

QString name("\u00fc"); // U+00FC = ü
QByteArray utf8;
utf8.append(name);
qDebug() << "utf8" << name << utf8.toHex();

QByteArray latin1;
latin1.append(name.toLatin1());
qDebug() << "Latin1" << name << latin1.toHex();

QTextCodec *codec = QTextCodec::codecForName("ISO 8859-1");
QByteArray encodedString = codec->fromUnicode(name);
qDebug() << "ISO 8859-1" << name << encodedString.toHex();

I get the following output.

utf8 "ü" "c3bc" 
Latin1 "ü" "c3bc" 
ISO 8859-1 "ü" "c3bc" 

As you can see, I get the UTF-8 bytes 0xc3 0xbc everywhere, whereas I would expect the Latin1 byte 0xfc for steps 2 and 3.

My guess is that I should get something like this:

utf8 "ü" "c3bc" 
Latin1 "ü" "fc" 
ISO 8859-1 "ü" "fc" 
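
The expected bytes can be checked without Qt. The following is a minimal sketch in plain C++; the helper names encodeUtf8, encodeLatin1 and toHex are made up for this illustration and are not Qt API.

```cpp
#include <cassert>
#include <string>

// Encode a single Unicode code point as UTF-8 (1 to 4 bytes).
std::string encodeUtf8(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Latin1 maps code points 0x00..0xFF to the single byte with the same value.
std::string encodeLatin1(char32_t cp) {
    assert(cp <= 0xFF);
    return std::string(1, static_cast<char>(cp));
}

// Lowercase hex dump, similar in spirit to QByteArray::toHex().
std::string toHex(const std::string& bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string hex;
    for (unsigned char b : bytes) {
        hex += digits[b >> 4];
        hex += digits[b & 0x0F];
    }
    return hex;
}
```

For U+00FC, toHex(encodeUtf8(0x00FC)) gives c3bc and toHex(encodeLatin1(0x00FC)) gives fc, which matches the expected output above.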

What is going on here?

/Thanks



This code was built and executed on an Ubuntu 10.04 based system.

$> uname -a
Linux frog 2.6.32-28-generic-pae #55-Ubuntu SMP Mon Jan 10 22:34:08 UTC 2011 i686 GNU/Linux
$> env | grep LANG
LANG=en_US.utf8

And if I try to use

utf8.append(name.toUtf8());

I get this output

utf8 "ü" "c383c2bc" 
Latin1 "ü" "c3bc" 
ISO 8859-1 "ü" "c3bc" 

So the Latin1 output is actually UTF-8, and the UTF-8 output is double encoded...
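
The c383c2bc value is what you would get if the two UTF-8 bytes were first read as two Latin1 characters and then encoded as UTF-8 again. Here is a small Qt-free sketch of that round trip; decodeLatin1, encodeUtf8 and toHex are illustrative names, not Qt functions, and the encoder is deliberately limited to code points below U+0800.

```cpp
#include <cassert>
#include <string>

// Latin1 decoding: every byte becomes the code point with the same value.
std::u32string decodeLatin1(const std::string& bytes) {
    std::u32string out;
    for (unsigned char b : bytes) {
        out += static_cast<char32_t>(b);
    }
    return out;
}

// UTF-8 encoding, restricted to code points below U+0800 to keep this short.
std::string encodeUtf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        assert(cp < 0x800);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Lowercase hex dump of a byte string.
std::string toHex(const std::string& bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string hex;
    for (unsigned char b : bytes) {
        hex += digits[b >> 4];
        hex += digits[b & 0x0F];
    }
    return hex;
}
```

With these helpers, toHex(encodeUtf8(decodeLatin1("\xc3\xbc"))) yields c383c2bc: 0xc3 becomes U+00C3 (encoded as c3 83) and 0xbc becomes U+00BC (encoded as c2 bc).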

This must depend on some system settings?


If I run this (could not get the .name() to build)

qDebug() << "system name:"      << QLocale::system().name();
qDebug() << "codecForCStrings:" << QTextCodec::codecForCStrings();
qDebug() << "codecForLocale:"   << QTextCodec::codecForLocale()->name();

Then I get this:

system name: "en_US" 
codecForCStrings: 0x0 
codecForLocale: "System" 

Solution

If I explicitly tell Qt that I am using UTF-8, so that the different classes know about it, then it works.

QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));

qDebug() << "system name:"      << QLocale::system().name();
qDebug() << "codecForCStrings:" << QTextCodec::codecForCStrings()->name();
qDebug() << "codecForLocale:"   << QTextCodec::codecForLocale()->name();

QString name("\u00fc"); 
QByteArray utf8;
utf8.append(name);
qDebug() << "utf8" << name << utf8.toHex();

QByteArray latin1;
latin1.append(name.toLatin1());
qDebug() << "Latin1" << name << latin1.toHex();

QTextCodec *codec = QTextCodec::codecForName("latin1");
QByteArray encodedString = codec->fromUnicode(name);
qDebug() << "ISO 8859-1" << name << encodedString.toHex();

Then I get this output:

system name: "en_US" 
codecForCStrings: "UTF-8" 
codecForLocale: "UTF-8" 
utf8 "ü" "c3bc" 
Latin1 "ü" "fc" 
ISO 8859-1 "ü" "fc" 

And that looks as it should.

Answer

Piotr Dobrogost · Mar 13, 2011

Things to know:

  • execution character set

The C++ standard defines the execution character set, which determines how string and character literals are encoded in the binary produced by the compiler. You can read about it in subsection 1.1 Character sets of section 1 Overview in The C Preprocessor's manual on the http://gcc.gnu.org site.

Question:
What will the string literal "\u00fc" produce?

Answer:
It depends on what the execution character set is. In the case of gcc (which is what you're using), it is UTF-8 by default unless you specify something different with the -fexec-charset option. You can read about this and other options controlling the preprocessing phase in subsection 3.11 Options Controlling the Preprocessor of section 3 GCC Command Options in GCC's manual on the http://gcc.gnu.org site. Now that we know the execution character set is UTF-8, we know that "\u00fc" will be translated to the UTF-8 encoding of the Unicode code point U+00FC, which is the two-byte sequence 0xc3 0xbc.

The QString constructor taking char * calls QString QString::fromAscii ( const char * str, int size = -1 ), which uses the codec set with void QTextCodec::setCodecForCStrings ( QTextCodec * codec ) if a codec has been set, or otherwise does the same thing as QString QString::fromLatin1 ( const char * str, int size = -1 ).
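
To see which bytes the compiler actually puts into the binary for a narrow string literal, you can dump them yourself. This is only a sketch; hexOfLiteral is a hypothetical helper written for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Return the bytes of a NUL-terminated narrow string as lowercase hex.
std::string hexOfLiteral(const char* literal) {
    assert(literal != nullptr);
    static const char digits[] = "0123456789abcdef";
    std::string hex;
    const std::size_t length = std::strlen(literal);
    for (std::size_t i = 0; i < length; ++i) {
        const unsigned char b = static_cast<unsigned char>(literal[i]);
        hex += digits[b >> 4];
        hex += digits[b & 0x0F];
    }
    return hex;
}
```

hexOfLiteral("\xc3\xbc") returns c3bc regardless of compiler settings, because the bytes are spelled out. hexOfLiteral("\u00fc") should return the same c3bc with gcc's default execution character set, and should return fc if the program is built with -fexec-charset=ISO-8859-1.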

Question:
What codec will be used by QString's constructor to decode the two-byte sequence (0xc3 0xbc) it gets?

Answer:
By default no codec is set with QTextCodec::setCodecForCStrings(), so Latin1 will be used to decode the byte sequence. As 0xc3 and 0xbc are both valid in Latin1, representing Ã and ¼ respectively (this should already be familiar to you, as it was taken directly from this answer to your earlier question), we get a QString with these two characters.

You shouldn't use the QDebug class to output anything outside of ASCII. You have no guarantee of what you will get.
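
The decision described above can be modelled in a few lines of plain C++. This is a rough sketch of the behaviour, not Qt's implementation: CStringCodec and decodeCString are names invented here, and the UTF-8 branch only handles one- and two-byte sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for "was a codec set with setCodecForCStrings()?"
enum class CStringCodec { NotSet, Utf8 };

// Decode a char* style byte string into Unicode code points.
std::u32string decodeCString(const std::string& bytes, CStringCodec codec) {
    std::u32string out;
    if (codec == CStringCodec::NotSet) {
        // No codec set: behave like Latin1, one byte per code point.
        for (unsigned char b : bytes) {
            out += static_cast<char32_t>(b);
        }
        return out;
    }
    // UTF-8 codec: this sketch supports 1- and 2-byte sequences only.
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) {
            out += static_cast<char32_t>(lead);
            continue;
        }
        assert((lead & 0xE0) == 0xC0 && i + 1 < bytes.size());
        const unsigned char cont = static_cast<unsigned char>(bytes[++i]);
        assert((cont & 0xC0) == 0x80);
        out += static_cast<char32_t>(((lead & 0x1F) << 6) | (cont & 0x3F));
    }
    return out;
}
```

decodeCString("\xc3\xbc", CStringCodec::NotSet) gives the two code points U+00C3 U+00BC, while decodeCString("\xc3\xbc", CStringCodec::Utf8) gives the single code point U+00FC. Encoding U+00C3 U+00BC back to Latin1 produces the bytes c3 bc again, which is why toLatin1() and the ISO 8859-1 codec printed c3bc in the question.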

Test program:

#include <QtCore>

void dbg(char const * rawInput, QString s) {

    QString codepoints;
    foreach(QChar chr, s) {
        codepoints.append(QString::number(chr.unicode(), 16)).append(" ");
    }

    qDebug() << "Input: " << rawInput
             << ", "
             << "Unicode codepoints: " << codepoints;
}

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);

    qDebug() << "system name:"
             << QLocale::system().name();

    for (int i = 1; i <= 5; ++i) {

        switch(i) {

        case 1:
            qDebug() << "\nWithout codecForCStrings (default is Latin1)\n";
            break;
        case 2:
            qDebug() << "\nWith codecForCStrings set to UTF-8\n";
            QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));
            break;
        case 3:
            qDebug() << "\nWithout codecForCStrings (default is Latin1), with codecForLocale set to UTF-8\n";
            QTextCodec::setCodecForCStrings(0);
            QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
            break;
        case 4:
            qDebug() << "\nWithout codecForCStrings (default is Latin1), with codecForLocale set to Latin1\n";
            QTextCodec::setCodecForCStrings(0);
            QTextCodec::setCodecForLocale(QTextCodec::codecForName("Latin1"));
            break;
        }

        qDebug() << "codecForCStrings:" << (QTextCodec::codecForCStrings()
                                           ? QTextCodec::codecForCStrings()->name()
                                           : "NOT SET");
        qDebug() << "codecForLocale:"   << (QTextCodec::codecForLocale()
                                           ? QTextCodec::codecForLocale()->name()
                                           : "NOT SET");

        qDebug() << "\n1. Using QString::QString(char const *)";
        dbg("\\u00fc", QString("\u00fc"));
        dbg("\\xc3\\xbc", QString("\xc3\xbc"));
        dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString("ü"));

        qDebug() << "\n2. Using QString::fromUtf8(char const *)";
        dbg("\\u00fc", QString::fromUtf8("\u00fc"));
        dbg("\\xc3\\xbc", QString::fromUtf8("\xc3\xbc"));
        dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString::fromUtf8("ü"));

        qDebug() << "\n3. Using QString::fromLocal8Bit(char const *)";
        dbg("\\u00fc", QString::fromLocal8Bit("\u00fc"));
        dbg("\\xc3\\xbc", QString::fromLocal8Bit("\xc3\xbc"));
        dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString::fromLocal8Bit("ü"));
    }

    return app.exec();
}

Output on mingw 4.4.0 on Windows XP:

system name: "pl_PL"

Without codecForCStrings (default is Latin1)

codecForCStrings: "NOT SET"
codecForLocale: "System"

1. Using QString::QString(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

2. Using QString::fromUtf8(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input:  \u00fc ,  Unicode codepoints:  "102 13d "
Input:  \xc3\xbc ,  Unicode codepoints:  "102 13d "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

With codecForCStrings set to UTF-8

codecForCStrings: "UTF-8"
codecForLocale: "System"

1. Using QString::QString(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

2. Using QString::fromUtf8(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input:  \u00fc ,  Unicode codepoints:  "102 13d "
Input:  \xc3\xbc ,  Unicode codepoints:  "102 13d "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

Without codecForCStrings (default is Latin1), with codecForLocale set to UTF-8

codecForCStrings: "NOT SET"
codecForLocale: "UTF-8"

1. Using QString::QString(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

2. Using QString::fromUtf8(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

Without codecForCStrings (default is Latin1), with codecForLocale set to Latin1

codecForCStrings: "NOT SET"
codecForLocale: "ISO-8859-1"

1. Using QString::QString(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

2. Using QString::fromUtf8(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "
codecForCStrings: "NOT SET"
codecForLocale: "ISO-8859-1"

1. Using QString::QString(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

2. Using QString::fromUtf8(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "
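
The fffd results come from the literal "ü" being stored as the single byte 0xfc (as the fc results for the Latin1 path show), and a lone 0xfc is not valid UTF-8. Below is a minimal sketch of a UTF-8 decoder that substitutes U+FFFD for bytes it cannot decode. It illustrates the idea only and is not Qt's decoder; decodeUtf8 and kReplacement are names chosen for this example.

```cpp
#include <cstddef>
#include <string>

constexpr char32_t kReplacement = 0xFFFD;  // REPLACEMENT CHARACTER

// Decode UTF-8, emitting U+FFFD for each byte that does not start a valid sequence.
std::u32string decodeUtf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        char32_t cp = 0;        // code point being assembled
        char32_t minimum = 0;   // smallest value allowed for this length
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1; cp = lead & 0x1F; minimum = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; cp = lead & 0x0F; minimum = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; cp = lead & 0x07; minimum = 0x10000;
        } else {
            out += kReplacement;  // e.g. a lone 0xfc ends up here
            ++i;
            continue;
        }
        bool ok = i + extra < bytes.size();
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) {
                ok = false;
            } else {
                cp = (cp << 6) | (c & 0x3F);
            }
        }
        const bool surrogate = cp >= 0xD800 && cp <= 0xDFFF;
        if (!ok || cp < minimum || cp > 0x10FFFF || surrogate) {
            out += kReplacement;
            ++i;
            continue;
        }
        out += cp;
        i += extra + 1;
    }
    return out;
}
```

decodeUtf8("\xfc") returns the single code point U+FFFD, whereas decodeUtf8("\xc3\xbc") returns U+00FC, in line with the fromUtf8 rows of the output above.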

I'd like to thank thiago, cbreak, peppe and heinz from the #qt channel on the freenode.org IRC network for pointing out and helping me understand the issues involved here.