I tried to use UTF-8 and ran into trouble.
I have tried so many things; here are the results I have gotten:
????
instead of Asian characters. Even for European text, I got Se?or
for Señor
.Señor
or 新浪新闻
for 新浪新闻
.Se
for Señor
.What am I doing wrong? How can I fix the code? Can I recover the data, if so, how?
This problem plagues the participants of this site, and many others.
You have listed the five main cases of CHARACTER SET
troubles.
Best Practice
Going forward, it is best to use CHARACTER SET utf8mb4
and COLLATION utf8mb4_unicode_520_ci
. (There is a newer version of the Unicode collation in the pipeline.)
utf8mb4
is a superset of utf8
in that it handles 4-byte utf8 codes, which are needed by Emoji and some of Chinese.
Outside of MySQL, "UTF-8" refers to all size encodings, hence effectively the same as MySQL's utf8mb4
, not utf8
.
I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.
Overview of what you should do
<form accept-charset="UTF-8">
.CHARACTER SET utf8mb4
(Check with SHOW CREATE TABLE
.)<meta charset=UTF-8>
at the beginning of HTMLMore details for computer languages (and its following sections)
Test the data
Viewing the data with a tool or with SELECT
cannot be trusted.
Too many such clients, especially browsers, try to compensate for incorrect encodings, and show you correct text even if the database is mangled.
So, pick a table and column that has some non-English text and do
SELECT col, HEX(col) FROM tbl WHERE ...
The HEX for correctly stored UTF-8 will be
20
4x
, 5x
, 6x
, or 7x
Cxyy
Dxyy
Exyyzz
F0yyzzww
Specific causes and fixes of the problems seen
Truncated text (Se
for Señor
):
Black Diamonds with question marks (Se�or
for Señor
);
one of these cases exists:
Case 1 (original bytes were not UTF-8):
SET NAMES
) for the INSERT
and the SELECT
was not utf8/utf8mb4. Fix this.CHARACTER SET utf8
(or utf8mb4).Case 2 (original bytes were UTF-8):
SET NAMES
) for the SELECT
was not utf8/utf8mb4. Fix this.CHARACTER SET utf8
(or utf8mb4).Black diamonds occur only when the browser is set to <meta charset=UTF-8>
.
Question Marks (regular ones, not black diamonds) (Se?or
for Señor
):
CHARACTER SET utf8
(or utf8mb4). Fix this. (Use SHOW CREATE TABLE
.)Mojibake (Señor
for Señor
):
(This discussion also applies to Double Encoding, which is not necessarily visible.)
INSERTing
and SELECTing
text needs to specify utf8 or utf8mb4. Fix this.CHARACTER SET utf8
(or utf8mb4). Fix this.<meta charset=UTF-8>
.If the data looks correct, but won't sort correctly, then either you have picked the wrong collation, or there is no collation that suits your need, or you have Double Encoding.
Double Encoding can be confirmed by doing the SELECT .. HEX ..
described above.
é should come back C3A9, but instead shows C383C2A9
The Emoji 👽 should come back F09F91BD, but comes back C3B0C5B8E28098C2BD
That is, the hex is about twice as long as it should be.
This is caused by converting from latin1 (or whatever) to utf8, then treating those
bytes as if they were latin1 and repeating the conversion.
The sorting (and comparing) does not work correctly because it is, for example,
sorting as if the string were Señor
.
Fixing the Data, where possible
For Truncation and Question Marks, the data is lost.
For Mojibake / Double Encoding, ...
For Black Diamonds, ...
The Fixes are listed here. (5 different fixes for 5 different situations; pick carefully): http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases