mysqldump exporting data in a bad character set

user2149321 picture user2149321 · Mar 30, 2013 · Viewed 9.4k times · Source

Yesterday for the first time I exported my Mysql database and I found some very strange characters in the dump such as:

INSERT INTO `piwik_archive_blob_2013_01` VALUES (15,'Actions_actions_url_6',1,'2013-01-17','2013-01-17',1,'2013-01-20 07:36:53','xuNM0ý/œ#&ÝÕ³\ZõNYpÊÀì#!üw7Hж}°ÀAáZoN*šgµ\'GWª[Yûðe¯57 ÃÁÆ7|Ÿ\'Ü%µDh©-EÛ^ËL±ÕÞtªk@(,b±ßZ.ÒÃ6b²aiÓÍ)87[­ïÎœ,æya¥uÒ<|+íª7MNuïÝ¿8ñ%1Ʊ>Ú­X');

The version of my server MySQL is: 5.1.66-0+squeeze1 (Debian). This database was created automatically by the Piwik setup script.

Here is what I tried to fix this problem:

#1 First I checked the database charset.

> show table status;

The 26 tables has the collation utf8_general_ci which sounded quite normal. I guessed mysqldump exported in a different charset (latin1?) So I tried:

mysqldump -u user -p**** --all-databases --default-character-set=utf8 | gzip -9 > dump.sql.gz

Result = I still had the same strange characters.

Note) later I learned that the default charset for mysqldump is utf8, regardless of server default charset. So --default-character-set=utf8 was useless.

#2 Then I thought I could solve the problem by updating the mysql conf. The original conf was:

mysql> show variables like "%character%";show variables like "%collation%";

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci |
| collation_database   | latin1_swedish_ci |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+

So I updated /var/lib/mysql/my.cnf and I added:

[mysqld]
init_connect='SET collation_connection = utf8_general_ci'
init_connect='SET NAMES utf8'
character-set-server=utf8
collation-server=utf8_general_ci
default-character-set=utf8
default-collation=utf8_general_ci

[mysqldump]
default-character-set=utf8

Then

/etc/init.d/mysql restart
mysql> show variables like "%character%";show variables like "%collation%";

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | utf8_general_ci   |
| collation_database   | latin1_swedish_ci |
| collation_server     | utf8_general_ci   |
+----------------------+-------------------+

Result = same strange characters.

#3 I changed character_set_database and collation_database:

mysql> ALTER DATABASE piwik default character SET utf8 collate utf8_general_ci;

mysql> show variables like "%character%";show variables like "%collation%";

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_general_ci |
| collation_server     | utf8_general_ci |
+----------------------+-----------------+

Result = same strange characters.

#4 I understand now I should have changed the default character set in MySQL (latin1) to utf8 before creating the database.

The collation utf8_general_ci (#1) means data is stored in utf8. However, would it be possible that mysqldump thinks data is stored in latin1 and encodes data in utf8? It would mean at the end data is double utf8 encoded (sigh). In this case how could I fix the problem?

Thanks for your help.

ps) I wonder why Piwik doesn't require to change the database default charset to utf8.

Answer

Eric Citaire picture Eric Citaire · Mar 22, 2014

According to the table name "piwik_archive_blob_2013_01", I guess the column containing the strange characters is of type BLOB.

BLOB columns contains binary data. That's why it contains these strange characters. This is expected.

Don't worry about it, I'm pretty sure MysqlDump knows how to dump this data.

Cheers, Eric.