Yesterday for the first time I exported my Mysql database and I found some very strange characters in the dump such as:
INSERT INTO `piwik_archive_blob_2013_01` VALUES (15,'Actions_actions_url_6',1,'2013-01-17','2013-01-17',1,'2013-01-20 07:36:53','xuNM0ý/œ#&ÝÕ³\ZõNYpÊÀì#!üw7Hж}°ÀAáZoN*šgµ\'GWª[Yûðe¯57 ÃÁÆ7|Ÿ\'Ü%µDh©-EÛ^ËL±ÕÞtªk@(,b±ßZ.ÒÃ6b²aiÓÍ)87[ïÎœ,æya¥uÒ<|+íª7MNuïÝ¿8ñ%1Ʊ>ÚX');
The version of my server MySQL is: 5.1.66-0+squeeze1 (Debian). This database was created automatically by the Piwik setup script.
Here is what I tried to fix this problem:
#1 First I checked the database charset.
> show table status;
The 26 tables has the collation utf8_general_ci which sounded quite normal. I guessed mysqldump exported in a different charset (latin1?) So I tried:
mysqldump -u user -p**** --all-databases --default-character-set=utf8 | gzip -9 > dump.sql.gz
Result = I still had the same strange characters.
Note) later I learned that the default charset for mysqldump is utf8, regardless of server default charset. So --default-character-set=utf8
was useless.
#2 Then I thought I could solve the problem by updating the mysql conf. The original conf was:
mysql> show variables like "%character%";show variables like "%collation%";
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
+----------------------+-------------------+
| Variable_name | Value |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci |
| collation_database | latin1_swedish_ci |
| collation_server | latin1_swedish_ci |
+----------------------+-------------------+
So I updated /var/lib/mysql/my.cnf
and I added:
[mysqld]
init_connect='SET collation_connection = utf8_general_ci'
init_connect='SET NAMES utf8'
character-set-server=utf8
collation-server=utf8_general_ci
default-character-set=utf8
default-collation=utf8_general_ci
[mysqldump]
default-character-set=utf8
Then
/etc/init.d/mysql restart
mysql> show variables like "%character%";show variables like "%collation%";
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
+----------------------+-------------------+
| Variable_name | Value |
+----------------------+-------------------+
| collation_connection | utf8_general_ci |
| collation_database | latin1_swedish_ci |
| collation_server | utf8_general_ci |
+----------------------+-------------------+
Result = same strange characters.
#3
I changed character_set_database
and collation_database
:
mysql> ALTER DATABASE piwik default character SET utf8 collate utf8_general_ci;
mysql> show variables like "%character%";show variables like "%collation%";
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
+----------------------+-----------------+
| Variable_name | Value |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database | utf8_general_ci |
| collation_server | utf8_general_ci |
+----------------------+-----------------+
Result = same strange characters.
#4 I understand now I should have changed the default character set in MySQL (latin1) to utf8 before creating the database.
The collation utf8_general_ci
(#1) means data is stored in utf8. However, would it be possible that mysqldump thinks data is stored in latin1 and encodes data in utf8?
It would mean at the end data is double utf8 encoded (sigh). In this case how could I fix the problem?
Thanks for your help.
ps) I wonder why Piwik doesn't require to change the database default charset to utf8.
According to the table name "piwik_archive_blob_2013_01", I guess the column containing the strange characters is of type BLOB.
BLOB columns contains binary data. That's why it contains these strange characters. This is expected.
Don't worry about it, I'm pretty sure MysqlDump knows how to dump this data.
Cheers, Eric.