Converting mysql tables from latin1 to utf8

Brent Baisley picture Brent Baisley · Nov 1, 2010 · Viewed 12.3k times · Source

I'm trying to convert some mysql tables from latin1 to utf8. I'm using the following command, which seems to mostly work.

ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

However, on one table I get an error about a duplicate key entry. This is caused by a unique index on a "name" field. It seems when converting to utf8, any "special" characters are indexed as their straight english equivalent. For example, there is already a record with a name field value of "Dru". When converting to utf8, a record with "Drü" is considered a duplicate. The same with "Patrick" and "Påtrìçk".

Here is how to reproduce the issue:

CREATE TABLE `example` (   `name` char(20) CHARACTER SET latin1 NOT NULL,
  PRIMARY KEY (`name`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1;

INSERT INTO example (name) VALUES ('Drü'),('Dru'),('Patrick'),('Påtrìçk');

ALTER TABLE example convert to character set utf8 collate utf8_general_ci;
ERROR 1062 (23000): Duplicate entry 'Dru' for key 1

Answer

Hammerite picture Hammerite · Nov 1, 2010

The reason why the strings 'Drü' and 'Dru' evaluate as the same is that in the utf8_general_ci collation, they count as "the same". The purpose of a collation for a character set is to provide a set of rules as to when strings are the same, when one sorts before the other, and so on.

If you want a different set of comparison rules, you need to choose a different collation. You can see the available collations for the utf8 character set by issuing SHOW COLLATION LIKE 'utf8%'. There are a bunch of collations intended for text that is mostly in a specific language; there is also the utf8_bin collation which compares all strings as binary strings (i.e. compares them as sequences of 0s and 1s).