What MySQL collation is best for accepting all unicode characters?

HellaMad picture HellaMad · Jan 15, 2013 · Viewed 21.5k times · Source

Our column is currently collated to latin1_swedish_ci and special unicode characters are, obviously, getting stripped out. We want to be able to accept chars such as U+272A ✪, U+2764 ❤, (see this wikipedia article) etc. I'm leaning towards utf8_unicode_ci, would this collation handle these and other characters? I don't care about speed as this column isn't an index.

MySQL Version: 5.5.28-1

Answer

deceze picture deceze · Jan 17, 2013

The collation is the least of your worries, what you need to think about is the character set for the column/table/database. The collation (rules governing how data is compared and sorted) is just a corollary of that.

MySQL supports several Unicode character sets, utf8 and utf8mb4 being the most interesting. utf8 supports Unicode characters in the BMP, i.e. a subset of all of Unicode. utf8mb4, available since MySQL 5.5.3, supports all of Unicode.

The collation to be used with any of the Unicode encodings is most likely xxx_general_ci or xxx_unicode_ci. The former is a general sorting and comparison algorithm independent of language, the latter is a more complete language independent algorithm supporting more Unicode features (e.g. treating "ß" and "ss" as equivalent), but is therefore also slower.

See https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-sets.html.