After viewing my prod logs, I have some error mentionning :
[2012-08-31 15:56:43] request.CRITICAL: Doctrine\DBAL\DBALException:
An exception occurred while executing 'SELECT t0.username ....... FROM fos_user t0 WHERE t0.username = ?'
with params {"1":"Nrv\u29e7Kasi"}:
SQLSTATE[HY000]: General error: 1267 Illegal mix of collations (latin1_swedish_ci,IMPLICIT)
and (utf8_general_ci,COERCIBLE) for operation '='
Alghout i have UTF-8 default under the doctrine cfg :
doctrine:
dbal:
charset: UTF8
It seems that all my MySQL Tables are in latin1_swedish_ci
, so my question is :
Can I manually change the collation to utf8_general_ci
for all my tables without any complications/precautions ?
It is helpful to understand the following definitions:
A character encoding details how each symbol is represented in binary (and therefore stored in the computer). For example, the symbol é
(U+00E9, latin small letter E with acute) is encoded as 0xc3a9
in UTF-8 (which MySQL calls utf8
) and 0xe9
in Windows-1252 (which MySQL calls latin1
).
A character set is the alphabet of symbols that can be represented using a given character encoding. Confusingly, the term is also used to mean the same as character encoding.
A collation is an ordering on a character set, so that strings can be compared. For example: MySQL's latin1_swedish_ci
collation treats most accented variations of a character as equivalent to the base character, whereas its latin1_general_ci
collation will order them before the next base character but not equivalent (there are other, more significant, differences too: such as the order of characters like å
, ä
, ö
and ß
).
MySQL will decide which collation should be applied to a given expression as documented under Collation of Expressions: in particular, the collation of a column takes precedence over that of a string literal.
The WHERE
clause of your query compares the following strings:
a value in fos_user.username
, encoded in the column's character set (Windows-1252) and expressing a preference for its collation latin1_swedish_ci
(with a coercibility value of 2); with
the string literal 'Nrv⧧Kasi'
, encoded in the connection's character set (UTF-8, as configured by Doctrine) and expressing a preference for the connection's collation utf8_general_ci
(with a coercibility value of 4).
Since the first of these strings has a lower coercibility value than the second, MySQL attempts to perform the comparison using that string's collation: latin1_swedish_ci
. To do so, MySQL attempts to convert the second string to latin1
—but since the ⧧
character does not exist in that character set, the comparison fails.
One should pause for a moment to consider how the column is currently encoded: you are attempting to filter for records where fos_user.username
is equal to a string that contains a character which cannot exist in that column!
If you believe that the column does contain such characters, then you probably wrote to the column whilst the connection character encoding was set to something (e.g. latin1
) that caused MySQL to interpret the received byte sequence as characters which are all in the Windows-1252 character set.
If this is the case, before continuing any further you should fix your data!
convert such columns to the character encoding that was used on data insertion, if different to the incumbent encoding:
ALTER TABLE fos_users MODIFY username VARCHAR(123) CHARACTER SET foo;
drop the encoding information associated with such columns by converting them to the binary
character set:
ALTER TABLE fos_users MODIFY username VARCHAR(123) CHARACTER SET binary;
associate with such columns the encoding in which data was actually transmitted by converting them to the relevant character set.
ALTER TABLE fos_users MODIFY username VARCHAR(123) CHARACTER SET bar;
Note that, if converting from a multi-byte encoding, you may need to increase the size of the column (or even change its type) in order to accomodate the maximum possible length of the converted string.
Once one is certain that the columns are correctly encoded, one could force the comparison to be conducted using a Unicode collation by either—
explicitly converting the value fos_user.username
to a Unicode character set:
WHERE CONVERT(fos_user.username USING utf8) = ?
forcing the string literal to have a lower coercibility value than the column (will cause an implicit conversion of the column's value to UTF-8):
WHERE fos_user.username = ? COLLATE utf8_general_ci
Or one could, as you say, permanently convert the column(s) to a Unicode encoding and set its collation appropriately.
Can I manually change the collation to
utf8_general_ci
for all my tables without any complications/precautions ?
The principle consideration is that Unicode encodings take up more space than single-byte character sets, so:
more storage may be required;
comparisons may be slower; and
index prefix lengths may need to be adjusted (note that the maximum is in bytes, so may represent fewer characters than previously).
Also, be aware that, as documented under ALTER TABLE
Syntax:
To change the table default character set and all character columns (
CHAR
,VARCHAR
,TEXT
) to a new character set, use a statement like this:ALTER TABLE tbl_name CONVERT TO CHARACTER SET charset_name;For a column that has a data type of
VARCHAR
or one of theTEXT
types,CONVERT TO CHARACTER SET
will change the data type as necessary to ensure that the new column is long enough to store as many characters as the original column. For example, aTEXT
column has two length bytes, which store the byte-length of values in the column, up to a maximum of 65,535. For alatin1
TEXT
column, each character requires a single byte, so the column can store up to 65,535 characters. If the column is converted toutf8
, each character might require up to three bytes, for a maximum possible length of 3 × 65,535 = 196,605 bytes. That length will not fit in aTEXT
column's length bytes, so MySQL will convert the data type toMEDIUMTEXT
, which is the smallest string type for which the length bytes can record a value of 196,605. Similarly, aVARCHAR
column might be converted toMEDIUMTEXT
.To avoid data type changes of the type just described, do not use
CONVERT TO CHARACTER SET
. Instead, useMODIFY
to change individual columns.