Problem Summary:
While trying to convert a site with mysql database from latin1 to utf8, some special characters are not displaying correctly despite ensuring charsets are all utf8 system wide.
Problem Details:
This is a common problem. But I seem to have an added complexity.
Years ago, a oblivious developer (me), put a site together with MySQL. Some tables were setup with latin1_swedish_ci and utf8_general_ci. All input/display was done via pages with iso-8859-1 charset.
Now, I have the task of turning all this data into utf-8 and thus finally uniforming the encoding. However, i'm having issues with a number of special characters in both instances (ie: ü). The characters don't seem to display correctly on a UTF-8 page. They display as �.Instead When viewing the data in a utf8 table in mysql query browser, a correctly entered utf8'd 'u' displays as some special characters, while an incorrectly latin1 'u' displays as it should appear on page. But it doesn't.
I've tried a number of things:
Nothing seems to cure the data.
Dumping the entire database and important isn't really a viable option as it's a huge database now and downtime is restricted.
UPDATE (22-Oct-2013)
I've taken @deceze suggestions and reviewed all my content encoding areas as per http://kunststube.net/frontback/. I did find a few places in which I was still passing/encoding data in latin1. So, i've now changed it all over to UTF-8. However, the data is still displaying incorrectly in a particular field. In a table which is in utf8 (no columns have implicit encoding), field1 is in latin1. I can confirm this by running the following which displays the text correctly:
select convert(cast(convert(field1 using latin1) as binary) using utf8) from my table WHERE id = 1
This will convert Hahnemühle to Hahnemühle.
In field2, it appears the data is in a different (unknown) encoding. The query above, when used on field2 converts Hahnem�hle to Hahnem�hle. I've gone through all the charsets on http://dev.mysql.com/doc/refman/5.5/en/charset-charsets.html replacing latin1 but none seem to spit out the data correctly.
Setting a column to latin1
and others to utf8
is perfectly fine in MySQL. There's no problem to be solved here as such. This charset parameter just influences how the data is stored internally. Which of course also means that you cannot store, for example, "漢字" in a latin1
column. But assuming you're just storing "Latin-1 characters" in there, that's fine.
MySQL has something commonly called the connection encoding. It tells MySQL what encoding text is in that you send to it from PHP (or elsewhere), and what encoding you'd like back when retrieving data from MySQL. The column charset, the "input connection encoding" and "output connection encoding" can all be different things, MySQL will convert encodings on the fly accordingly as needed.
So, assuming you've used the correct connection encodings so far and data is stored properly in your database and you've not tried to store non-Latin-1 characters in Latin-1 columns, all you need to do to update your column charsets to UTF-8 is:
ALTER TABLE table MODIFY column TEXT [...] CHARACTER SET utf8;