I have a SQL Server 2010 database shared between two applications. One application we have control over; the other is a third-party app that created the database in the first place. Our app is a CRM built on top of the third-party webmail app.
The database contains varchar columns and is latin-1 encoded. The third-party app is written in php and doesn't care about correctly encoding the data, so it stuffs utf-8 encoded bytes into the varchar columns, where they are interpreted as latin-1 and look like garbage.
Our CRM app is written in .Net, which automagically detects that the database collation is different from the encoding of the string in memory, so when .Net writes to the database, it converts the bytes to match the database encoding.
So... data written to the db from our app looks correct in the db, but data from the third party app doesn't.
when our app writes FirstName = Céline, it is stored in the db as Céline
when the webmail app writes FirstName = Céline it is stored in the db as CÃ©line
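To make the mechanism concrete, here is a small snippet (mine, not from the original setup) that reproduces the second row: take the UTF-8 bytes the PHP app sends and read them back as latin-1, which is effectively what the varchar column does:

using System;
using System.Text;

class MojibakeDemo
{
    static void Main()
    {
        // What the webmail app sends: the UTF-8 bytes of the correct name.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("Céline");   // 43 C3 A9 6C 69 6E 65

        // What the latin-1 varchar column effectively stores: each UTF-8 byte
        // becomes its own single-byte character.
        string storedInDb = Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);

        Console.WriteLine(storedInDb);   // CÃ©line
    }
}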
Our CRM app needs to display contacts that were created in either system. So I'm writing an EncodingSniffer class that looks for flagged characters indicating it's a poorly encoded string, and converts them.
currently I have:
private static string[] _flaggedChars = new string[] { "Ã©" };
which works great for displaying CÃ©line as Céline, but I need to add to the list.
Does anyone know of a resource to get all the possible ways that utf-8 special chars could be interpreted as iso-8859-1?
Thanks
Clarification: since I am working in .Net, the string, when loaded into memory from the database, is converted to Unicode (UTF-16). So regardless of whether it was encoded correctly in the database, it is now represented as UTF-16 code units. I need to be able to analyze these UTF-16 characters and determine whether they are screwed up due to UTF-8 bytes being stuffed into an ISO-8859-1 database... clear as mud, right?
Here is what I have so far. It has cleaned up the display of most misencoded characters, but I am still having trouble with É, for instance: Éric is stored in the db by webmail as Ã‰ric, but after detecting the bad encoding and changing it back, it displays as �?ric. Looking at a user who has 2500 contacts, hundreds of which had encoding issues, the É is the only thing that isn't displaying correctly...
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static Regex CreateRegex()
{
    // Latin-1 letters likely to appear in names.
    string specials = "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö";

    List<string> flags = new List<string>();
    foreach (char c in specials)
    {
        // Take each special character, get its UTF-8 bytes, and interpret those bytes as latin-1.
        string interpretedAsLatin1 = Encoding.GetEncoding("iso-8859-1").GetString(Encoding.UTF8.GetBytes(c.ToString())).Trim();
        if (interpretedAsLatin1.Length > 0) // each UTF-8 char here is 2 bytes, which latin-1 sees as two single-byte chars
            flags.Add(interpretedAsLatin1);
    }

    string regex = string.Empty;
    foreach (string s in flags)
    {
        if (regex.Length > 0)
            regex += '|';
        regex += s;
    }
    return new Regex("(" + regex + ")");
}

public static string CheckUTF(string data)
{
    Match match = CreateRegex().Match(data);
    if (match.Success)
        // Round-trip: string -> latin-1 bytes -> decode those bytes as UTF-8.
        return Encoding.UTF8.GetString(Encoding.GetEncoding("iso-8859-1").GetBytes(data));
    else
        return data;
}
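For what it's worth, a quick check of the two cases described above (my test calls, using the methods as posted):

Console.WriteLine(CheckUTF("CÃ©line"));   // "Céline" -- the é case is detected and fixed
Console.WriteLine(CheckUTF("Ã‰ric"));     // "�?ric"  -- the É case is detected but comes back mangled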
So: É is being converted to 195 'Ã', 8240 '‰'.
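The following experiment (mine, not part of the original post) reproduces those two numbers and, I suspect, explains why only É misbehaves: 8240 is U+2030 '‰', which is what Windows code page 1252 maps byte 0x89 to, and SQL Server's Latin1_General / SQL_Latin1_General_CP1 collations actually use code page 1252, not strict ISO-8859-1. Strict ISO-8859-1 has no character for '‰', so converting the string back with iso-8859-1 turns it into '?' and the UTF-8 decode then breaks:

using System;
using System.Text;

class EAcuteDemo
{
    static void Main()
    {
        byte[] utf8 = Encoding.UTF8.GetBytes("É");                    // C3 89

        // Reading those bytes back through code page 1252 (what SqlClient uses for
        // a Latin1_General varchar column) gives 'Ã' (195) and '‰' (8240).
        string fromDb = Encoding.GetEncoding(1252).GetString(utf8);
        Console.WriteLine((int)fromDb[0] + " " + (int)fromDb[1]);     // 195 8240

        // Converting back with strict ISO-8859-1 loses the '‰': it has no mapping
        // for U+2030, so the encoder substitutes '?' (0x3F) ...
        byte[] backAsLatin1 = Encoding.GetEncoding("iso-8859-1").GetBytes(fromDb);   // C3 3F

        // ... and C3 3F is not valid UTF-8, hence the "�?" in "�?ric".
        Console.WriteLine(Encoding.UTF8.GetString(backAsLatin1));     // "�?"

        // Round-tripping through code page 1252 on both sides keeps byte 0x89 intact,
        // so the UTF-8 decode recovers "É".
        Console.WriteLine(Encoding.UTF8.GetString(Encoding.GetEncoding(1252).GetBytes(fromDb)));   // "É"
    }
}

So swapping "iso-8859-1" for Encoding.GetEncoding(1252) in CreateRegex and CheckUTF may be all that's needed, assuming the column really does use one of the 1252-based collations.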
You should probably just try to decode the byte string as UTF-8, and if you get an error, assume it's ISO-8859-1 instead.
Text that is encoded as ISO-8859-1 rarely "happens" to also be valid UTF-8... unless it's ISO-8859-1 that only actually contains ASCII, but in that case you don't have a problem at all, of course. So this method is reasonably robust.
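Translated into .NET terms, that suggestion looks something like the sketch below (my code, not the answerer's; the class and method names are made up). A UTF-8 decoder configured with an exception fallback does the detection. Note that, per the clarification above, the data arrives from SqlClient already decoded into a UTF-16 string, so you first have to turn it back into the raw column bytes; I'm assuming Windows-1252 for that step.

using System.Text;

static class Utf8Sniffer
{
    // UTF-8 decoder that throws DecoderFallbackException instead of silently
    // substituting replacement characters for invalid byte sequences.
    private static readonly Encoding StrictUtf8 = Encoding.GetEncoding(
        "utf-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

    // Prefer UTF-8; if the bytes are not valid UTF-8, treat them as Latin-1.
    public static string Decode(byte[] rawColumnBytes)
    {
        try
        {
            return StrictUtf8.GetString(rawColumnBytes);
        }
        catch (DecoderFallbackException)
        {
            return Encoding.GetEncoding("iso-8859-1").GetString(rawColumnBytes);
        }
    }

    // Convenience overload for strings that SqlClient has already decoded with
    // the column's code page (assumed to be Windows-1252 here).
    public static string Fix(string valueFromDb)
    {
        return Decode(Encoding.GetEncoding(1252).GetBytes(valueFromDb));
    }
}

With that, Utf8Sniffer.Fix("CÃ©line") comes back as "Céline", while a correctly stored "Céline" fails the strict UTF-8 pass and falls through unchanged.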
Ignoring which characters occur more frequently than others in actual language, here is a naive analysis that assumes each character occurs with the same frequency. Let's try to find out how often valid ISO-8859-1 can be mistaken for valid UTF-8, resulting in mojibake. I also assume that C1 control characters (U+0080 through U+009F) don't occur.
Consider any given byte in the byte string. If the byte is close to the end of the string, you are even more likely to detect malformed UTF-8, because some byte sequences will be known to be too short to be valid UTF-8. But assuming the byte is not near the end of the string, the probability of detecting bad UTF-8 at that byte (the sum of the probabilities of the different ways the byte sequence can fail UTF-8 validation) is 0.33 + 0.09 + 0.07 + 0.04 = 0.53.
So for a long string, the probability that ISO-8859-1 will pass silently through a UTF-8 decoder is very small: it approximately halves for each additional character! After 20 such characters, for example, the chance is already down to roughly 0.47^20 ≈ 3×10^-7.
This analysis of course assumes random ISO-8859-1 characters. In practice the detection rate will not be quite as good as that (mostly because most bytes in real-world text are plain ASCII, which never trips the check), but it will still be very good.
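If you'd rather convince yourself empirically than by the analysis above, here is a quick Monte-Carlo check (my sketch; the exact rate depends on the character distribution you pick, here the same "non-ASCII, no C1 controls" assumption as above):

using System;
using System.Text;

class FalsePositiveCheck
{
    static void Main()
    {
        Random random = new Random();
        Encoding strictUtf8 = Encoding.GetEncoding(
            "utf-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        int falsePasses = 0;
        const int trials = 100000;
        const int length = 10;

        for (int i = 0; i < trials; i++)
        {
            // Random ISO-8859-1 text drawn from 0xA0-0xFF (skipping ASCII and C1 controls).
            char[] chars = new char[length];
            for (int j = 0; j < length; j++)
                chars[j] = (char)random.Next(0xA0, 0x100);

            byte[] latin1Bytes = latin1.GetBytes(new string(chars));
            try
            {
                strictUtf8.GetString(latin1Bytes);
                falsePasses++;   // the latin-1 bytes happened to also be valid UTF-8
            }
            catch (DecoderFallbackException)
            {
                // expected: detected as not UTF-8
            }
        }

        Console.WriteLine(falsePasses + " of " + trials + " random strings slipped through");
    }
}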