C# UNICODE to ANSI conversion

alex · Jun 10, 2013 · Viewed 20.7k times

I need your help with something that puzzles me when working with Unicode encodings in the .NET Framework.

I have to interface with some customer data systems which are non-Unicode applications, and those customers are companies from all over the world (Chinese, Korean, Russian, ...). So they have to provide me with an 8-bit "ANSI" text file, which will be encoded with their Windows code page.

So, if a Greek customer sends me a text file containing 'Σ' (the sigma letter, '\u03A3') in a product name, I will get whatever letter corresponds to ANSI code 211 in my own code page. My computer runs a French Windows, which means the code page is Windows-1252, so I will see 'Ó' in its place in this text file... OK.
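To make the byte-level picture concrete, here is a minimal sketch (not part of the original import code) that decodes the single byte 211 with both code pages. The class and method names are made up for illustration, and the `RegisterProvider` call is only an assumption for readers on .NET Core / .NET 5+, where the Windows code pages must be registered first; on the .NET Framework it can be dropped.

```csharp
using System;
using System.Text;

public static class CodePageDemo
{
    static CodePageDemo()
    {
        // Needed on .NET Core / .NET 5+ only; the .NET Framework
        // already knows the Windows code pages.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Decodes a single byte with the given Windows code page.
    public static string DecodeByte(byte value, int codePage)
    {
        return Encoding.GetEncoding(codePage).GetString(new[] { value });
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(DecodeByte(211, 1252)); // Ó (U+00D3)
        Console.WriteLine(DecodeByte(211, 1253)); // Σ (U+03A3)
    }
}
```

The byte in the file never changes; only the code page used to interpret it does.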

I know this customer is a Greek one, so I can read his file by forcing the windows-1253 code page in my import parameters.

/// <summary>
/// Re-decodes a string that was read with the system ANSI code page
/// (Encoding.Default) using the code page the file was really written in.
/// </summary>
/// <param name="value">Text that was decoded with Encoding.Default.</param>
/// <param name="codePage">Windows code page of the source file (e.g. 1253 for Greek).</param>
/// <returns>The string decoded with the requested code page.</returns>
public static string ToUnicode(string value, int codePage)
{
    Encoding windows = Encoding.Default;

    // GetEncoding throws for an unsupported code page; it never returns null
    Encoding sp = Encoding.GetEncoding(codePage);

    if (String.IsNullOrEmpty(value))
    {
        return value;
    }

    // First get back the original bytes using the Windows encoding
    byte[] wbytes = windows.GetBytes(value);

    // Then decode those bytes with the customer's code page. When both code
    // pages are the same, this simply gives back the original string.
    return sp.GetString(wbytes);
}

Well in the end I got 'Σ' in my application and I am able to save this into my SQL Server database. Now my application has to perform some complex computations, and then I have to give back this file to the customer with an automatic export...

So my problem is that I have to perform a UNICODE => ANSI conversion?! But this is not as simple as I thought at the beginning...

I don't want to save the code page used during import, so my first idea was to convert UNICODE to windows-1252, and then automatically send the file to the customers. They will read the exported text file with their own code page so this idea was interesting for me.

But the problem is that the conversion in this way has a strange behaviour... Here are two different examples:

1st example (я)

char ya = '\u044F'; // CYRILLIC SMALL LETTER YA (я)
string strYa = Char.ConvertFromUtf32(ya);
System.Text.Encoding unicode = System.Text.Encoding.Unicode;
System.Text.Encoding ansi1252 = System.Text.Encoding.GetEncoding(1252);
System.Text.Encoding ansi1251 = System.Text.Encoding.GetEncoding(1251);

// Round trip: Unicode bytes -> ANSI bytes -> string decoded with the same code page
string strYa1252 = ansi1252.GetString(System.Text.Encoding.Convert(unicode, ansi1252, unicode.GetBytes(strYa)));
string strYa1251 = ansi1251.GetString(System.Text.Encoding.Convert(unicode, ansi1251, unicode.GetBytes(strYa)));

So strYa1252 contains '?', whereas strYa1251 contains the valid character 'я'. So it seems impossible to convert to ANSI unless the right code page is passed to the Convert() function... Is there really nothing in the Encoding classes that gives the correspondence between ANSI and Unicode code points?

2nd example (Σ)

char sigma = '\u03A3'; // GREEK CAPITAL LETTER SIGMA (Σ); \u needs exactly four hex digits
string strSigma = Char.ConvertFromUtf32(sigma);
System.Text.Encoding unicode = System.Text.Encoding.Unicode;
System.Text.Encoding ansi1252 = System.Text.Encoding.GetEncoding(1252);
System.Text.Encoding ansi1253 = System.Text.Encoding.GetEncoding(1253);

// Round trip: Unicode bytes -> ANSI bytes -> string decoded with the same code page
string strSigma1252 = ansi1252.GetString(System.Text.Encoding.Convert(unicode, ansi1252, unicode.GetBytes(strSigma)));
string strSigma1253 = ansi1253.GetString(System.Text.Encoding.Convert(unicode, ansi1253, unicode.GetBytes(strSigma)));

This time, I have the correct 'Σ' in the strSigma1253 string, but I also have 'S' in strSigma1252. As indicated at the beginning, I should have 'Ó' if ANSI code has been found, or '?' if the character has not been found, but not 'S'. Why? Yes of course, a linguist could say that 'S' is equivalent to the Greek sigma character because they sound the same in both alphabets, but they don't have the same ANSI code!

So how can the Convert() function in the .NET framework manage this kind of equivalence?

And does someone have an idea to write back ANSI characters from UNICODE in text files I have to send to customers?

Answer

bobince · Jun 11, 2013

I should have ...'?' if the character has not been found, but not 'S'. Why?

This is known as 'best-fit' encoding, and in most cases it's a bad thing. When Windows can't encode a character to the target code page (because Σ does not exist in code page 1252), it makes best efforts to map the character to something a bit like it. This can mean losing the diacritical marks (ë→e), or mapping to a cognate (Σ→S), a character that's related (≤→=), a character that's unrelated but looks a bit similar (∞→8), or whatever other madcap replacement seemed like a good idea at the time but turns out to be culturally or mathematically offensive in practice.

You can see the tables for cp1252, including that Sigma mapping, here.

Apart from being a silent mangling of dubious usefulness, it also has some quite bad security implications. You should be able to stop it happening by setting EncoderFallback to ReplacementFallback or ExceptionFallback.
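As a rough sketch of that suggestion, the encoding can be requested with an explicit fallback through the `Encoding.GetEncoding(int, EncoderFallback, DecoderFallback)` overload. The class and method names below are invented for illustration, and the `RegisterProvider` call is an assumption that only applies on .NET Core / .NET 5+ (on the .NET Framework it can be removed).

```csharp
using System;
using System.Text;

public static class StrictAnsi
{
    static StrictAnsi()
    {
        // Needed on .NET Core / .NET 5+ only; the .NET Framework
        // already knows the Windows code pages.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Unencodable characters become '?' instead of a best-fit lookalike.
    public static byte[] EncodeWithReplacement(string text, int codePage)
    {
        Encoding enc = Encoding.GetEncoding(
            codePage,
            new EncoderReplacementFallback("?"),
            new DecoderReplacementFallback("?"));
        return enc.GetBytes(text);
    }

    // Unencodable characters raise an EncoderFallbackException.
    public static byte[] EncodeOrThrow(string text, int codePage)
    {
        Encoding enc = Encoding.GetEncoding(
            codePage,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        return enc.GetBytes(text);
    }

    public static void Main()
    {
        // Σ does not exist in code page 1252
        byte[] replaced = EncodeWithReplacement("\u03A3", 1252);
        Console.WriteLine(replaced[0]); // 63, the code for '?'

        try
        {
            EncodeOrThrow("\u03A3", 1252);
        }
        catch (EncoderFallbackException ex)
        {
            Console.WriteLine("Cannot encode U+{0:X4}", (int)ex.CharUnknown);
        }
    }
}
```

For an export that goes back to a customer, the throwing variant is probably the safer choice, since a failure is visible while a '?' in a product name is not.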

does someone have an idea to write back ANSI characters from UNICODE in text files I have to send to customers?

You'll have to keep a table of encodings for each customer. Read their input files using that encoding to decode; write their output files using the same encoding.
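One possible shape for such a table is sketched below. The customer identifiers, the class name and the method names are all hypothetical; in a real system the mapping would more likely live in a database or configuration file. The `RegisterProvider` call is again an assumption for .NET Core / .NET 5+ only.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class CustomerFiles
{
    // Hypothetical customer ids mapped to the Windows code page each one uses.
    private static readonly Dictionary<string, int> CodePages =
        new Dictionary<string, int>
        {
            { "greek-customer", 1253 },
            { "russian-customer", 1251 },
            { "french-customer", 1252 },
        };

    static CustomerFiles()
    {
        // Needed on .NET Core / .NET 5+ only; the .NET Framework
        // already knows the Windows code pages.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Strict fallbacks: fail loudly rather than emit '?' or a best-fit character.
    public static Encoding EncodingFor(string customerId)
    {
        return Encoding.GetEncoding(
            CodePages[customerId],
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
    }

    // Reads a customer's file using that customer's code page.
    public static string Import(string customerId, string path)
    {
        return File.ReadAllText(path, EncodingFor(customerId));
    }

    // Writes the file back using the same code page it was read with.
    public static void Export(string customerId, string path, string text)
    {
        File.WriteAllText(path, text, EncodingFor(customerId));
    }
}
```

With this in place, `Export("greek-customer", path, "Σ")` writes the single byte 211, which is what the Greek customer's non-Unicode application expects.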

(For sanity, set new customers to UTF-8 and document that this is the preferred encoding.)