German letters and encoding in C#

eMizo · Nov 15, 2013

I have an unzipping function, and I am using System.Text.Encoding to make sure that the extracted files keep their original names, because the archives I unzip usually contain files with German letters in their names.
I tried different things like Encoding.Default and Encoding.UTF8, but nothing works: äÄéöÖüß.txt gets converted to „Ž‚”™á.txt, or, in the case of Encoding.Default, to black boxes :/

Any suggestions?

// ZipFile (path to the archive) and appPath (target directory) are defined elsewhere.
using (ZipArchive archive = System.IO.Compression.ZipFile.Open(ZipFile, ZipArchiveMode.Read, System.Text.Encoding.Default))
{
    foreach (ZipArchiveEntry entry in archive.Entries)
    {
        string fullPath = Path.Combine(appPath, entry.FullName);
        if (String.IsNullOrEmpty(entry.Name))
        {
            // An entry with an empty Name is a directory.
            Directory.CreateDirectory(fullPath);
        }
        else if (!entry.Name.Equals("Updater.exe"))
        {
            // Overwrite an existing file with the same name.
            entry.ExtractToFile(fullPath, true);
        }
    }
}

Answer

Adriano Repetti · Nov 15, 2013

First of all, the ZIP format was never formally standardized, and in its original form it does not allow Unicode characters in entry names (strictly speaking, you cannot rely on anything other than ASCII).

That said, many tools and libraries let you use a different encoding, but decoding may fail (for example, if you force UTF-8, UTF-32 or whatever on a file whose names were written with another encoding).

If the file name is encoded in ASCII, it gets the code page of your system. Quoting the documentation:

"For entry names that contain only ASCII characters, the language encoding flag is set, and the current system default code page is used to encode the entry names."

You do not have much control over this with the .NET classes. If you do not specify an encoding, you get the default behavior (UTF-8 for names with characters outside ASCII, and the current code page for ASCII names). Most of the time it works (provided encoding and decoding were done with the same code page).

How to avoid this? It's not easy (because we lack a standard), but to summarize:

  • Do not force an encoding (unless you are consuming a ZIP file that you created yourself with a known encoding).
  • The default behavior is pretty good in most cases.
  • For ASCII-encoded ZIPs with extended characters, rely on the system code page (it must be the same on both systems).
  • Give the user a way to change the encoding (you can't detect which encoding the zip utility used, and there is no standard for this). That means being able to change not only the encoding (UTF-8, UTF-16 or whatever) but the code page too (in case they don't match). The GetEncoding function gives you the right encoder for the code page you specify.
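
As a rough illustration of the last two points, here is a minimal sketch that keeps the default behavior unless the user picked a code page. The names ZipNameEncoding, Resolve and ListEntries are made up for this example, and it assumes the code page number comes from some user setting. The RegisterProvider call is needed on .NET Core / .NET 5+ for legacy code pages such as 850; on the classic .NET Framework, GetEncoding should work without it.

```csharp
using System;
using System.IO.Compression;
using System.Text;

static class ZipNameEncoding
{
    // Hypothetical helper: codePage is whatever the user selected;
    // null means "do not force anything".
    public static Encoding Resolve(int? codePage)
    {
        if (codePage == null)
        {
            // Passing null to ZipFile.Open keeps the default behavior.
            return null;
        }

        // Makes legacy code pages (850, 857, 865, ...) available on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(codePage.Value);
    }

    // Prints the entry names as decoded with the chosen code page.
    public static void ListEntries(string zipPath, int? codePage)
    {
        using (ZipArchive archive = ZipFile.Open(zipPath, ZipArchiveMode.Read, Resolve(codePage)))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                Console.WriteLine(entry.FullName);
            }
        }
    }
}
```

Calling ZipNameEncoding.ListEntries(path, null) relies on the default behavior, while ZipNameEncoding.ListEntries(path, 850) decodes the names as CP 850. Listing the names first lets the user check that they look right before anything is extracted.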

The best hint I can give you? Rely on the default behavior (it's pretty common), but give your users a way to change it if you need to be compatible with most of the ZIPs out there (because each tool may implement this differently), not only for the encoding but for the code page too. In particular, do not hard-code a German-specific code page, because it will break with the first Spanish/French/Italian/Dutch file you handle (and there is no common code page for them).

BTW, be ready to handle various exceptions if you open a file with the wrong encoding (as opposed to the wrong code page).

Edit for future readers (from the comments): CP 850 covers most of the common Western European characters, but it is not The Code Page for Europe. Compare it, for example, with Eastern European languages or with Norwegian. It doesn't match them (and in those languages characters outside the 33-127 range are pretty common, because they are letters, not box-drawing characters). Some characters from CP 850 (Ê Ë ı for example) are not available in, let's say, CP 865 (for Norwegian).

Let me explain with an example. You have a file (from Turkey) with this name: "Garip Dosya Adı.txt". The last character of the name, ı, has code 141 in CP 857 (for Turkey). If you decode it with CP 850 you get ì instead of ı, because in the original CP 850 ı has code 213. I won't even mention Far East languages (a fixed code page makes a mess even if you limit yourself to Europe). This is the reason you can't set a fixed code page unless you're writing a small utility for your own use.
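
The code 141 example can be checked with a few lines. This is only a sketch: the CodePageDemo class and its methods are invented for the illustration, and the RegisterProvider call assumes .NET Core / .NET 5+ (on the classic .NET Framework it should not be needed).

```csharp
using System;
using System.Text;

static class CodePageDemo
{
    // Decodes a single raw byte the way a ZIP reader would
    // if it were told to use the given code page.
    public static string Decode(byte value, int codePage)
    {
        // Makes legacy code pages available on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(codePage).GetString(new[] { value });
    }

    public static void Run()
    {
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(Decode(141, 857)); // ı (dotless i, U+0131)
        Console.WriteLine(Decode(141, 850)); // ì (i with grave, U+00EC)
    }
}
```

The same byte yields two different letters depending on the code page, which is why the choice has to come from the user (or from knowledge of how the archive was created) rather than from a constant in the code.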