Why does .NET use UTF-16 encoding for string, but UTF-8 as the default for saving files?

Royi Namir · Feb 18, 2013 · Viewed 41.1k times

From here

Essentially, string uses the UTF-16 character encoding form

But when saving via StreamWriter:

This constructor creates a StreamWriter with UTF-8 encoding without a Byte-Order Mark (BOM),
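To see that default in action, here is a minimal sketch (file names are arbitrary) comparing the parameterless-encoding constructor with one that is handed Encoding.UTF8 explicitly; the latter emits the three-byte BOM because that particular Encoding instance is configured to do so:

    using System;
    using System.IO;
    using System.Text;

    class BomDemo
    {
        static void Main()
        {
            // Default constructor: UTF-8 without a byte-order mark.
            using (var writer = new StreamWriter("default.txt"))
            {
                writer.Write("héllo");
            }

            // Passing Encoding.UTF8 explicitly does emit a BOM (EF BB BF),
            // because that Encoding instance is configured to emit one.
            using (var writer = new StreamWriter("explicit.txt", false, Encoding.UTF8))
            {
                writer.Write("héllo");
            }

            Console.WriteLine(File.ReadAllBytes("default.txt").Length);  // 6  (h + é×2 + l + l + o)
            Console.WriteLine(File.ReadAllBytes("explicit.txt").Length); // 9  (3-byte BOM + 6)
        }
    }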

I've seen a sample comparison (broken link removed), and it looks like UTF-8 is smaller for some strings while UTF-16 is smaller for others.
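Since the original image is gone, roughly the same comparison can be reproduced with Encoding.GetByteCount; the sample strings below are my own, not the ones from the lost screenshot:

    using System;
    using System.Text;

    class SizeComparison
    {
        static void Main()
        {
            string[] samples = { "Hello", "こんにちは", "😀" };

            foreach (string s in samples)
            {
                int utf8  = Encoding.UTF8.GetByteCount(s);     // UTF-8
                int utf16 = Encoding.Unicode.GetByteCount(s);  // UTF-16 (little-endian)
                Console.WriteLine($"{s}: UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes");
            }
            // Hello:       UTF-8 = 5,  UTF-16 = 10  (ASCII favours UTF-8)
            // こんにちは:  UTF-8 = 15, UTF-16 = 10  (BMP CJK text favours UTF-16)
            // 😀:          UTF-8 = 4,  UTF-16 = 4   (outside the BMP: both need 4 bytes)
        }
    }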

  • So why does .NET use UTF-16 as the default encoding for string, but UTF-8 when saving files?

Thank you.

P.S. I've already read the famous article.

Answer

Jon Skeet · Feb 18, 2013

If you're happy ignoring surrogate pairs (or equivalently, the possibility of your app needing characters outside the Basic Multilingual Plane), UTF-16 has some nice properties, basically due to always requiring two bytes per code unit and representing all BMP characters in a single code unit each.
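A small illustration of those code units (the sample characters are chosen arbitrarily): char is a 16-bit UTF-16 code unit, so a BMP character fits in a single char, while anything beyond the BMP needs a surrogate pair:

    using System;

    class SurrogatePairs
    {
        static void Main()
        {
            Console.WriteLine(sizeof(char));   // 2 -- a char is one 16-bit UTF-16 code unit

            string bmp    = "€";   // U+20AC, inside the BMP: one code unit
            string beyond = "😀";  // U+1F600, outside the BMP: a surrogate pair

            Console.WriteLine(bmp.Length);     // 1
            Console.WriteLine(beyond.Length);  // 2 -- Length counts code units, not characters

            // Recovering the real code point means looking at both halves of the pair.
            Console.WriteLine(char.IsHighSurrogate(beyond[0]));               // True
            Console.WriteLine(char.ConvertToUtf32(beyond, 0).ToString("X"));  // 1F600
        }
    }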

Consider the primitive type char. If we use UTF-8 as the in-memory representation and want to cope with all Unicode characters, how big should that be? It could be up to 4 bytes... which means we'd always have to allocate 4 bytes. At that point we might as well use UTF-32!
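To make the "up to 4 bytes" point concrete, here's a quick sketch (code points picked arbitrarily to hit each width) showing the variable per-code-point cost of UTF-8 next to the fixed 4 bytes of UTF-32:

    using System;
    using System.Text;

    class PerCodePointWidth
    {
        static void Main()
        {
            // One example code point for each UTF-8 width: 1, 2, 3 and 4 bytes.
            string[] codePoints = { "A", "é", "€", "😀" };

            foreach (string cp in codePoints)
            {
                Console.WriteLine($"{cp}: UTF-8 = {Encoding.UTF8.GetByteCount(cp)} byte(s), " +
                                  $"UTF-32 = {Encoding.UTF32.GetByteCount(cp)} bytes");
            }
            // A: 1 vs 4,  é: 2 vs 4,  €: 3 vs 4,  😀: 4 vs 4
        }
    }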

Of course, we could use UTF-32 as the char representation, but UTF-8 in the string representation, converting as we go.

The two disadvantages of UTF-16 are:

  • The number of code units per Unicode character is variable, because not all characters are in the BMP. Until emoji became popular, this didn't affect many apps in day-to-day use. These days, certainly for messaging apps and the like, developers using UTF-16 really need to know about surrogate pairs.
  • For plain ASCII (which a lot of text is, at least in the west) it takes twice the space of the equivalent UTF-8 encoded text.

(As a side note, I believe Windows uses UTF-16 for Unicode data, and it makes sense for .NET to follow suit for interop reasons. That just pushes the question on one step though.)

Given the problems of surrogate pairs, I suspect if a language/platform were being designed from scratch with no interop requirements (but basing its text handling in Unicode), UTF-16 wouldn't be the best choice. Either UTF-8 (if you want memory efficiency and don't mind some processing complexity in terms of getting to the nth character) or UTF-32 (the other way round) would be a better choice. (Even getting to the nth character has "issues" due to things like different normalization forms. Text is hard...)
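As a concrete taste of the normalization issue mentioned above (the strings here are purely illustrative): "café" can be written as four code points or five, and code-unit indexing sees those two spellings differently even though they render identically:

    using System;
    using System.Globalization;
    using System.Text;

    class NthCharacter
    {
        static void Main()
        {
            string precomposed = "caf\u00E9";    // 'é' as a single precomposed code point
            string decomposed  = "cafe\u0301";   // 'e' followed by a combining acute accent

            Console.WriteLine(precomposed.Length);  // 4 UTF-16 code units
            Console.WriteLine(decomposed.Length);   // 5 UTF-16 code units

            // Both render as "café" and compare equal once normalized (NFC here).
            Console.WriteLine(precomposed.Normalize() ==
                              decomposed.Normalize(NormalizationForm.FormC));    // True

            // StringInfo counts user-perceived characters (text elements) rather than code units.
            Console.WriteLine(new StringInfo(decomposed).LengthInTextElements);  // 4
        }
    }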