What characters do not directly map from Cp1252 to UTF-8?

Christian · Oct 12, 2014 · Viewed 59.8k times

I've read in several Stack Overflow answers that some characters do not directly map (or are even "unmappable") when converting from Cp1252 (aka Windows-1252; they're the same, aren't they?) to UTF-8, e.g. here: https://stackoverflow.com/a/23399926/2018047

Can someone please shed some more light on this? Does that mean that if I batch/mass convert source code from cp1252 to utf-8 I'll get some characters that will end up as garbage?

Answer

Karol S · Oct 12, 2014

In the Windows-1252 code page, five bytes have no character assigned to them: 0x81, 0x8D, 0x8F, 0x90, and 0x9D.

If your input file contains any of those bytes and you decode it as Windows-1252, they will be treated as invalid characters. Under normal circumstances, that means the input file was not actually encoded in Windows-1252.

All other bytes encode either printable characters or control characters, and all those characters are present in Unicode and therefore can unambiguously be encoded in UTF-8.
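
The claim above can be checked mechanically. The following is a minimal sketch, assuming Python's built-in `cp1252` codec, which rejects the five unassigned bytes (other decoders may handle them differently). The names `EXPECTED_UNASSIGNED` and `classify_cp1252_bytes` are made up for this illustration.

```python
# Bytes that the Windows-1252 code page leaves unassigned, per the answer above.
EXPECTED_UNASSIGNED = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def classify_cp1252_bytes():
    """Return (mapped, unassigned).

    mapped is a dict {byte value: decoded character};
    unassigned is the set of byte values the codec refuses to decode.
    """
    mapped = {}
    unassigned = set()
    for value in range(256):
        try:
            mapped[value] = bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            unassigned.add(value)
    return mapped, unassigned


mapped, unassigned = classify_cp1252_bytes()

# Every character that Windows-1252 can produce survives a trip through UTF-8.
round_trips = all(
    ch.encode("utf-8").decode("utf-8") == ch for ch in mapped.values()
)

print(sorted(hex(b) for b in unassigned))
print(round_trips)
```

If the unassigned set matches the five bytes listed above and every mapped character round-trips, then a file that really is Windows-1252 can be converted to UTF-8 without loss.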

I have no idea what the linked answer is trying to claim; its last paragraph sounds like nonsense.

A few more remarks that may shed some light on what you are asking:

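  • UTF-8 and Windows-1252 agree only on the ASCII range (bytes 0x00 to 0x7F); outside ASCII they are totally incompatible with each other

  • neither encoding ever produces certain byte values, and the set differs in each case: Windows-1252 never produces 0x81, 0x8D, 0x8F, 0x90, or 0x9D, while UTF-8 never produces 0xC0, 0xC1, or 0xF5 through 0xFF

  • moreover, certain byte sequences are also invalid in UTF-8, for example a lead byte that is not followed by the required continuation bytes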

  • in general, if you treat a file as if it contained text encoded in UTF-8 or Windows 1252, but it doesn't, you will lose and corrupt data
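
To make the remarks above concrete, here is a small sketch, again assuming Python's built-in codecs, that encodes the same text both ways and then deliberately decodes each result with the wrong encoding. The sample string is arbitrary.

```python
text = "café"

cp1252_bytes = text.encode("cp1252")  # é becomes the single byte 0xE9
utf8_bytes = text.encode("utf-8")     # é becomes the two bytes 0xC3 0xA9

# Pure ASCII text is byte-identical in both encodings.
ascii_identical = "cafe".encode("cp1252") == "cafe".encode("utf-8")

# UTF-8 bytes misread as Windows-1252: no error is raised, but the
# text is silently corrupted (each of the two bytes becomes its own character).
misread = utf8_bytes.decode("cp1252")

# Windows-1252 bytes misread as UTF-8: 0xE9 announces a multi-byte
# sequence that never arrives, so a strict decoder rejects the input.
try:
    cp1252_bytes.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

print(cp1252_bytes, utf8_bytes)
print(misread)
print(utf8_error)
```

The two failure modes differ: one direction fails loudly, the other quietly produces garbage, which is why guessing the encoding wrong is dangerous in a batch job.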

You can select the encoding of your files in your IDE or editor. It's recommended to go UTF-8 only. You will have to convert existing Windows 1252 files.
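
A batch conversion could look roughly like the sketch below. It is not a definitive tool: the function names `convert_cp1252_to_utf8` and `convert_tree` and the `*.java` default pattern are assumptions made for this example, and it presumes every matching file really is Windows-1252 to begin with.

```python
from pathlib import Path


def convert_cp1252_to_utf8(path):
    """Re-encode one file in place.

    Returns False and leaves the file untouched if it contains a byte
    that is unassigned in Windows-1252.
    """
    raw = path.read_bytes()
    try:
        # Strict decoding: raises on 0x81, 0x8D, 0x8F, 0x90, 0x9D.
        text = raw.decode("cp1252")
    except UnicodeDecodeError:
        return False
    path.write_bytes(text.encode("utf-8"))
    return True


def convert_tree(root, pattern="*.java"):
    """Convert every matching file under root; return the skipped files."""
    skipped = []
    for path in sorted(Path(root).rglob(pattern)):
        if not convert_cp1252_to_utf8(path):
            skipped.append(path)
    return skipped
```

Run it once only, and on a backup or under version control. A file that is already UTF-8 usually decodes as Windows-1252 without error, so a second pass would double-encode its non-ASCII characters instead of being rejected.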