Can I avoid using CP1252 on Windows?

peter.murray.rust picture peter.murray.rust · Dec 21, 2009 · Viewed 8.7k times · Source

I would like all my toolkit to use UTF-8 but find that some tools on Windows seem to use CP1252 (which appears to be Windows-specific). Does this create output which is incompatible and if so at which codepoints? If so, can I do anything about it?

(I don't completely understand the issues so I'd be grateful for basic education on these encodings).

Answer

Tim Pietzcker picture Tim Pietzcker · Dec 21, 2009

Six years old and still relevant: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Now, about your question: Yes, there are still tools out there that choke on UTF-8 files. But more and more tools are "getting it". If you're developing your own stuff, you might want to look into Python 3 where all strings are Unicode. The philosophy is to convert all your inputs into Unicode (if necessary) as early as possible, and reconvert them to a target encoding as late as possible. There are toolkits out there that will do a good job of guessing the encoding of a particular file (for example, Mark Pilgrim's chardet, a port of Mozilla's encoding detector). This is nice if you're working with files that don't specify an encoding.