How to Decode Scrambled Character Encoding: Special Character Encoding

balleyne picture balleyne · Jan 2, 2012 · Viewed 11.6k times · Source

I have data in CSV format that has been seriously scrambled character encoding wise, likely going back and forth between different software applications (LibreOffice Calc, Microsoft, Excel, Google Refine, custom PHP/MySQL software; on Windows XP, Windows 7 and GNU/Linux machines from various regions of the world...). It seems like somewhere in the process, non-ASCII characters have become seriously scrambled, and I'm not sure how to descramble them or detect a pattern. To do so manually would involve a few thousand records...

Here's an example. For "Trois-Rivières", when I open this portion of the CSV file in Python, it says:

Trois-Rivi\xc3\x83\xc2\x85\xc3\x82\xc2\xa0res

Question: through what process can I reverse

\xc3\x83\xc2\x85\xc3\x82\xc2\xa0

to get back

è

i.e. how can I unscramble this? How might this have become scrambled in the first place? How can I reverse engineer this bug?

Answer

Guy picture Guy · Jan 31, 2012

You can check the solutions that were offered in: Double-decoding unicode in python

Another simpler brute force solution is to create a mapping table between the small set of scrambled characters using regular expression (((\\\x[a-c0-9]{2}){8})) search on your input file. For a file of a single source, you should have less than 32 for French and less than 10 for German. Then you can run "find and replace" using this small mapping table.