I have CSV data whose character encoding has been seriously scrambled, likely from going back and forth between different software applications (LibreOffice Calc, Microsoft Excel, Google Refine, custom PHP/MySQL software; on Windows XP, Windows 7, and GNU/Linux machines from various regions of the world...). Somewhere in the process, non-ASCII characters became badly mangled, and I'm not sure how to descramble them or detect a pattern. Doing so manually would involve a few thousand records...
Here's an example. For "Trois-Rivières", when I open this portion of the CSV file in Python, it says:
Trois-Rivi\xc3\x83\xc2\x85\xc3\x82\xc2\xa0res
Question: through what process can I reverse
\xc3\x83\xc2\x85\xc3\x82\xc2\xa0
to get back
è
i.e. how can I unscramble this? How might this have become scrambled in the first place? How can I reverse engineer this bug?
You can check the solutions offered in: Double-decoding unicode in python
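For this particular sequence, the bytes behave as if "è" was written in cp850, mis-read as cp1252, and then re-encoded as UTF-8 twice more after being mis-read as Latin-1 each time. Here is a minimal sketch of peeling those layers back, assuming that chain (the innermost cp850/cp1252 step is an inference from the Windows machines involved, not something stated in the question):

    # Reverse this specific scrambling, assuming the chain described above.
    # Outer layers: UTF-8 bytes mis-read as Latin-1 and re-encoded as UTF-8 (twice).
    # Innermost layer: the cp850 byte for "è" (0x8a) mis-read as cp1252 ("Š").
    scrambled = b"Trois-Rivi\xc3\x83\xc2\x85\xc3\x82\xc2\xa0res"

    text = scrambled.decode("utf-8")                 # peel the outermost UTF-8 layer
    text = text.encode("latin-1").decode("utf-8")    # undo one Latin-1 mis-read
    text = text.encode("latin-1").decode("utf-8")    # ...and a second one
    text = text.encode("cp1252").decode("cp850")     # undo the cp1252/cp850 mix-up

    print(text)  # Trois-Rivières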
Another, simpler brute-force solution is to build a mapping table for the small set of scrambled characters, by running a regular-expression search such as ((\\x[a-f0-9]{2}){8})
on your input file. For a file from a single source, you should find fewer than 32 distinct scrambled sequences for French and fewer than 10 for German. Then you can run a "find and replace" using this small mapping table.
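If you read the file as raw bytes, both the scan and the replacement take only a few lines. A rough sketch, where input.csv, output.csv, and the contents of fix_table are placeholders you would fill in after inspecting the distinct sequences the scan prints (treating each run of non-ASCII bytes as one scrambled character is a heuristic, but it works when the surrounding text is plain ASCII):

    import re

    # Read the scrambled file as raw bytes so no decoding happens yet.
    with open("input.csv", "rb") as f:
        data = f.read()

    # List the distinct runs of non-ASCII bytes to see how big the
    # mapping table needs to be.
    for run in sorted(set(re.findall(rb"[\x80-\xff]{2,}", data))):
        print(run)

    # Hand-built mapping table (only the sequence from the question shown),
    # mapping each scrambled byte run to the correct UTF-8 bytes.
    fix_table = {
        b"\xc3\x83\xc2\x85\xc3\x82\xc2\xa0": "è".encode("utf-8"),
    }

    for bad, good in fix_table.items():
        data = data.replace(bad, good)

    with open("output.csv", "wb") as f:
        f.write(data)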