Hello there,
Even though I really tried, I'm stuck and somewhat desperate when it comes to Python, Windows, ANSI and character encoding. I need help, seriously... searching the web for the last few hours didn't help, it just drives me crazy.
I'm new to Python, so I have almost no clue what's going on. I'm learning the language, so my first program, which is almost done, should automatically generate music playlists from a given folder containing MP3s. That works just fine, besides one single problem...
...I can't write umlauts (äöü) to the playlist file.
After I found a solution for the "wrongly encoded" data in sys.argv, I was able to deal with that. When reading metadata from the MP3s, I'm using a simple character substitution to get rid of all those international special characters, like French accents or this crazy Scandinavian "o" with a slash in it (I don't even know how to type it...). All fine.
But I'd like to write at least the mentioned umlauts to the playlist file; those characters are really common here in Germany. And unlike the metadata, where I don't care about some missing characters or misspelled words, this is relevant, because now I'm writing the paths to the files.
I've tried so many different encoding and decoding methods that I can't list them all here... heck, I'm not even able to tell which settings I tried half an hour ago. I found code online, here and elsewhere, that seemed to work for some purposes. Not for mine.
I think the tricky part is this: it seems like the problem is the so-called ANSI format of the files I need to write. Correct, I actually need this ANSI stuff. About two hours ago I actually managed to write whatever I'd like to a UTF-8 file. Works like a charm... until I realized that my player (Winamp, old version) somehow doesn't work with those UTF-8 playlist files. It couldn't resolve the path, even though it looks right in my editor.
If I change the file format back to ANSI, paths containing special characters get corrupted. I'm just guessing, but if Winamp reads these UTF-8 files as ANSI, that would cause the problem I'm experiencing right now.
So...
- line.write(str.decode('utf-8')) breaks the function of the file.
- # -*- coding: iso-8859-1 -*- does nothing here (though it is helpful when it comes to the mentioned metadata and the allowed characters in it...).
Is there ANYONE who could guide me towards a way out of this encoding hell? Any help is welcome. If I need 500 lines of code for other functions or classes, I'll type them. If there's a module for handling such stuff, let me know! I'd buy it! Anything helpful will be tested.
Thank you for reading, thanks for any comment,
greets!
As mentioned in the comments, your question isn't very specific, so I'll try to give you some hints about character encodings. See if you can apply those to your specific case!
Here's a small primer about encoding. Basically, there are two ways to represent text in Python:
- unicode. You can consider that unicode is the ultimate encoding; you should strive to use it everywhere. In Python 2.x source files, unicode strings look like u'some unicode'.
- str. This is encoded text: to be able to read it, you need to know the encoding (or guess it). In Python 2.x, those strings look like 'some str'.
This changed in Python 3 (unicode is now str and str is now bytes).
Usually, it's pretty straightforward to ensure that your code uses unicode for its execution, and uses str for I/O:
- Whatever you receive as input, decode it with input_string.decode('encoding') to convert it to unicode.
- Whatever you output, encode it with output_string.encode('encoding').
The most common encodings are cp1252 on Windows (on US or Western European systems), and utf-8 on Linux.
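As a rough illustration of that decode-on-input, encode-on-output rule, here is a minimal sketch. The sample bytes and variable names are made up for the example, and it is written so that it runs on both Python 2 and Python 3:

```python
# Sketch only: the sample bytes and names below are invented for illustration.

# Bytes as they might arrive from a cp1252-encoded source
# (0xFC is the u-umlaut in cp1252).
raw_input_bytes = b'M\xfcller'

# Decode once, at the input boundary, to get a unicode string.
text = raw_input_bytes.decode('cp1252')

# Inside the program, work with unicode only.
greeting = u'Hallo ' + text

# Encode once, at the output boundary, in whatever encoding the consumer expects.
as_ansi = greeting.encode('cp1252')   # one byte per character here
as_utf8 = greeting.encode('utf-8')    # the umlaut becomes two bytes
```

The same text ends up as different bytes depending on the target encoding, which is why a reader that assumes the wrong one shows garbage.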
Windows natively uses unicode for file paths and names, so you should actually always use unicode for those.
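For collecting the files, that could look roughly like the sketch below. The helper name list_mp3_paths is hypothetical. The point is only that in Python 2, os.listdir() returns unicode names when it is given a unicode path (in Python 3 it does so for any str path):

```python
import os

def list_mp3_paths(folder):
    """Return full paths of the .mp3 files in `folder` as unicode strings.

    `folder` should itself be a unicode string (u'...' in Python 2), so that
    os.listdir() hands back unicode names instead of encoded byte strings.
    """
    names = sorted(os.listdir(folder))
    return [os.path.join(folder, name) for name in names
            if name.lower().endswith(u'.mp3')]
```

A call would then be something like list_mp3_paths(u'C:\\Music'), where the folder is of course just an example.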
When you write to the file, be sure to always run your output through output.encode('cp1252') (or whatever encoding ANSI would be on your system).
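One way to avoid calling encode() by hand on every line is to let the file object do it. This is a sketch, not the only option: the function name write_playlist is hypothetical, and cp1252 is an assumption about what "ANSI" means on your machine. io.open() exists in Python 2.6+ and Python 3:

```python
import io

def write_playlist(playlist_path, entries, encoding='cp1252'):
    """Write unicode path strings to a playlist file, one per line.

    io.open() encodes on the way out, so callers only ever pass unicode.
    A character that the target encoding cannot represent raises
    UnicodeEncodeError instead of silently corrupting the path.
    """
    with io.open(playlist_path, 'w', encoding=encoding) as playlist:
        for entry in entries:
            playlist.write(entry + u'\n')
```

If a path contains a character outside cp1252, the exception at least tells you which entry the player would not have been able to resolve anyway.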
By now you probably realized what happens in line.write(str.decode('utf-8')):
- If your str variable is indeed a str instance, Python will convert it to unicode using the utf-8 encoding, but then try to encode it again (likely in ascii) to write it to the file.
- If your str variable is actually a unicode instance, Python will first encode it (likely in ascii, and that will probably crash) to then be able to decode it.
Bottom line is, you need to know what your str variable holds. If it is unicode, you should encode it. If it's already encoded, don't touch it (or decode it, then encode it, if the encoding is not the one you want!).
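That bottom line can be written down as a small helper. This is only a sketch: the name to_ansi_bytes and the default encodings are assumptions, and you still have to know (or find out) what encoding your byte strings really are in:

```python
TEXT_TYPE = type(u'')  # unicode in Python 2, str in Python 3

def to_ansi_bytes(value, source_encoding='utf-8', target_encoding='cp1252'):
    """Return `value` as bytes in `target_encoding`.

    Unicode text is encoded directly. Byte strings are assumed to be in
    `source_encoding`; they are decoded first, then re-encoded.
    """
    if isinstance(value, TEXT_TYPE):
        return value.encode(target_encoding)
    return value.decode(source_encoding).encode(target_encoding)
```

The result is a byte string, so it belongs in a file opened in binary mode ('wb'), where Python will not try to encode it a second time.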
It's no surprise that the # -*- coding: iso-8859-1 -*- line does nothing here: it only tells Python what encoding should be used to read your source file, so that non-ASCII characters in it are properly recognized.
Python 3 is probably a big update in terms of unicode and encoding, but that doesn't mean Python 2.x can't make it work!
You can't be sure; it's possible that the problem lies in the player you're using, not in your code.
Once you have written the playlist, you should make sure that your script's output is readable using reference tools (such as Windows Explorer). If it is, but the player still can't open it, you should consider updating to a newer version.
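In addition to checking by hand, you could read the playlist back and test whether each entry resolves to a real file. The sketch below assumes one path per line and a cp1252-encoded playlist. The function name missing_entries is made up for the example:

```python
import io
import os

def missing_entries(playlist_path, encoding='cp1252'):
    """Return the playlist lines that do not point to an existing file."""
    with io.open(playlist_path, 'r', encoding=encoding) as playlist:
        lines = [line.strip() for line in playlist]
    # Skip blank lines and extended-M3U comment lines starting with '#'.
    return [line for line in lines
            if line and not line.startswith(u'#') and not os.path.exists(line)]
```

An empty result means every path decoded from the file exists on disk, which points towards the player rather than your script if it still refuses to load the list.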