I'm using mako templates to generate specialized config files. Some of these files contain extended ASCII chars (>127), but mako chokes saying that the chars are out of range when I use:
## -*- coding: ascii -*-
So I'm wondering if perhaps there's something like:
## -*- coding: eascii -*-
That I can use that will be ok with the range(128, 256) chars.
EDIT:
Here's the dump of the offending section of the file:
000001b0 39 c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce |9...............|
000001c0 cf d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 da db dc dd de |................|
000001d0 df e0 e1 e2 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee |................|
000001e0 ef f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 fa fb fc fd fe |................|
000001f0 ff 5d 2b 28 27 73 29 3f 22 0a 20 20 20 20 20 20 |.]+('s)?". |
00000200 20 20 74 6f 6b 65 6e 3a 20 57 4f 52 44 20 20 20 | token: WORD |
00000210 20 20 22 5b 41 2d 5a 61 2d 7a 30 2d 39 c0 c1 c2 | "[A-Za-z0-9...|
00000220 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce cf d0 d1 d2 |................|
00000230 d3 d4 d5 d6 d7 d8 d9 da db dc dd de df e0 e1 e2 |................|
00000240 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee ef f0 f1 f2 |................|
00000250 f3 f4 f5 f6 f7 f8 f9 fa fb fc fd fe ff 5d 2b 28 |.............]+(|
The first character that mako complains about is 000001b4. If I remove this section, everything works fine. With the section inserted, mako complains:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
It's the same complaint whether I use 'ascii' or 'latin-1' in the magic comment line.
Thanks!
Greg
Short answer
Use cp437 as the encoding for some retro DOS fun. All byte values greater than or equal to 32 decimal, except 127, map to displayable characters in this encoding. Then use cp037 as the encoding for a truly trippy time. And then ask yourself how you really know which of these, if either, is "correct".
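A quick way to see which codecs even accept the byte range from your dump (a sketch; the byte values 0xc0–0xff are taken from the hex dump above):

```python
# Try decoding the byte range 0xc0-0xff (from the dump) with each codec.
raw = bytes(range(0xc0, 0x100))

for enc in ("ascii", "latin-1", "cp437", "cp037"):
    try:
        text = raw.decode(enc)
        print(f"{enc:8} -> {text[:8]}...")
    except UnicodeDecodeError as exc:
        print(f"{enc:8} -> fails: {exc}")
```

ascii is the only one that fails here; latin-1, cp437, and cp037 each map every one of these bytes to a character — to three entirely different sets of characters, which is the point.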
Long answer
There is something you must unlearn: the absolute equivalence of byte values and characters.
Many basic text editors and debugging tools today, and also the Python language specification, imply an absolute equivalence between bytes and characters when in reality none exists. It is not true that 74 6f 6b 65 6e is "token". Only for ASCII-compatible character encodings is this correspondence valid. In EBCDIC, which is still quite common today, "token" corresponds to the byte values a3 96 92 85 95.
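This is easy to check from Python itself, using the cp037 codec as a stand-in for EBCDIC:

```python
# "token" under an ASCII-compatible codec vs. under EBCDIC (cp037).
ascii_bytes = "token".encode("ascii")
ebcdic_bytes = "token".encode("cp037")

print(ascii_bytes.hex(" "))   # 74 6f 6b 65 6e
print(ebcdic_bytes.hex(" "))  # a3 96 92 85 95
```

Same five characters, two completely different byte sequences.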
So while the Python 2.6 interpreter happily evaluates 'text' == u'text' as True, it shouldn't, because they are only equivalent under the assumption of ASCII or a compatible encoding, and even then they should not be considered equal. (At least '\xfd' == u'\xfd' is False and gets you a warning for trying.) Python 3.1 evaluates 'text' == b'text' as False. But even the acceptance of this expression by the interpreter implies an absolute equivalence of byte values and characters, because the expression b'text' is taken to mean "the byte-string you get when you apply the ASCII encoding to 'text'" by the interpreter.
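On any modern Python 3 you can observe this directly (the behavior is the same as described for 3.1):

```python
# In Python 3, str and bytes never compare equal; decoding must be explicit.
print('text' == b'text')                  # False
print(b'text'.decode('ascii') == 'text')  # True
print(b'text'.decode('cp037') == 'text')  # False: same bytes, other codec
```

The third line is the whole argument in miniature: the bytes only become "text" once you commit to a codec.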
As far as I know, every programming language in widespread use today carries an implicit use of ASCII or ISO-8859-1 (Latin-1) character encoding somewhere in its design. In C, the char data type is really a byte. I saw one Java 1.4 VM where the constructor java.lang.String(byte[] data) assumed ISO-8859-1 encoding. Most compilers and interpreters assume ASCII or ISO-8859-1 encoding of source code (some let you change it). In Java, string length is really the UTF-16 code unit length, which is arguably wrong for characters U+10000 and above. In Unix, filenames are byte-strings interpreted according to terminal settings, allowing you to open('a\x08b', 'w').write('Say my name!').
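The UTF-16 point can be illustrated from Python, which counts code points rather than code units (Java's String.length() would report 2 for this same character):

```python
# U+10000 is one character, but two UTF-16 code units (a surrogate pair).
ch = '\U00010000'
print(len(ch))                           # 1 code point
print(len(ch.encode('utf-16-be')) // 2)  # 2 UTF-16 code units
```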
So we have all been trained and conditioned by the tools we have learned to trust, to believe that 'A' is 0x41. But it isn't. 'A' is a character and 0x41 is a byte and they are simply not equal.
Once you have become enlightened on this point, you will have no trouble resolving your issue. You have simply to decide what component in the software is assuming the ASCII encoding for these byte values, and how to either change that behavior or ensure that different byte values appear instead.
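For the second option — making different byte values appear — one sketch is to transcode the data into an escaped, ASCII-only form before the ASCII-assuming component sees it (here using latin-1 as the assumed source encoding, which maps every byte to a character):

```python
# Reinterpret the offending byte range as characters, then re-encode into
# an ASCII-only byte string using backslash escapes.
raw = bytes(range(0xc0, 0x100))                  # the bytes from the dump
text = raw.decode("latin-1")                     # bytes -> characters
safe = text.encode("ascii", "backslashreplace")  # ASCII-only byte string
print(safe[:12])                                 # b'\\xc0\\xc1\\xc2'
```

Whether escaping is acceptable depends on what consumes the config file; the first option — telling the component the real encoding — is usually cleaner when it is available.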
PS: The phrases "extended ASCII" and "ANSI character set" are misnomers.