Python Comparison of byte literals

Matthew Hemke · Jul 19, 2014 · Viewed 36.2k times

The following question arose because I was trying to use byte strings as dictionary keys, and bytes values that I understood to be equal weren't being treated as equal.

Why doesn't the following Python code compare equal? Aren't these two equivalent representations of the same binary data (example deliberately chosen to avoid endianness)?

b'0b11111111' == b'0xff'

I know the following evaluates true, demonstrating the equivalence:

int(b'0b11111111', 2) == int(b'0xff', 16)

But why does Python force me to know the representation? Is it related to endianness? Is there some easy way to force these to compare equal other than converting them all to, e.g., hex literals? Can anyone suggest a transparent and clear method to move between all representations in a (somewhat) platform-independent way (or am I asking too much)?

Edit:

Given the comments below, say I want to actually index a dictionary using 8 bits in the form b'0b11111111'. Why does Python expand it to ten bytes, and how do I prevent that?

This is a smaller piece of a large tree data structure, and expanding my indexing from 8 bits to 80 seems like a huge waste of memory.

Answer

Martijn Pieters · Jul 19, 2014

Bytes can represent any number of things. Python cannot and will not guess at what your bytes might encode.

For example, int(b'0b11111111', 34) is also a valid interpretation, but that interpretation is not equal to hex FF.

The number of interpretations, in fact, is endless. The bytes could represent a series of ASCII codepoints, or image colors, or musical notes.
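
As a rough illustration of the point above, the same ten bytes yield different results depending on which interpretation is applied. The variable names here are only for illustration, and the four interpretations are arbitrary picks among many:

```python
raw = b'0b11111111'

# Interpreted as a binary integer literal (the '0b' prefix is allowed with base 2).
as_base2 = int(raw, 2)

# Interpreted as base-34 digits, where 'b' is simply the digit 11.
as_base34 = int(raw, 34)

# Interpreted as one big-endian unsigned integer built from the raw byte values.
as_big_endian = int.from_bytes(raw, 'big')

# Interpreted as ASCII text.
as_text = raw.decode('ascii')

print(as_base2)                    # 255
print(as_base34 == as_base2)       # False
print(as_big_endian == as_base2)   # False
print(as_text)                     # 0b11111111
```

None of these is more "correct" than the others; which one applies depends entirely on what the bytes were meant to encode.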

Until you explicitly apply an interpretation, the bytes object consists of just a sequence of values in the range 0-255, and the textual representation of those bytes uses ASCII characters wherever a byte is representable as printable text:

>>> list(bytes(b'0b11111111'))
[48, 98, 49, 49, 49, 49, 49, 49, 49, 49]
>>> list(bytes(b'0xff'))
[48, 120, 102, 102]

Those byte sequences are not equal.
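
This also bears on the edit in the question: Python does not expand b'0b11111111' to ten bytes; the literal is ten ASCII characters as written. A single byte holding the value 255 is spelled differently. A minimal sketch of the difference, assuming Python 3:

```python
# The four ASCII characters '0', 'x', 'f', 'f' ...
four_chars = b'0xff'
# ... versus a single byte holding the value 255.
one_byte = bytes([255])

print(len(four_chars))    # 4
print(len(one_byte))      # 1
print(one_byte)           # b'\xff'
print(four_chars == one_byte)                     # False
print(bytes([48, 120, 102, 102]) == four_chars)   # True
```

Equality on bytes objects compares the stored values element by element, so only sequences with identical contents compare equal.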

If you want to interpret these sequences explicitly as integer literals, then use ast.literal_eval() to interpret the decoded text values; always normalise before comparing:

>>> import ast
>>> ast.literal_eval(b'0b11111111'.decode('utf8'))
255
>>> ast.literal_eval(b'0xff'.decode('utf8'))
255
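
For the dictionary use case in the question, one possible approach is to normalise every key before it touches the dictionary. The sketch below assumes the keys always hold ASCII integer literals; normalise_key is a hypothetical helper name, not part of any library:

```python
import ast

def normalise_key(raw):
    """Hypothetical helper: parse bytes holding an integer literal into an int."""
    value = ast.literal_eval(raw.decode('ascii'))
    # literal_eval also accepts strings, tuples, booleans, etc.; reject those.
    if not isinstance(value, int) or isinstance(value, bool):
        raise ValueError('not an integer literal: %r' % (raw,))
    return value

table = {}
table[normalise_key(b'0b11111111')] = 'leaf'

print(table[normalise_key(b'0xff')])   # leaf
print(normalise_key(b'0o377'))         # 255

# If a compact bytes key is preferred, the integer fits in a single byte.
compact = normalise_key(b'0xff').to_bytes(1, 'big')
print(compact)                         # b'\xff'
```

Using the integer itself as the key is probably the simplest option; the single-byte form is only worth it if the keys must remain bytes objects.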