Is this an appropriate use of python's built-in hash function?

wim picture wim · Oct 4, 2011 · Viewed 8.5k times · Source

I need to compare large chunks of data for equality, and I need to compare many pairs per second, fast. Each object is guaranteed to be the same length, it is possible and likely that there may only be slight differences located at unknown positions.

Timings below show that using == operator is very fast if there is a difference near the start of the data, and significantly slower if differences are located towards the end.

>>> import os
>>> s = os.urandom(1600*1200 - 1)
>>> Aimg = b"A" + s
>>> Bimg = b"B" + s
>>> img1 = s + b"1"
>>> img2 = s + b"2"
>>> %timeit Aimg == Bimg
61.8 ns ± 0.484 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit img1 == img2
159 µs ± 2.83 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In my use case, differences may be located towards the middle or the end of the bytes (context: it is uncompressed image data). I looked for a way to speed things up using a hash or checksum. Using md5 was slower but Python's built-in hash did actually speed things up.

>>> %timeit img1 == img2
160 µs ± 5.96 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit hash(img1) == hash(img2) and img1 == img2
236 ns ± 5.91 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

I'm interested in technical detail of this hash, is it sufficiently hash-like that when hash(a) == hash(b) then a == b is very likely? False positives are acceptable if a hash collision is reasonably rare, the intention is a fast-path to speed up comparisons in the average case.

Answer

phihag picture phihag · Oct 4, 2011

Python's hash function is designed for speed, and maps into a 64-bit space. Due to the birthday paradox, this means you'll likely get a collision at about 5 billion entries (probably way earlier, since the hash function is not cryptographical). Also, the precise definition of hash is up to the Python implementation, and may be architecture- or even machine-specific. Don't use it you want the same result on multiple machines.

md5 is designed as a cryptographic hash function; even slight perturbations in the input totally change the output. It also maps into a 128-bit space, which makes it unlikely you'll ever encounter a collision at all unless you're specifically looking for one.

If you can handle collisions (i.e. test for equality between all members in a bucket, possibly by using a cryptographic algorithm like MD5 or SHA2), Python's hash function is perfectly fine.

One more thing: To save space, you should store the data in binary form if you write it to disk. (i.e. struct.pack('!q', hash('abc')) / hashlib.md5('abc').digest()).

As a side note: is is not equivalent to == in Python. You mean ==.