Fast hash for strings

Antonio picture Antonio · Feb 24, 2014 · Viewed 22.7k times · Source

I have a set of ASCII strings, let's say they are file paths. They could be both short and quite long.

I'm looking for an algorithm that could calculate hash of such a strings and this hash will be also a string, but will have a fixed length, like youtube video ids:

https://www.youtube.com/watch?v=-F-3E8pyjFo
                                ^^^^^^^^^^^

MD5 seems to be what I need, but it is critical for me to have a short hash strings.

Is there a shell command or python library which can do that?

Answer

Erik Aronesty picture Erik Aronesty · May 29, 2018

Python has a built-in hash() function that's very fast and perfect for most uses:

>>> hash("dfds")
3591916071403198536

You can then make it unsigned:

>>> hashu=lambda word: ctypes.c_uint64(hash(word)).value

You can then turn it into a 16 byte hex string:

>>> hashu("dfds").to_bytes(8,"big").hex()

Or an N*2 byte string, where N is <= 8:

>>> hashn=lambda word, N  : (hashu(word)%(2**(N*8))).to_bytes(N,"big").hex()

..etc. And if you want N to be larger than 8 bytes, you can just hash twice. Python's built-in is so vastly faster, it's never worth using hashlib for anything unless you need security... not just collision resistance.

>>> hashnbig=lambda word, N  : ((hashu(word)+2**64*hashu(word+"2"))%(2**(N*8))).to_bytes(N,"big").hex()

And finally, use the urlsafe base64 encoding to make a much better string than "hex" gives you

>>> hashnbigu=lambda word, N  : urlsafe_b64encode(((hashu(word)+2**64*hash(word+"2"))%(2**(N*8))).to_bytes(N,"big")).decode("utf8").rstrip("=")
>>> hashnbigu("foo",16)
'ZblnvrRqHwAy2lnvrR4HrA'

Caveats:

  • Be warned that in Python 3.3 and up, this function is randomized and won't work for some use cases. You can disable this with PYTHONHASHSEED=0

  • See https://github.com/flier/pyfasthash for fast, stable hashes that won't break your CPU for non-cryptographic applications.

  • Don't use this lambda style in real code... write it out! And stuffing things like 2**32 in your code, instead of making them constants is bad form.

  • In the end 8 bytes of collision resistance is OK for a smaller applications.... with less than a million entries, you've got collision odds of < 0.0000001%. That's a 12 byte b64 encoded string. But it might not be enough for larger apps.

  • 16 bytes is enough for a UUID/OID in a cache, etc.