How are hash functions like MD5 unique?

Aly picture Aly · Mar 15, 2010 · Viewed 50.3k times · Source

I'm aware that MD5 has had some collisions but this is more of a high-level question about hashing functions.

If MD5 hashes any arbitrary string into a 32-digit hex value, then according to the Pigeonhole Principle surely this can not be unique, as there are more unique arbitrary strings than there are unique 32-digit hex values.

Answer

Mike Cargal picture Mike Cargal · Mar 15, 2010

You're correct that it cannot guarantee uniqueness, however there are approximately 3.402823669209387e+38 different values in a 32 digit hex value (16^32). That means that, assuming the math behind the algorithm gives a good distribution, your odds are phenomenally small that there will be a duplicate. You do have to keep in mind that it IS possible to duplicate when you're thinking about how it will be used. MD5 is generally used to determine if something has been changed (I.e. it's a checksum). It would be ridiculously unlikely that something could be modified and result in the same MD5 checksum.

Edit: (given recent news re: SHA1 hashes) The answer above, still holds, but you shouldn't expect an MD5 hash to serve as any kind of security check against manipulation. SHA-1 Hashes as 2^32 (over 4 billion) times less likely to collide, and it has been demonstrated that it is possible to contrive an input to produce the same value. (This was demonstrated against MD5 quite some time ago). If you're looking to ensure nobody has maliciously modified something to produce the same hash value, these days, you need at SHA-2 to have a solid guarantee.

On the other hand, if it's not in a security check context, MD5 still has it's usefulness.

The argument could be made that an SHA-2 hash is cheap enough to compute, that you should just use it anyway.