What's the shortest pair of strings that causes an MD5 collision?

Alf Eaton picture Alf Eaton · Jan 4, 2010 · Viewed 21.6k times · Source

Up to what string length is it possible to use MD5 as a hash without having to worry about the possibility of a collision?

This would presumably be calculated by generating an MD5 hash for every possible string in a particular character set, in increasing length, until a hash appears for a second time (a collision). The maximum possible length of a string without a collision would then be one character less than the longest of the colliding pair.

Has this already been tested for MD5, SHA1, etc?

Answer

intgr picture intgr · Dec 7, 2010

Update

Ironically, a few weeks after I posted the previous answer, two Chinese researchers, Tao Xie and Dengguo Feng, published a new single-block collision for MD5. I was unaware of that paper until now. A single MD5 block means that the input size is 64 bytes or 512 bits. Note that the inputs are mostly the same, differing only in 2 bits.

Their methodology won't be published until January 2013, but their collision can be verified now, using numbers from the paper:

>>> from array import array
>>> from hashlib import md5
>>> input1 = array('I',  [0x6165300e,0x87a79a55,0xf7c60bd0,0x34febd0b,0x6503cf04,
    0x854f709e,0xfb0fc034,0x874c9c65,0x2f94cc40,0x15a12deb,0x5c15f4a3,0x490786bb,
    0x6d658673,0xa4341f7d,0x8fd75920,0xefd18d5a])
>>> input2 = array('I', [x^y for x,y in zip(input1,
    [0, 0, 0, 0, 0, 1<<10, 0, 0, 0, 0, 1<<31, 0, 0, 0, 0, 0])])
>>> input1 == input2
False
>>> md5(input1).hexdigest()
'cee9a457e790cf20d4bdaa6d69f01e41'
>>> md5(input2).hexdigest()
'cee9a457e790cf20d4bdaa6d69f01e41'

Update: The paper has been published in March 2013: Tao Xie and Fanbao Liu and Dengguo Feng - Fast Collision Attack on MD5

However, if you have more room to play with, collisions of a few kilobytes are MUCH faster to calculate -- they can be calculated within hours on ANY regular computer.

Old answer

The previous shortest collision used at least two MD5 blocks worth of input -- that's 128 bytes, 1024 bits. A prefix in the first block can be chosen arbitrarily by the attacker, the rest would be computed and appear as gibberish.

Here's an example of two different colliding inputs, you can try it yourself in Python:

>>> from binascii import unhexlify
>>> from hashlib import md5
>>> input1 = 'Oded Goldreich\nOded Goldreich\nOded Goldreich\nOded Go' + unhexlify(
... 'd8050d0019bb9318924caa96dce35cb835b349e144e98c50c22cf461244a4064bf1afaecc582'
... '0d428ad38d6bec89a5ad51e29063dd79b16cf67c12978647f5af123de3acf844085cd025b956')
>>> len(input1)
128
>>> md5(input1).hexdigest()
'd320b6433d8ebc1ac65711705721c2e1'
>>> input2 = 'Neal Koblitz\nNeal Koblitz\nNeal Koblitz\nNeal Koblitz\n' + unhexlify(
... '75b80e0035f3d2c909af1baddce35cb835b349e144e88c50c22cf461244a40e4bf1afaecc582'
... '0d428ad38d6bec89a5ad51e29063dd79b16cf6fc11978647f5af123de3acf84408dcd025b956')
>>> md5(input2).hexdigest()
'd320b6433d8ebc1ac65711705721c2e1'

Generating these two particular inputs took 2 days on a 215-node Playstation 3 cluster, by Mark Stevens :)