I am looking for some precise math on the likelihood of collisions for MD5, SHA1, and SHA256 based on the birthday paradox.
I am looking for something like a graph that says "If you have 10^8 keys, this is the probability. If you have 10^13 keys, this is the probability and so on"
I have looked at tons of articles but I am having a tough time finding something that gives me this data. (Ideal option for me would be a formula or code that calculates this for any provided hash size)
Let's imagine we have a truly random hash function that hashes from strings to n-bit numbers. This means that there are 2n possible hash codes, and each string's hash code is chosen uniformly at random from all of those possibilities.
The birthday paradox specifically says that once you've seen roughly √(2k) items, there's a 50% chance of a collision, where k is the number of distinct possible outputs. In the case where the hash function hashes to an n-bit output, this means that you'll need roughly 2n/2 hashes before you get a collision. This is why we typically pick hashes that output 256 bits; it means that we'd need a staggering 2128 ≈1038 items hashed before there's a "reasonable" chance of a collision. With a 512-bit hash, you'd need about 2256 to get a 50% chance of a collision, and 2256 is approximately the number of protons in the known universe.
The exact formula for the probability of getting a collision with an n-bit hash function and k strings hashed is
1 - 2n! / (2kn (2n - k)!)
This is a fairly tricky quantity to work with directly, but we can get a decent approximation of this quantity using the expression
1 - e-k2/2n+1
So, to get (roughly) a probability p chance of a collision, we can solve to get
p ≈ 1 - e-k2/2n+1
1 - p ≈ e-k2/2n+1
ln(1 - p) ≈ -k2/2n+1
-ln(1 - p) ≈ k2/2n+1
-2n+1 ln(1 - p) ≈ k2
2(n+1)/2 √(-ln(1 - p)) ≈ k
As one last approximation, assume we're dealing with very small choices of p. Then ln(1 - p) ≈ -p, so we can rewrite this as
k ≈ 2(n+1)/2 √p
Notice that there's still a monster 2(n+1)/2 term here, so for a 256-bit hash that leading term is 2128.5, which is just enormous. For example, how many items must we see to get a 2-50 chance of a collision with a 256-bit hash? That would be approximately
2(256+1)/2 √2-50
= 2257/2 2-50/2
= 2207/2
= 2103.5.
So you'd need a staggeringly huge number of hashes to have a vanishingly small chance of getting a collision. Figure that 2103.5 is about 1031, which at one nanosecond per hash computed would take you longer than the length of the universe to compute. And after all that, you'd get a success probability of 2-50, which is about 10-15.
In fact, this precisely why we pick such large numbers of bits for our hashes! It makes it extremely unlikely for a collision to occur by chance.
(Note that the hash functions we have today aren't actually truly random functions, which is why people advise against using MD5, SHA1, and others that have had security weaknesses exposed.)
Hope this helps!