"Fastest" hash function implemented in Java, comparing part of file

carloscloud picture carloscloud · Apr 12, 2011 · Viewed 8.8k times · Source

I need to compare two different files of the instance "File" in Java and want to do this with a fast hash function.

Idea: - Hashing the 20 first lines in File 1 - Hashing the 20 first lines in File 2 - Compare the two hashes and return true if those are equal.

I want to use the "fastest" hash function ever been implemented in Java. Which one would you choose?

Answer

Simon G. picture Simon G. · Apr 12, 2011

If you want speed, do not hash! Especially not a cryptographic hash like MD5. These hashes are designed to be impossible to reverse, not fast to calculate. What you should use is a Checksum - see java.util.zip.Checksum and its two concrete implementations. Adler32 is extremely fast to compute.

Any method based on checksums or hashes is vulnerable to collisions, but you can minimise the risk by using two different methods in the way RSYNC does.

The algorithm is basically:

  • Check file sizes are equal
  • Break the files into chunks of size N bytes
  • Compute checksum on each pair of matching blocks and compare. Any differences prove files are not the same.

This allows for early detection of a difference. You can improve it by computing two checksums at once with different algorithms, or different block sizes.

More bits in the result mean less chance of a collision, but as soon as you go over 64 bits you are outside what Java (and the computer's CPU) can handle natively and hence get slow, so FNV-1024 is less likely to give you a false negative but is much slower.

If it is all about speed, just use Adler32 and accept that very rarely a difference will not be detected. It really is rare. Checksums like these are used to ensure the internet can spot transmission errors, and how often do you get the wrong data turning up?

It it is all about accuracy really, you will have to compare every byte. Nothing else will work.

If you can compromise between speed and accuracy, there is a wealth of options out there.