I've got the following question about choosing hash functions for Bloom filters:
In nearly every document/paper you can read that the hash functions used in a Bloom filter should be independent and uniformly distributed.
I know what is meant by this (independent and uniformly distributed), but I'm having trouble to find a argumentation or a discussion, which hash functions fulfill those requirements and are therefore suitable. In a lot of posts I've read about suggestions for the usage of the FNV or Murmur hash function, but not why (or at least without a proof) they are suitable.
Thanks in advance!
I asked myself the same question when building a Java Bloom filter library. See the Github readme for a detailed treatment of my analysis of hash functions for Bloom filters.
I looked at the problem from two perspectives:
Speed can easily be measured by benchmarks on random input. Uniformity is a bit harder and requires some statistics. Using Chi-Square goodness of fit tests I measured how similar the distribution of hash values is to a uniform distribution.
The result is:
If your implementation is using Java I would recommend using our Bloom filter hash library. It is well documented and thoroughly tested. For the details, including the benchmark results for different hash function and their unformity according to Chi-Square test, see the Github readme of the repo.