Wikipedia says:
An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution.
I read the article, but what I don't understand is how k is determined. Is it a function of the table size?
Also, in hash tables I've written I used a simple but effective algorithm for automatically growing the hash's size. Basically, if ever more than 50% of the buckets in the table were filled, I would double the size of the table. I suspect you might still want to do this with a bloom filter to reduce false positives. Correct ?
Given:
n
: how many items you expect to have in your filter (e.g. 216,553)p
: your acceptable false positive rate {0..1} (e.g. 0.01
→ 1%)we want to calculate:
m
: the number of bits needed in the bloom filterk
: the number of hash functions we should applyThe formulas:
m = -n*ln(p) / (ln(2)^2)
the number of bits
k = m/n * ln(2)
the number of hash functions
In our case:
m
= -216553*ln(0.01) / (ln(2)^2)
= 997263 / 0.48045
= 2,075,686
bits (253 kB)k
= m/n * ln(2)
= 2075686/216553 * 0.693147
= 6.46
hash functions (7 hash functions)Note: Any code released into public domain. No attribution required.