Bloom filter usage

algorithm data-structures bloom-filter

Bober02 · Jan 18, 2013 · Viewed 7.4k times · Source

I am struggling to understand the usefulness of the bloom filter. I get its underlying logic, space compaction, fast lookups, false positives etc. I just cannot put that concept into a real-life situation as being beneficial. One frequent application is use of bloom filters in web caching. We use bloom filter to determine whether a given URL is in the cache or not. Why don't we simply access the cache to determine that? If we get a yes, we still need to go to cache to retrieve the webpage (which might not be there), but in case of a no, we could have got the same answer using the cache (which is probably optimized for fast lookups anyway?).

Answer

Bloom filters are designed for situations where a false negative is a Very Bad Thing and a false positive is acceptable.

For example, suppose that you are making a web browser and have a known blacklist of scam websites. Your blacklist is massive - in the hundreds of gigabytes - so you can't ship it with the browser. However, you can store it on your own servers. In that case, you could ship the browser with a Bloom filter of an appropriate size that holds all the URLs. Before visiting a site, you look it up in the filter. Then, if you get a "no" answer, you're guaranteed that the URL is not blacklisted and can just visit the site. If you get a "yes" answer, the site might be evil, so you can have the browser call up your main server to get the real answer. The fact that you can save a huge number of calls to the server here without ever sacrificing accuracy is important.

The cache idea is similar to this setup. You can query the filter to see if the page is in the cache. If you get a "no" answer, you're guaranteed it's not cached and can do an expensive operation to pull the data from the main source. Otherwise, you can then check the cache to see if it really is there. In rare instances you might need to check the cache, see that it isn't there, then pull from the main source, but you will never accidentally miss something really in cache.

Hope this helps!

Bloom filter usage

Answer

Related questions