Fast disk-based hashtables?

taw · Jan 30, 2009

I have sets of hashes (the first 64 bits of MD5, so they're very uniformly distributed) and I want to be able to check whether a new hash is in a set, and to add it to a set.

The sets aren't too big; the largest will have millions of elements, but there are hundreds of sets, so I cannot hold them all in memory.

Some ideas I had so far:

  • I tried just keeping it all in an SQLite table, but it becomes really, really slow once it can no longer fit everything in memory.
  • Bloom filters sound like they would have too high an error rate. I don't mind a tiny error rate (a 64-bit hash already gives on the order of one expected collision for a 4G-element set; see the quick check after this list), but error rates like 1% are far too high.
  • Keep a sorted list of hashes with gaps in a file, and resize when there aren't enough gaps left. The hashes are uniformly distributed, so even a very simple scheme like that should work.
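
By the birthday bound, the expected number of colliding pairs among n uniform 64-bit hashes is roughly n(n-1)/2 divided by 2^64, so a quick back-of-the-envelope check in Python looks something like this:

    n = 4 * 10**9                                  # a "4G element" set
    expected_collisions = n * (n - 1) / 2 / 2**64
    print(expected_collisions)                     # ~0.43, i.e. on the order of one collision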

Am I missing something really obvious? Any hints on how to implement a good disk-based hashtable?

Answer

taw · Feb 3, 2009

Here's the solution I eventually used:

  • One file per set
  • The file contains 2^k buckets, each 256 bytes, i.e. 32 entries of 8 bytes
  • Empty entries are simply zeroed out (000... is a valid hash, but I don't care about the 2^-64 chance of colliding with it, since everything can already collide with everything else by the nature of hashing)
  • Every hash resides in the bucket selected by its first k bits
  • If any bucket overflows, double the file size and split every bucket
  • Everything is accessed via mmap(), not read()/write() (a rough sketch of this layout follows the list)
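
To make the layout concrete, here is a rough sketch of the scheme in Python (the real code is low-level Perl; the class name, the initial bucket count, and the rebuild-on-overflow in _grow() are simplifications for illustration, since the real code splits each bucket in place rather than rebuilding):

    import mmap
    import os
    import struct

    BUCKET_BYTES = 256          # one bucket = 32 slots of 8 bytes
    SLOT = struct.Struct('>Q')  # one 64-bit hash per slot; all-zero means "empty"

    class DiskHashSet:
        """One file per set: 2**k buckets, the hash's first k bits pick the bucket."""

        def __init__(self, path, initial_k=4):
            if not os.path.exists(path):
                with open(path, 'wb') as f:
                    f.truncate(BUCKET_BYTES << initial_k)   # 2**k zero-filled buckets
            self.f = open(path, 'r+b')
            self._remap()

        def _remap(self):
            self.map = mmap.mmap(self.f.fileno(), 0)        # map the whole file
            buckets = len(self.map) // BUCKET_BYTES
            self.k = buckets.bit_length() - 1               # file holds 2**k buckets

        def _bucket(self, h):
            start = (h >> (64 - self.k)) * BUCKET_BYTES     # first k bits pick the bucket
            return start, start + BUCKET_BYTES

        def __contains__(self, h):
            start, end = self._bucket(h)
            for pos in range(start, end, SLOT.size):
                v = SLOT.unpack_from(self.map, pos)[0]
                if v == h:
                    return True
                if v == 0:                                  # zeros mark the unused tail
                    return False
            return False

        def add(self, h):
            if h in self:
                return
            start, end = self._bucket(h)
            for pos in range(start, end, SLOT.size):
                if SLOT.unpack_from(self.map, pos)[0] == 0:
                    SLOT.pack_into(self.map, pos, h)        # write through the mapping
                    return
            self._grow()                                    # bucket overflowed
            self.add(h)

        def _grow(self):
            # Double the file and redistribute everything into the new buckets.
            hashes = [SLOT.unpack_from(self.map, pos)[0]
                      for pos in range(0, len(self.map), SLOT.size)]
            new_size = 2 * len(self.map)
            self.map.close()
            self.f.truncate(0)              # drop the old contents...
            self.f.truncate(new_size)       # ...and extend with zeros
            self._remap()
            for h in hashes:
                if h:
                    self.add(h)

Usage is then just something like s = DiskHashSet('some_set.bin'); s.add(h); h in s, with h being the first 64 bits of the MD5 as an integer.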

It's just unbelievably faster than SQLite, even though it's low-level Perl code, and Perl really isn't meant for high-performance databases. It will not work with anything less uniformly distributed than MD5; it assumes the keys are extremely uniform to keep the implementation simple.

I tried it with seek()/sysread()/syswrite() at first, and it was very slow; the mmap() version is really a lot faster.