I have sets of hashes (the first 64 bits of MD5, so they're effectively uniformly distributed), and I want to be able to check whether a new hash is in a set and to add it to a set.
The sets aren't too big: the largest will have millions of elements. But there are hundreds of sets, so I cannot hold them all in memory.
Some ideas I had so far:
Am I missing something really obvious? Any hints on how to implement a good disk-based hash table?
Here's the solution I eventually used:
It's dramatically faster than sqlite, even though it's low-level Perl code, and Perl really isn't meant for high-performance databases. It will not work with anything less uniformly distributed than MD5: it assumes the keys are extremely uniform in order to keep the implementation simple.
I tried it with seek()/sysread()/syswrite() at first, and it was very slow. The mmap() version is a lot faster.
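For illustration only, here is a minimal Python sketch of that kind of mmap()-backed table. It is not the Perl code described above. The class name `DiskHashSet`, the helper `md5_key`, the fixed slot count, linear probing, the little-endian slot layout, and the use of 0 as the "empty" marker are all assumptions made for the example. Like the approach described above, it relies on the keys being uniformly distributed, so a plain `hash % slots` spreads them evenly.

```python
import hashlib
import mmap
import os
import struct
import tempfile

SLOT = struct.Struct("<Q")  # one unsigned 64-bit hash per slot (layout is an assumption)
EMPTY = 0                   # assumption: a real hash is never exactly 0


def md5_key(data: bytes) -> int:
    """First 64 bits of the MD5 digest as an integer (byte order is an assumption)."""
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


class DiskHashSet:
    """Fixed-capacity open-addressing set of 64-bit ints in a memory-mapped file."""

    def __init__(self, path, slots):
        self.slots = slots
        size = slots * SLOT.size
        fd = os.open(path, os.O_RDWR | os.O_CREAT)
        try:
            # A new file is extended with zero bytes, i.e. all slots EMPTY.
            # Assumes an existing file was created with the same slot count.
            os.ftruncate(fd, size)
            self.map = mmap.mmap(fd, size)
        finally:
            os.close(fd)

    def _probe(self, h):
        """Return (found, offset) of the slot holding h or the first empty slot."""
        if h == EMPTY:
            raise ValueError("0 is reserved as the empty marker")
        i = h % self.slots  # uniform keys, so no extra mixing is done
        for _ in range(self.slots):
            off = i * SLOT.size
            (v,) = SLOT.unpack_from(self.map, off)
            if v == h:
                return True, off
            if v == EMPTY:
                return False, off
            i = (i + 1) % self.slots  # linear probing
        raise RuntimeError("table is full")

    def __contains__(self, h):
        return self._probe(h)[0]

    def add(self, h):
        """Insert h; return True if it was not already present."""
        found, off = self._probe(h)
        if not found:
            SLOT.pack_into(self.map, off, h)
        return not found

    def close(self):
        self.map.flush()
        self.map.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "demo.set")
    s = DiskHashSet(path, slots=1 << 20)  # 8 MiB file
    key = md5_key(b"example")
    print(s.add(key), key in s)
    s.close()
```

With one file per set, only the pages that are actually touched need to be resident, so hundreds of such files can exist without all being held in memory. The sketch has no resizing and no deletion; the slot count should be chosen well above the expected number of elements so that probe sequences stay short.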