I'm trying to write a C program that uses a hash table to store different words and I could use some help.
Firstly, I create a hash table with the size of a prime number which is closest to the number of the words I have to store, and then I use a hash function to find an address for each word. I started with the simplest function, adding the letters together, which ended up with 88% collision. Then I started experimenting with the function and found out that whatever I change it to, the collisions don't get lower than 35%. Right now I'm using
unsigned int stringToHash(char *word, unsigned int hashTableSize){
unsigned int counter, hashAddress =0;
for (counter =0; word[counter]!='\0'; counter++){
hashAddress = hashAddress*word[counter] + word[counter] + counter;
}
return (hashAddress%hashTableSize);
}
which is just a random function that I came up with, but it gives me the best results - around 35% collision.
I've been reading articles on hash functions for the past few hours and I tried to use a few simple ones, such as djb2, but all of them gave me even worse results. (djb2 resulted in 37% collision, which isn't much worse, but I was expecting something better rather than worse.) I also don't know how to use some of the other, more complex ones, such as murmur2, because I don't know what the parameters they take (key, len, seed) are.
Is it normal to get more than 35% collisions, even with using the djb2, or am I doing something wrong? What are the key, len and seed values?
Try sdbm:
hashAddress = 0;
for (counter = 0; word[counter]!='\0'; counter++){
hashAddress = word[counter] + (hashAddress << 6) + (hashAddress << 16) - hashAddress;
}
Or djb2:
hashAddress = 5381;
for (counter = 0; word[counter]!='\0'; counter++){
hashAddress = ((hashAddress << 5) + hashAddress) + word[counter];
}
Or Adler32:
uint32_t adler32(const void *buf, size_t buflength) {
const uint8_t *buffer = (const uint8_t*)buf;
uint32_t s1 = 1;
uint32_t s2 = 0;
for (size_t n = 0; n < buflength; n++) {
s1 = (s1 + buffer[n]) % 65521;
s2 = (s2 + s1) % 65521;
}
return (s2 << 16) | s1;
}
// ...
hashAddress = adler32(word, strlen(word));
None of these are really great, though. If you really want good hashes, you need something more complex like lookup3 for example.
Note that a hashtable is expected to have plenty of collisions as soon as it is filled by more than 70-80%. This is perfectly normal and will even happen if you use a very good hash algorithm. That's why most hashtable implementations increase the capacity of the hashtable (e.g. capacity * 1.5 or even capacity * 2) as soon as you are adding something to the hashtable and the ratio size / capacity is already above 0.7 to 0.8. Increasing the capacity means a new hashtable is created with a higher capacity, all values from the current one are added to the new one (therefore they must all be rehashed, as their new index will be different in most cases), the new hashtable array replaces the old one and the old one is released/freed. If you plan on hashing 1000 words, a hashtable capacity of at least 1250 is recommended, better 1400 or even 1500.
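The growth step described above could look roughly like the following. This is only a sketch under a few assumptions: collisions are kept in linked lists, the threshold is fixed at 0.75, the capacity doubles, and all names (WordTable, wordTableInsert and so on) are made up for this example.

```c
#include <stdlib.h>
#include <string.h>

typedef struct WordNode {
    const char *word;          /* not copied; the caller keeps it alive */
    struct WordNode *next;
} WordNode;

typedef struct {
    WordNode **buckets;
    size_t capacity;           /* number of buckets */
    size_t size;               /* number of stored words */
} WordTable;

/* djb2 as shown above; any other hash function would do here. */
static unsigned long hashWord(const char *word) {
    unsigned long hash = 5381;
    while (*word != '\0') {
        hash = ((hash << 5) + hash) + (unsigned char)*word++;
    }
    return hash;
}

/* Returns 1 on success, 0 if memory ran out. */
int wordTableInit(WordTable *table, size_t capacity) {
    if (capacity == 0) capacity = 1;   /* avoid a modulo by zero later on */
    table->buckets = calloc(capacity, sizeof *table->buckets);
    table->capacity = capacity;
    table->size = 0;
    return table->buckets != NULL;
}

/* Moves every node into a new, larger bucket array. Each word has to be
 * rehashed because its index depends on the capacity. */
static int wordTableGrow(WordTable *table, size_t newCapacity) {
    WordNode **newBuckets = calloc(newCapacity, sizeof *newBuckets);
    size_t i;
    if (newBuckets == NULL) return 0;
    for (i = 0; i < table->capacity; i++) {
        WordNode *node = table->buckets[i];
        while (node != NULL) {
            WordNode *next = node->next;
            size_t index = hashWord(node->word) % newCapacity;
            node->next = newBuckets[index];
            newBuckets[index] = node;
            node = next;
        }
    }
    free(table->buckets);          /* release the old array */
    table->buckets = newBuckets;
    table->capacity = newCapacity;
    return 1;
}

/* Returns 1 on success, 0 if memory ran out. Duplicates are not checked. */
int wordTableInsert(WordTable *table, const char *word) {
    WordNode *node;
    size_t index;
    /* Grow if this insert would push size / capacity above 0.75. */
    if ((table->size + 1) * 4 > table->capacity * 3) {
        if (!wordTableGrow(table, table->capacity * 2)) return 0;
    }
    node = malloc(sizeof *node);
    if (node == NULL) return 0;
    index = hashWord(word) % table->capacity;
    node->word = word;
    node->next = table->buckets[index];
    table->buckets[index] = node;
    table->size++;
    return 1;
}
```

Freeing the nodes and checking for duplicates are left out to keep the sketch short.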
Hashtables are not supposed to be "filled to brim", at least not if they shall be fast and efficient (thus they always should have spare capacity). That's the downside of hashtables: they are fast (O(1)), yet they will usually waste more space than would be necessary for storing the same data in another structure (when you store the words as a sorted array, you will only need a capacity of 1000 for 1000 words; the downside is that the lookup cannot be faster than O(log n) in that case). A collision-free hashtable is not possible in most cases either way. Pretty much all hashtable implementations expect collisions to happen and usually have some kind of way to deal with them (usually collisions make the lookup somewhat slower, but the hashtable will still work and still beat other data structures in many cases).
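To put the 35% figure from the question into perspective, here is a back-of-the-envelope sketch. It assumes an idealized hash function that places every key uniformly and independently at random (real hash functions only approximate that), and it counts a collision whenever a key lands on an index that is already occupied. The function name expectedCollisionRate is made up for this example.

```c
/* Expected fraction of keys that land on an already occupied index when
 * `keys` keys are hashed into `capacity` slots by an idealized, perfectly
 * uniform hash function (an assumption, not a property of any real hash). */
double expectedCollisionRate(unsigned int keys, unsigned int capacity) {
    double emptyProbability = 1.0;  /* chance that a given slot stays empty */
    double occupiedSlots;
    unsigned int i;

    if (keys == 0 || capacity == 0) return 0.0;
    for (i = 0; i < keys; i++) {
        emptyProbability *= 1.0 - 1.0 / capacity;
    }
    occupiedSlots = capacity * (1.0 - emptyProbability);
    /* Every key that did not claim a fresh slot collided with an earlier one. */
    return (keys - occupiedSlots) / keys;
}
```

Under these assumptions, expectedCollisionRate(9000, 9000) comes out at roughly 0.37, so about 37% collisions are to be expected when the table has about as many slots as there are words, no matter how good the hash is. With 1000 words in 1500 slots the value drops to roughly 0.27.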
Also note that if you are using a pretty good hash function, there is no requirement, and not even an advantage, for the hashtable to have a power of 2 capacity if you are cropping hash values using modulo (%) in the end. The reason why many hashtable implementations always use power of 2 capacities is that they do not use modulo; instead they use AND (&) for cropping, because an AND operation is among the fastest operations you will find on most CPUs (modulo is never faster than AND; in the best case it would be equally fast, in most cases it is a lot slower). If your hashtable uses power of 2 sizes, you can replace any modulo with an AND operation:
x % 4 == x & 3
x % 8 == x & 7
x % 16 == x & 15
x % 32 == x & 31
...
This only works for power of 2 sizes, though. If you use modulo, power of 2 sizes can only buy you something if the hash is a very bad hash with a very bad "bit distribution". A bad bit distribution is usually caused by hashes that do not use any kind of bit shifting (>> or <<) or any other operations that would have a similar effect as bit shifting.
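The cropping rule can be written down as a small helper. This is just an illustration with made-up names (isPowerOfTwo, cropHash); a real implementation would decide once, when the table is created, which of the two operations to use instead of testing on every call.

```c
#include <stdint.h>

/* Returns 1 if `n` is a power of 2 (1, 2, 4, 8, ...), 0 otherwise. */
int isPowerOfTwo(uint32_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

/* Crops a hash value to a table index. Uses the AND shortcut when the
 * capacity is a power of 2 and falls back to modulo otherwise.
 * `capacity` must not be 0. */
uint32_t cropHash(uint32_t hash, uint32_t capacity) {
    if (isPowerOfTwo(capacity)) {
        return hash & (capacity - 1);
    }
    return hash % capacity;
}
```

The fallback matters: for a capacity of 12, for instance, 12 % 12 is 0 while 12 & 11 is 8, so the AND shortcut would give wrong indexes.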
I created a stripped down lookup3 implementation for you:
#include <stdint.h>
#include <stdlib.h>
#define rot(x,k) (((x)<<(k)) | ((x)>>(32-(k))))
#define mix(a,b,c) \
{ \
a -= c; a ^= rot(c, 4); c += b; \
b -= a; b ^= rot(a, 6); a += c; \
c -= b; c ^= rot(b, 8); b += a; \
a -= c; a ^= rot(c,16); c += b; \
b -= a; b ^= rot(a,19); a += c; \
c -= b; c ^= rot(b, 4); b += a; \
}
#define final(a,b,c) \
{ \
c ^= b; c -= rot(b,14); \
a ^= c; a -= rot(c,11); \
b ^= a; b -= rot(a,25); \
c ^= b; c -= rot(b,16); \
a ^= c; a -= rot(c,4); \
b ^= a; b -= rot(a,14); \
c ^= b; c -= rot(b,24); \
}
uint32_t lookup3 (
const void *key,
size_t length,
uint32_t initval
) {
uint32_t a,b,c;
const uint8_t *k;
const uint32_t *data32Bit;
data32Bit = key;
a = b = c = 0xdeadbeef + (((uint32_t)length)<<2) + initval;
while (length > 12) {
a += *(data32Bit++);
b += *(data32Bit++);
c += *(data32Bit++);
mix(a,b,c);
length -= 12;
}
k = (const uint8_t *)data32Bit;
switch (length) {
case 12: c += ((uint32_t)k[11])<<24;
case 11: c += ((uint32_t)k[10])<<16;
case 10: c += ((uint32_t)k[9])<<8;
case 9 : c += k[8];
case 8 : b += ((uint32_t)k[7])<<24;
case 7 : b += ((uint32_t)k[6])<<16;
case 6 : b += ((uint32_t)k[5])<<8;
case 5 : b += k[4];
case 4 : a += ((uint32_t)k[3])<<24;
case 3 : a += ((uint32_t)k[2])<<16;
case 2 : a += ((uint32_t)k[1])<<8;
case 1 : a += k[0];
break;
case 0 : return c;
}
final(a,b,c);
return c;
}
This code is not as highly optimized for performance as the original code; in return it is a lot simpler. It is also not as portable as the original code, but it is portable to all major consumer platforms in use today. It also completely ignores the CPU's endianness, yet that is not really an issue: it will work on big and little endian CPUs. Just keep in mind that it will not calculate the same hash for the same data on big and little endian CPUs, but that is no requirement; it will calculate a good hash on both kinds of CPUs, and it's only important that it always calculates the same hash for the same input data on a single machine.
You would use this function as follows:
unsigned int stringToHash(char *word, unsigned int hashTableSize){
unsigned int initval;
unsigned int hashAddress;
initval = 12345;
hashAddress = lookup3(word, strlen(word), initval);
return (hashAddress%hashTableSize);
// If the hashtable is guaranteed to always have a size that is a power of 2,
// replace the line above with the following more efficient line:
// return (hashAddress & (hashTableSize - 1));
}
You may wonder what initval is. Well, it is whatever you want it to be. You could call it a salt. It will influence the hash values, yet the hash values will not get better or worse in quality because of this (at least not in the average case; it may lead to more or fewer collisions for very specific data, though). E.g. you can use different initval values if you want to hash the same data twice, yet each time should produce a different hash value (there is no guarantee it will, but it is rather likely if initval is different; if it creates the same value, this would be a very unlucky coincidence that you must treat as a kind of collision). It is not advisable to use different initval values when hashing data for the same hashtable (this will rather cause more collisions on average). Another use for initval is if you want to combine a hash with some other data, in which case the already existing hash becomes initval when hashing the other data (so both the other data and the previous hash influence the outcome of the hash function). You may even set initval to 0 if you like, or pick a random value when the hashtable is created (and always use this random value for this instance of the hashtable, yet each hashtable has its own random value).
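The "combine a hash with some other data" use of initval can be sketched like this. To keep the example short and self-contained, seededHash is only a stand-in for lookup3 with the same parameter list (a djb2-style loop that starts from the seed); both function names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for lookup3(key, length, initval): a djb2-style loop that starts
 * from the seed instead of 5381. Only here to keep the example short. */
uint32_t seededHash(const void *key, size_t length, uint32_t initval) {
    const uint8_t *bytes = key;
    uint32_t hash = initval;
    size_t i;
    for (i = 0; i < length; i++) {
        hash = ((hash << 5) + hash) + bytes[i];
    }
    return hash;
}

/* Hashes a key made of two separate buffers: the hash of the first part
 * becomes the initval for the second part, so both influence the result.
 * `tableSeed` is the value chosen once for this hashtable instance. */
uint32_t hashTwoParts(const void *first, size_t firstLength,
                      const void *second, size_t secondLength,
                      uint32_t tableSeed) {
    uint32_t partial = seededHash(first, firstLength, tableSeed);
    return seededHash(second, secondLength, partial);
}
```

With the real lookup3 the calls would look the same; only the body of the hash function differs.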
A note on collisions:
Collisions are usually not such a huge problem in practice, and it usually does not pay off to waste tons of memory just to avoid them. The question is rather how you are going to deal with them in an efficient way.
You said you are currently dealing with 9000 words. If you were using an unsorted array, finding a word in the array would need 4500 comparisons on average. On my system, 4500 string comparisons (assuming that words are between 3 and 20 characters long) need 38 microseconds (0.000038 seconds). So even such a simple, inefficient algorithm is fast enough for most purposes. Assuming that you sort the word list and use a binary search, finding a word in the array will need only about 13 comparisons. 13 comparisons are close to nothing in terms of time; it's too little to even benchmark reliably. So if finding a word in a hashtable needs 2 to 4 comparisons, I wouldn't even waste a single second on the question whether that may be a huge performance problem.
In your case, a sorted list with binary search may even beat a hashtable by far. Sure, 13 comparisons need more time than 2-4 comparisons; however, in case of a hashtable you must first hash the input data to perform a lookup. Hashing alone may already take longer than 13 comparisons! The better the hash, the longer it will take for the same amount of data to be hashed. So a hashtable only pays off performance-wise if you have a really huge amount of data or if you must update the data frequently (e.g. constantly adding/removing words to/from the table, since these operations are less costly for a hashtable than they are for a sorted list). The fact that a hashtable is O(1) only means that regardless of how big it is, a lookup will always need approximately the same amount of time. O(log n) only means that the lookup time grows logarithmically with the number of words, that is, more words, slower lookup. Yet the Big-O notation says nothing about absolute speed! This is a big misunderstanding. It is not said that an O(1) algorithm always performs faster than an O(log n) one. The Big-O notation only tells you that if the O(log n) algorithm is faster for a certain number of values and you keep increasing the number of values, the O(1) algorithm will certainly overtake the O(log n) algorithm at some point, but your current word count may be far below that point. Without benchmarking both approaches, you cannot say which one is faster by just looking at the Big-O notation.
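For comparison, the sorted list with binary search needs very little code in C, because the standard library already provides qsort() and bsearch(). The sketch below assumes the words are kept as an array of string pointers; sortWords and containsWord are names made up for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Comparison callback for qsort() and bsearch() over an array of
 * `const char *` elements; both pass pointers to the array elements. */
static int compareWords(const void *left, const void *right) {
    const char *const *leftWord = left;
    const char *const *rightWord = right;
    return strcmp(*leftWord, *rightWord);
}

/* Sorts the word list once so that it can be searched with bsearch(). */
void sortWords(const char **words, size_t count) {
    qsort(words, count, sizeof *words, compareWords);
}

/* Returns 1 if `word` is in the sorted list, 0 otherwise. */
int containsWord(const char **sortedWords, size_t count, const char *word) {
    return bsearch(&word, sortedWords, count, sizeof *sortedWords,
                   compareWords) != NULL;
}
```

Benchmarking this against the hashtable with your real word list is the only reliable way to tell which one is faster for your data.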
Back to collisions. What should you do if you run into a collision? If the number of collisions is small, and here I don't mean the overall number of collisions (the number of words that are colliding in the hashtable) but the per index one (the number of words stored at the same hashtable index, so in your case maybe 2-4), the simplest approach is to store them as a linked list. If there was no collision so far for this table index, there is just a single key/value pair. If there was a collision, there is a linked list of key/value pairs. In that case your code must iterate over the linked list and verify each of the keys and return the value if it matches. Going by your numbers, this linked list won't have more than 4 entries and doing 4 comparisons is insignificant in terms of performance. So finding the index is O(1), finding the value (or detecting that this key is not in the table) is O(n), but here n is only the number of linked list entries (so it is 4 at most).
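A minimal sketch of the linked list approach, assuming string keys, int values, a fixed capacity and djb2 as the hash function. All names (Entry, ChainedTable, chainedPut, chainedGet) are made up for this example, and freeing the entries is left out for brevity.

```c
#include <stdlib.h>
#include <string.h>

#define TABLE_CAPACITY 16   /* arbitrary for the example */

typedef struct Entry {
    const char *key;        /* not copied; the caller keeps it alive */
    int value;
    struct Entry *next;     /* next entry that hashed to the same index */
} Entry;

typedef struct {
    Entry *buckets[TABLE_CAPACITY];   /* NULL means no entry at this index */
} ChainedTable;

/* djb2, cropped to a table index. */
static unsigned int indexForKey(const char *key) {
    unsigned long hash = 5381;
    while (*key != '\0') {
        hash = ((hash << 5) + hash) + (unsigned char)*key++;
    }
    return (unsigned int)(hash % TABLE_CAPACITY);
}

/* Stores or updates a key/value pair. Returns 1 on success, 0 if memory
 * ran out. */
int chainedPut(ChainedTable *table, const char *key, int value) {
    unsigned int index = indexForKey(key);
    Entry *entry;
    for (entry = table->buckets[index]; entry != NULL; entry = entry->next) {
        if (strcmp(entry->key, key) == 0) {
            entry->value = value;          /* key already present */
            return 1;
        }
    }
    entry = malloc(sizeof *entry);
    if (entry == NULL) return 0;
    entry->key = key;
    entry->value = value;
    entry->next = table->buckets[index];   /* prepend to the list */
    table->buckets[index] = entry;
    return 1;
}

/* Finding the index is O(1); walking the list is O(n) in the number of
 * entries at that index only. Returns 1 and sets *valueOut if found. */
int chainedGet(const ChainedTable *table, const char *key, int *valueOut) {
    const Entry *entry;
    for (entry = table->buckets[indexForKey(key)]; entry != NULL;
         entry = entry->next) {
        if (strcmp(entry->key, key) == 0) {
            *valueOut = entry->value;
            return 1;
        }
    }
    return 0;
}
```

A table is ready to use once all buckets are NULL, for example after ChainedTable table = {{NULL}};.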
If the number of collisions rises, a linked list can become too slow and you may instead store a dynamically sized, sorted array of key/value pairs, which allows lookups in O(log n), and again, n is only the number of keys in that array, not of all keys in the hashtable. Even if there were 100 collisions at one index, finding the right key/value pair takes at most 7 comparisons. That's still close to nothing. That said, if you really have 100 collisions at one index, either your hash algorithm is unsuited for your key data or the hashtable is far too small in capacity. The disadvantage of a dynamically sized, sorted array is that adding/removing keys is somewhat more work than in case of a linked list (code-wise, not necessarily performance-wise). So using a linked list is usually sufficient if you keep the number of collisions low enough, and it is almost trivial to implement such a linked list yourself in C and add it to an existing hashtable implementation.
Most hashtable implementations I have seen use such a "fallback to an alternate data structure" to deal with collisions. The disadvantage is that these require a little bit of extra memory to store the alternative data structure and a bit more code to also search for keys in that structure. There are also solutions that store collisions inside the hashtable itself and that don't require any additional memory. However, these solutions have a couple of drawbacks. The first drawback is that every collision increases the chances for even more collisions as more data is added. The second drawback is that while lookup times for keys in the hashtable get linearly worse with the number of collisions so far (and as I said before, every collision leads to even more collisions as data is added), lookup times for keys not in the hashtable get worse even faster, and in the end, if you perform a lookup for a key that is not in the hashtable (yet you cannot know without performing the lookup), the lookup may take as long as a linear search over the whole hashtable (YUCK!!!). So if you can spare the extra memory, go for an alternate structure to handle collisions.
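To illustrate the "store collisions inside the hashtable itself" variant, here is a sketch of one such scheme, linear probing, where a colliding word simply goes into the next free slot. The choice of linear probing, the tiny fixed capacity and all names are assumptions made for this example; removing words is not supported.

```c
#include <stddef.h>
#include <string.h>

#define PROBE_CAPACITY 8   /* arbitrary, small on purpose */

typedef struct {
    const char *slots[PROBE_CAPACITY];   /* NULL marks an empty slot */
} ProbingTable;

/* djb2, cropped to a table index. */
static size_t startIndex(const char *word) {
    unsigned long hash = 5381;
    while (*word != '\0') {
        hash = ((hash << 5) + hash) + (unsigned char)*word++;
    }
    return (size_t)(hash % PROBE_CAPACITY);
}

/* Returns 1 on success (or if the word is already stored), 0 if the
 * table is full. */
int probingInsert(ProbingTable *table, const char *word) {
    size_t start = startIndex(word);
    size_t step;
    for (step = 0; step < PROBE_CAPACITY; step++) {
        size_t index = (start + step) % PROBE_CAPACITY;
        if (table->slots[index] == NULL) {
            table->slots[index] = word;
            return 1;
        }
        if (strcmp(table->slots[index], word) == 0) return 1;
    }
    return 0;
}

/* Returns the number of slots that had to be inspected; *found is set to
 * 1 if the word is stored in the table and to 0 otherwise. */
size_t probingLookup(const ProbingTable *table, const char *word, int *found) {
    size_t start = startIndex(word);
    size_t step;
    *found = 0;
    for (step = 0; step < PROBE_CAPACITY; step++) {
        size_t index = (start + step) % PROBE_CAPACITY;
        if (table->slots[index] == NULL) return step + 1;   /* miss */
        if (strcmp(table->slots[index], word) == 0) {
            *found = 1;
            return step + 1;
        }
    }
    return PROBE_CAPACITY;   /* inspected the whole table */
}
```

Once every slot is taken, a lookup for a word that is not in the table has to inspect all PROBE_CAPACITY slots, which is exactly the worst case described above.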