I have a few questions on a data-structure called a hash table (also know as associative array) and how it is implemented in C.
How do you make a hash table in C? What is a hashtable and how do you implement it? Why would I want to use a hash table rather than an array?
NOTE: I know this is a really broad question which will require a large answer but, I made this because I had some people asking me what it all was. so I put it on here to fully explain it and help anyone else out.
Prerequisites
For this answer I'm going to assume you know how to use pointers, structs, and have a basic understanding of the C language.
Also if you don't know. When talking about the speed of algorithms and data structures you should know the terms:
O() = (it's pronounced "Big-oh") Big-oh or O() refers to the "worst-case-scenario" runtime. Similarly, in math, it's big O notation and describes the limiting behavior of a function. If somethings O(1) that's constant time "really good". If somethings O(n) that means if the list is a million long. It is at worst going to run a million time. O() is generally the one used to determine how fast something runs because that's how fast it'll run in it's worst case.
Ω = (greek letter Omega) refers to it's best case scenario. It's not used that as much as O() so I won't go into too much detail about it. But just know that if somethings Ω(1), in it's best case scenario it'll take just one time.
Θ = (greek letter theta) is unique in that it is only used when the O() and Ω() runtime are the same. So like in the case of the recursive sorting algorithm merge sort. It's run time is Θ(n(log(n))). Which means that it's O(n(log(n))) and it's Ω(n(log(n))).
What is a Hash table?
A hash table or associative array is a popular data structure used in programming. A hash table is just a linked list (I'll get to what a linked list is later on) with a hash function. A hash function basically just takes things and puts them in different "baskets". Each "basket" is just another linked list or something else depending on how you implement it. I'll explain more details on hash tables when I show you how to implement one.
Why would I want to use a hash table rather than an array?
An array is very easy to use and simple to make, but it also has its downsides. For this example, let's say we have a program and in which we want to keep all its users in an array.
That's pretty simple. Let's just say we plan on this program having no more than 100 users and fill that array with our users
char* users[100];
// iterate over every user and "store" their name
for (int i = 0; i < userCount; i++)
{
users[i] = "New username here";
}
So that works all well and fine and really fast too. That's O(1) right there. We can access any user in constant time.
But let's now assume that our program gets really popular. It now has over 80 users. Uh-Oh! We better increase the size of that array or else we'll get a buffer overflow.
So how do we do that? Well we're gonna have to make a new array that's bigger and copy over the contents of the old array into the new array.
That's very costly and we don't want to do that. We want to think cleverly and not use a something that has a fixed size. Well we already know how to use pointers to our advantage and we can bundle information into a struct
if we wanted to.
So we could create a struct
to store the username and then have it point (via a pointer) to a new struct
. Voila! We now have a data structure that is expandable. It's a list of bundled information that's linked together by pointers. Thus the name linked list.
Linked Lists
So let's create that linked list. First we're gonna need a struct
typedef struct node
{
char* name;
struct node* next;
}
node;
Alright so we have a string name
and a... Wait a sec... I've never heard of a data type called a struct node
. Well for our convenience I typedef
a new "data type" called node
that also happens to be our struct
called node
.
So now that we have our node for our list, what do we need next? Well we need to create a "root" to our list so we can traverse
it (I'll explain what I mean by traverse
later). So let's assign a root. (remember that node
data type I typdef
ed earlier)
node* first = NULL;
So now that we have our root all we need to do is make a function to insert new usernames into our list.
/*
* inserts a name called buffer into
* our linked list
*/
void insert(char* buffer)
{
// try to instantiate node for number
node* newptr = malloc(sizeof(node));
if (newptr == NULL)
{
return;
}
// make a new ponter
newptr->name = buffer;
newptr->next = NULL;
// check for empty list
if (first == NULL)
{
first = newptr;
}
// check for insertion at tail
else
{
// keep track of the previous spot in list
node* predptr = first;
// because we don't know how long this list is
// we must induce a forever loop until we find the end
while (true)
{
// check if it is the end of the list
if (predptr->next == NULL)
{
// add new node to end of list
predptr->next = newptr;
// break out of forever loop
break;
}
// update pointer
predptr = predptr->next;
}
}
}
So there you go. We have a basic linked list and now we can keep adding users all we want and we don't have to worry about running out of room. But this does come with down sides. The big problem with this is that every node or "user" in our list is "anonymous". We don't know were they are at or even how many users we have with this. (of course there are ways of making this much better -- I just want to show a very basic linked list) We have to traverse the entire list to add a user because we cannot access the end directly.
It's like we are in a huge dust storm and you can't see anything and we need to get to our barn. We can't see where our barn is but we have a solution. There are people standing our there (our node
s) and they are all holding two ropes (our pointers). Each person only owns one rope but that rope is being held at the other end by someone else. Just like our struct
, the rope acts as a pointer to where they are. So how do we get to our barn? (for this example the barn is the last "person" in the list). Well we have no idea how big our line of people are or where they go. In fact, all we see is a fence post with a rope tied to it. (Our root!) that fence post will never change so we can grab the post and start moving along until we see our first person. That person is holding two ropes (the post's pointer and their pointer).
So we keep traveling along the rope until we reach a new person and grab onto their rope. Eventually, we get to the end and find our barn!
So that is a linked list in a nutshell. Its benefits are that it can expand as much as you want but its runtime depends on how big the list is, namely O(n). So if there are 1 million users, it would have to run 1 million times to insert a new name! Wow that seems really wasteful just to insert 1 name.
Luckily, we are clever and can create a better solution. Why don't we, instead of having just one linked list, have a few linked lists. An array of linked lists if you will. Why don't we make an array of size 26. So we can have a unique linked list for every letter of the alphabet. Now instead of a run time of n. We can reasonably say that our new run time will be n/26. Now that won't make much of a difference at all if you have a list 1 million big. But we're just going to keep it simple for this example.
So we have an array of linked lists but how are we going to sort our users into the array. Well... why don't we make a function that decides which user should go where. This function will "hash" the users if you will into an array or "table". So let's create this "hashed" linked list. Thus the name hash table
Hash Table
As I just said, our hash table will be an array of linked lists and will be hashed by the first letter of their username. A
will go to position 0, B
to 1, and so on.
The struct
for the this hash table will be the same as the struct for our previous linked list
typedef struct node
{
char* name;
struct node* next;
}
node;
Now just like our linked list, we need a root for our hash table
node* first[26] = {NULL};
The root will be an array the size of the alphabet and all positions in it will be initialized to NULL
. (Remember: the last element in a linked list always has to point to NULL
or else we wouldn't know it was the end)
Let's make a main function. It takes a username we are going to hash then insert.
int main(char* name)
{
// hash the name into a spot
int hashedValue = hash(name);
// insert the name in table with hashed value
insert(hashedValue, name);
}
So here's our hash function. It's pretty simple. All we want to do is look at the first letter in the word and give a value from 0 - 25 based on what letter it is
/*
* takes a string and hashes it into the correct bucket
*/
int hash(const char* buffer)
{
// assign a number to the first char of buffer from 0-25
return tolower(buffer[0]) - 'a';
}
So now all we need is to create our insert function. It's going to look just like our insert function before except every time we reference our root, we're going to reference it as an array.
/*
* takes a string and inserts it into a linked list at a part of the hash table
*/
void insert(int key, const char* buffer)
{
// try to instantiate node to insert word
node* newptr = malloc(sizeof(node));
if (newptr == NULL)
{
return;
}
// make a new pointer
strcpy(newptr->name, buffer);
newptr->next = NULL;
// check for empty list
if (first[key] == NULL)
{
first[key] = newptr;
}
// check for insertion at tail
else
{
node* predptr = first[key];
while (true)
{
// insert at tail
if (predptr->next == NULL)
{
predptr->next = newptr;
break;
}
// update pointer
predptr = predptr->next;
}
}
}
So that's basics of a hash table. It's pretty simple if you know how to use pointers and structs. I know that was a pretty simple example of a hash table with only an insert function, but you can make it a lot better and get more creative with your hashing function. You can also make the array as big as you want or even a use multi-dimensional array.