The fundamentals of Hash tables?

kylex · Nov 12, 2008

I'm quite confused about the basic concepts of a Hash table. If I were to code a hash how would I even begin? What is the difference between a Hash table and just a normal array?

Basically if someone answered this question I think all my questions would be answered: If I had 100 randomly generated numbers (as keys), how would I implement a hash table and why would that be advantageous over an array?

Pseudo-code or Java would be appreciated as a learning tool...

Answer

Adam Liss · Nov 12, 2008

The answers so far have helped to define hash tables and explain some theory, but I think an example may help you get a better feeling for them.

What is the difference between a hash table and just a normal array?

A hash table and an array are both structures that allow you to store and retrieve data. Both allow you to specify an index and retrieve a value associated with it. The difference, as Daniel Spiewak noted, is that the indices of an array are sequential, while those of a hash table are computed from the data itself (the key you use to look a value up).

Why would I use a hash table?

A hash table can provide a very efficient way to search for items in large amounts of data, particularly data that is not otherwise easily searchable. ("Large" here means ginormous, in the sense that it would take a long time to perform a sequential search).

If I were to code a hash how would I even begin?

No problem. The simplest way is to invent an arbitrary mathematical operation that you can perform on the data, that returns a number N (usually an integer). Then use that number as the index into an array of "buckets" and store your data in bucket #N. The trick is in selecting an operation that tends to place values in different buckets in a way that makes it easy for you to find them later.
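As a minimal sketch of that idea in Java, assuming integer keys (such as the randomly generated numbers in the question) and a fixed number of buckets. The class name `BucketTable`, the bucket count of 10, and the choice of "remainder after division" as the operation are illustrative assumptions, not the only way to do it:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: a fixed list of buckets, each holding the keys that hash to it.
class BucketTable {
    private final List<List<Integer>> buckets = new ArrayList<>();

    BucketTable(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    // The "arbitrary mathematical operation": remainder after dividing by the bucket count.
    // Math.floorMod keeps the result in 0..bucketCount-1 even for negative keys.
    int hash(int key) {
        return Math.floorMod(key, buckets.size());
    }

    void add(int key) {
        buckets.get(hash(key)).add(key);
    }

    // Only the one bucket the key hashes to is searched, not the whole collection.
    boolean contains(int key) {
        return buckets.get(hash(key)).contains(key);
    }

    public static void main(String[] args) {
        BucketTable table = new BucketTable(10);
        table.add(42);
        table.add(1337);
        System.out.println(table.hash(42));       // prints 2
        System.out.println(table.contains(1337)); // prints true
        System.out.println(table.contains(7));    // prints false
    }
}
```

Note that 1337 and 7 land in the same bucket here, which is why `contains` still compares the stored keys inside that bucket rather than trusting the bucket number alone.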

Example: A large mall keeps a database of its patrons' cars and parking locations, to help shoppers remember where they parked. The database stores make, color, license plate, and parking location. On leaving the store, a shopper finds his car by entering its make and color. The database returns a (relatively short) list of license plates and parking spaces. A quick scan locates the shopper's car.

You could implement this with an SQL query:

SELECT license, location FROM cars WHERE make='$(make)' AND color='$(color)'

If the data were stored in an array, which is essentially just a list, you can imagine implementing the query by scanning an array for all matching entries.
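To make the comparison concrete, an array scan might look like the sketch below. The `ArrayScan` and `Car` classes, the `findByScan` method, and the sample cars are all hypothetical names and data invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

class ArrayScan {
    // Hypothetical record of one parked car (fields mirror the columns in the example).
    static class Car {
        final String make, color, license, location;

        Car(String make, String color, String license, String location) {
            this.make = make;
            this.color = color;
            this.license = license;
            this.location = location;
        }
    }

    // Checks every entry, so the work grows with the total number of cars.
    static List<Car> findByScan(Car[] cars, String make, String color) {
        List<Car> matches = new ArrayList<>();
        for (Car car : cars) {
            if (car.make.equals(make) && car.color.equals(color)) {
                matches.add(car);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Car[] cars = {
            new Car("Honda", "blue", "ABC123", "B4"),
            new Car("Ford", "red", "XYZ789", "C2"),
            new Car("Honda", "blue", "JKL456", "A7"),
        };
        // Prints the two blue Hondas, in array order.
        for (Car car : findByScan(cars, "Honda", "blue")) {
            System.out.println(car.license + " " + car.location);
        }
    }
}
```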

On the other hand, imagine a hash rule:

Add the ASCII character codes of all the letters in the make and color, divide by 100, and use the remainder as the hash value.

This rule will convert each item to a number between 0 and 99, essentially sorting the data into 100 buckets. Each time a customer needs to locate a car, you can hash the make and color to find the one bucket out of 100 that contains the information. If the cars are spread fairly evenly across the buckets, you've immediately reduced the search by a factor of about 100!
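The hash rule above could be sketched in Java roughly as follows. The `ParkingIndex` and `Car` classes and the `park` and `find` methods are hypothetical names, and the sketch assumes the rule sums every character in the make and color exactly as typed (so spelling and capitalization matter):

```java
import java.util.ArrayList;
import java.util.List;

class ParkingIndex {
    static final int BUCKET_COUNT = 100;

    // Hypothetical record of one parked car.
    static class Car {
        final String make, color, license, location;

        Car(String make, String color, String license, String location) {
            this.make = make;
            this.color = color;
            this.license = license;
            this.location = location;
        }
    }

    private final List<List<Car>> buckets = new ArrayList<>();

    ParkingIndex() {
        for (int i = 0; i < BUCKET_COUNT; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    // Sum the character codes of make and color, then take the remainder after dividing by 100.
    static int hash(String make, String color) {
        int sum = 0;
        for (char c : (make + color).toCharArray()) {
            sum += c;
        }
        return sum % BUCKET_COUNT;
    }

    void park(Car car) {
        buckets.get(hash(car.make, car.color)).add(car);
    }

    // Scans a single bucket. Different make/color pairs can share a bucket,
    // so each candidate is still compared against the requested make and color.
    List<Car> find(String make, String color) {
        List<Car> matches = new ArrayList<>();
        for (Car car : buckets.get(hash(make, color))) {
            if (car.make.equals(make) && car.color.equals(color)) {
                matches.add(car);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        ParkingIndex index = new ParkingIndex();
        index.park(new Car("Honda", "blue", "ABC123", "B4"));
        index.park(new Car("Ford", "red", "XYZ789", "C2"));
        System.out.println(hash("Honda", "blue"));                    // prints 14
        System.out.println(index.find("Honda", "blue").get(0).location); // prints B4
    }
}
```

Because the rule only adds character codes, any two make/color pairs with the same letters in a different order end up in the same bucket, which is one reason the final comparison inside `find` is needed.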

Now scale the example to huge amounts of data, say a database with millions of entries that is searched based on tens of criteria. A "good" hash function will distribute the data into buckets in a way that minimizes any additional searching, saving a significant amount of time.
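One rough way to see what "good" means is to count how many entries end up in the fullest bucket, since that is the most a single lookup would have to scan. The sketch below is only an illustration under made-up assumptions: the `BucketSpread` class, the `largestBucket` helper, the sample keys (multiples of 10), and both candidate hash rules are hypothetical.

```java
import java.util.function.IntUnaryOperator;

class BucketSpread {
    // Size of the fullest bucket: the worst-case number of entries one lookup must scan.
    // Assumes the hash returns a value in 0..bucketCount-1 for every key.
    static int largestBucket(int[] keys, int bucketCount, IntUnaryOperator hash) {
        int[] counts = new int[bucketCount];
        int largest = 0;
        for (int key : keys) {
            int bucket = hash.applyAsInt(key);
            counts[bucket]++;
            largest = Math.max(largest, counts[bucket]);
        }
        return largest;
    }

    public static void main(String[] args) {
        int[] keys = new int[100];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = i * 10; // 0, 10, 20, ..., 990
        }

        // Poor fit for this data: every key ends in 0, so all 100 land in bucket 0.
        System.out.println(largestBucket(keys, 10, k -> k % 10));        // prints 100

        // Better fit for this data: the tens digit varies, so each bucket gets 10 keys.
        System.out.println(largestBucket(keys, 10, k -> (k / 10) % 10)); // prints 10
    }
}
```

The same 100 keys and the same 10 buckets give a lookup that scans either 100 entries or 10, depending only on how well the hash rule suits the data.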