Choosing a Data structure for very large data

Carlos picture Carlos · Nov 24, 2010 · Viewed 18k times · Source

I have x (millions) positive integers, where their values can be as big as allowed (+2,147,483,647). Assuming they are unique, what is the best way to store them for a lookup intensive program.

So far i thought of using a binary AVL tree or a hash table, where the integer is the key to the mapped data (a name). However am not to sure whether i can implement such large keys and in such large quantity with a hash table (wouldn't that create a >0.8 load factor in addition to be prone for collisions?)

Could i get some advise on which data structure might be suitable for my situation

Answer

Jeffrey Hantin picture Jeffrey Hantin · Nov 24, 2010

The choice of structure depends heavily on how much memory you have available. I'm assuming based on the description that you need lookup but not to loop over them, find nearest, or other similar operations.

Best is probably a bucketed hash table. By placing hash collisions into buckets and keeping separate arrays in the bucket for keys and values, you can both reduce the size of the table proper and take advantage of CPU cache speedup when searching a bucket. Linear search within a bucket may even end up faster than binary search!

AVL trees are nice for data sets that are read-intensive but not read-only AND require ordered enumeration, find nearest and similar operations, but they're an annoyingly amount of work to implement correctly. You may get better performance with a B-tree because of CPU cache behavior, though, especially a cache-oblivious B-tree algorithm.