When should I choose bucket sort over other sorting algorithms?

Rony · Jul 26, 2015 · Viewed 15.5k times

When is the bucket sort algorithm the best method to use for sorting? Is there a recommended guide for when to use it, based on the size of the input or the type of data structure?

Answer

templatetypedef · Aug 25, 2015

Bucket sort is a non-comparison-based sorting algorithm that assumes it's possible to create an array of buckets and distribute the items to be sorted into those buckets by index. Therefore, as a prerequisite for even using bucket sort in the first place, you need to have some way of obtaining an index for each item. Those indices can't just come from a hash function; they need to satisfy the property that if an object x comes before an object y in sorted order, then x's bucket index is no greater than y's bucket index. Many objects have this property: you can sort integers this way by looking at their most significant bits, and you can sort strings this way by looking at the first few characters. Many other types do not.
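To make the ordering requirement concrete, here is a minimal sketch in Python. The function names (bucket_index_int, bucket_index_str, is_order_preserving) and the 16-bit width are assumptions chosen for illustration, not part of any library. It checks that an index built from the high bits of an integer or the first character of a string respects sorted order, while an index built from the low bits (which is what a typical hash-style modulo gives you) does not.

```python
def bucket_index_int(x: int, bits: int = 16, top_bits: int = 4) -> int:
    """Bucket index taken from the top `top_bits` bits of a non-negative `bits`-bit integer."""
    return x >> (bits - top_bits)


def bucket_index_str(s: str) -> int:
    """Bucket index taken from the first character; the empty string gets bucket 0."""
    return ord(s[0]) + 1 if s else 0


def is_order_preserving(items, index_of) -> bool:
    """Return True if bucket indices never decrease as we walk the items in sorted order."""
    indices = [index_of(x) for x in sorted(items)]
    return all(a <= b for a, b in zip(indices, indices[1:]))


if __name__ == "__main__":
    numbers = [5, 60000, 4096, 123]
    words = ["pear", "", "apple", "fig"]
    print(is_order_preserving(numbers, bucket_index_int))      # high bits
    print(is_order_preserving(words, bucket_index_str))        # first character
    print(is_order_preserving([1, 16, 17], lambda x: x % 16))  # low bits, hash-like
```

The low-bits index fails because 16 lands in bucket 0 while the smaller value 1 lands in bucket 1, so concatenating the buckets would not give a sorted result.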

The advantage of bucket sort is that once the elements are distributed into buckets, each bucket can be processed independently of the others. This means that the follow-up step often only has to sort arrays much smaller than the original. It also means that you can sort all of the buckets in parallel with one another. The disadvantage is that if you get a bad distribution into the buckets, you may end up doing a huge amount of extra work for little or no benefit. As a result, bucket sort works best when the data are more or less uniformly distributed, or when there is an intelligent way to choose the buckets using a quick set of heuristics based on the input array. Bucket sort also works well if you have a large degree of parallelism available.
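As a rough illustration of how much the distribution matters, here is a small sketch of bucket sort over numbers in a known range [lo, hi). The function name bucket_sort, the equal-width bucketing rule, and the test data are all assumptions made for this example. It returns the bucket sizes alongside the sorted output so you can see what the follow-up sorts have to deal with.

```python
import random


def bucket_sort(values, num_buckets, lo, hi):
    """Sort numbers in [lo, hi) using equal-width buckets.

    Returns (sorted_values, bucket_sizes).
    """
    buckets = [[] for _ in range(num_buckets)]
    width = (hi - lo) / num_buckets
    for v in values:
        # Clamp so rounding at the top of the range can't overflow the array.
        i = min(int((v - lo) / width), num_buckets - 1)
        buckets[i].append(v)

    result = []
    for bucket in buckets:
        # Each bucket is independent, so this step could be run in parallel.
        bucket.sort()
        result.extend(bucket)
    return result, [len(b) for b in buckets]


if __name__ == "__main__":
    random.seed(0)
    uniform = [random.random() for _ in range(10_000)]
    clustered = [0.55 + random.random() * 0.001 for _ in range(10_000)]

    _, uniform_sizes = bucket_sort(uniform, 10, 0.0, 1.0)
    _, clustered_sizes = bucket_sort(clustered, 10, 0.0, 1.0)
    print("largest bucket, uniform input:  ", max(uniform_sizes))
    print("largest bucket, clustered input:", max(clustered_sizes))
```

With the clustered input every value falls into the same bucket, so the "smaller" follow-up sort is just a sort of the whole array plus the cost of the distribution pass.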

Another advantage of bucket sort is that you can use it as an external sorting algorithm. If you need to sort a list that is so huge you can't fit it into memory, you can stream the list through RAM, distribute the items into buckets stored in external files, then sort each file in RAM independently.
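Here is one way that external-sorting idea might look, as a sketch only. It assumes the input is a stream of integers in a known range [lo, hi), one temporary text file per bucket, and that each individual bucket fits in memory; the function name external_bucket_sort and the file layout are made up for the example.

```python
import os
import tempfile


def external_bucket_sort(numbers, num_buckets, lo, hi):
    """Yield integers from `numbers` (all in [lo, hi)) in sorted order.

    Only one bucket is held in memory at a time; the rest live in temp files.
    """
    # Ceiling division so every value in [lo, hi) maps to a valid bucket.
    width = (hi - lo + num_buckets - 1) // num_buckets
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"bucket_{i}.txt") for i in range(num_buckets)]

        # Pass 1: stream the input and append each item to its bucket file.
        files = [open(p, "w") for p in paths]
        try:
            for n in numbers:
                files[(n - lo) // width].write(f"{n}\n")
        finally:
            for f in files:
                f.close()

        # Pass 2: load, sort, and emit one bucket at a time, in bucket order.
        for p in paths:
            with open(p) as f:
                chunk = sorted(int(line) for line in f)
            yield from chunk


if __name__ == "__main__":
    stream = ((i * 37) % 1000 for i in range(500))  # stand-in for a huge input
    print(list(external_bucket_sort(stream, 8, 0, 1000))[:10])
```

In practice the number of buckets is limited by how many files you can keep open at once, and a badly skewed input can still produce a bucket too large for RAM, in which case that bucket would need to be split again.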

Here are a few disadvantages of bucket sort:

  • As mentioned above, you can't apply it to all data types because you need a good bucketing scheme.
  • Bucket sort's efficiency is sensitive to the distribution of the input values, so if your values are tightly clustered, most of them land in a few buckets and it's not worth it.
  • In many cases where you could use bucket sort, you could also use another specialized sorting algorithm like radix sort, counting sort, or burstsort instead and get better performance.
  • The performance of bucket sort depends on the number of buckets chosen, which might require some extra performance tuning compared to other algorithms.
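For comparison, here is a minimal sketch of counting sort, one of the specialized alternatives mentioned in the list above. You can think of it as the extreme case with one bucket per possible key, so no per-bucket sort is needed. It assumes the inputs are non-negative integers no larger than a known max_value; the function name and signature are just for illustration.

```python
def counting_sort(values, max_value):
    """Sort non-negative integers in the range 0..max_value by counting occurrences."""
    counts = [0] * (max_value + 1)
    for v in values:
        counts[v] += 1

    result = []
    for value, count in enumerate(counts):
        # Emit each value as many times as it appeared in the input.
        result.extend([value] * count)
    return result


if __name__ == "__main__":
    print(counting_sort([3, 1, 4, 1, 5, 9, 2, 6], 9))
```

This only pays off when max_value is small relative to the number of items, since the counts array takes memory and time proportional to max_value.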

I hope this helps give you a sense of the relative advantages and disadvantages of bucket sort. Ultimately, the best way to figure out whether it's a good fit is to compare it against other algorithms and see how it actually does, though the above criteria might help you avoid spending your time comparing it in cases where it's unlikely to work well.