Sorting large data using MapReduce/Hadoop

Chander Shivdasani · Sep 2, 2010 · Viewed 25.9k times

I am reading about MapReduce and the following thing is confusing me.

Suppose we have a file with 1 million entries (integers) and we want to sort them using MapReduce. The way I understood to go about it is as follows:

Write a mapper function that sorts integers. The framework will divide the input file into multiple chunks and give them to different mappers. Each mapper will sort its chunk of data independently of the others. Once all the mappers are done, we pass each of their results to the Reducer, which combines them and gives the final output.
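For concreteness, a rough sketch of how such a job could look in the Hadoop Java API (the class names and the one-integer-per-line text input format are my assumptions, not part of the question). Note that in Hadoop the sorting itself actually happens in the framework's shuffle phase, which sorts map output by key, so the user code mostly just picks the right key; with a single reducer (job.setNumReduceTasks(1)) the final output is totally ordered:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: emit each integer as the key. Hadoop's shuffle sorts map
    // output by key, so the reducer receives the integers in ascending order.
    public class SortMapper
            extends Mapper<LongWritable, Text, IntWritable, NullWritable> {
        private final IntWritable out = new IntWritable();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            out.set(Integer.parseInt(line.toString().trim()));
            context.write(out, NullWritable.get());
        }
    }

    // Reducer: write each key back out once per occurrence, preserving duplicates.
    class SortReducer
            extends Reducer<IntWritable, NullWritable, IntWritable, NullWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<NullWritable> values,
                Context context) throws IOException, InterruptedException {
            for (NullWritable v : values) {
                context.write(key, NullWritable.get());
            }
        }
    }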

My doubt is: if we have one reducer, how does this leverage the distributed framework if, eventually, we have to combine the result in one place? The problem boils down to merging 1 million entries in one place. Is that so, or am I missing something?

Thanks, Chander

Answer

Peter Tillemans · Sep 2, 2010

Check out merge-sort.

It turns out that merging already-sorted lists is much more efficient, in both operations and memory consumption, than sorting the complete list from scratch.

If the reducer gets 4 sorted lists, it only needs to look at the smallest remaining element of each of the 4 lists and pick that one, repeating until all lists are exhausted. If the number of lists is constant, this reduce step is an O(N) operation.
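To make that concrete, here is a small self-contained sketch in plain Java (not Hadoop code; the class names are made up for illustration): the current head of each sorted list sits in a priority queue, and we repeatedly extract the minimum. With a heap the cost is O(N log k) for k lists, which is O(N) when k is a constant, matching the claim above.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.Iterator;
    import java.util.List;
    import java.util.PriorityQueue;

    public class KWayMerge {

        // Heap entry: the current smallest unconsumed value of one list,
        // plus the iterator that yields the rest of that list.
        static class Head {
            final int value;
            final Iterator<Integer> rest;
            Head(int value, Iterator<Integer> rest) { this.value = value; this.rest = rest; }
        }

        public static List<Integer> merge(List<List<Integer>> sortedLists) {
            PriorityQueue<Head> heap =
                    new PriorityQueue<>(Comparator.comparingInt((Head h) -> h.value));
            for (List<Integer> list : sortedLists) {
                Iterator<Integer> it = list.iterator();
                if (it.hasNext()) heap.add(new Head(it.next(), it));
            }
            List<Integer> out = new ArrayList<>();
            while (!heap.isEmpty()) {
                Head smallest = heap.poll();      // smallest element across all k lists
                out.add(smallest.value);
                if (smallest.rest.hasNext()) {    // refill the heap from the same list
                    heap.add(new Head(smallest.rest.next(), smallest.rest));
                }
            }
            return out;
        }

        public static void main(String[] args) {
            System.out.println(merge(List.of(
                    List.of(1, 4, 7), List.of(2, 5, 8), List.of(0, 3, 6, 9))));
            // prints [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
        }
    }

This is exactly the merge step of merge sort, generalized from 2 lists to k.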

Also, typically the reducers are "distributed" in something like a tree, so the work can be parallelized too.
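A hedged sketch of that tree idea, again in plain Java rather than a real Hadoop topology: merge the sorted lists pairwise in rounds. The merges within one round are independent of each other, so each could run on a different machine, and k lists collapse to one in about log2(k) rounds.

    import java.util.ArrayList;
    import java.util.List;

    public class TreeMerge {

        // Standard two-way merge of sorted lists, O(n) time.
        static List<Integer> merge2(List<Integer> a, List<Integer> b) {
            List<Integer> out = new ArrayList<>(a.size() + b.size());
            int i = 0, j = 0;
            while (i < a.size() && j < b.size()) {
                out.add(a.get(i) <= b.get(j) ? a.get(i++) : b.get(j++));
            }
            while (i < a.size()) out.add(a.get(i++));
            while (j < b.size()) out.add(b.get(j++));
            return out;
        }

        // Merge k sorted lists in ~log2(k) rounds. The pairwise merges inside
        // a round do not depend on each other, which is what makes a tree of
        // reducers parallelizable.
        static List<Integer> treeMerge(List<List<Integer>> lists) {
            while (lists.size() > 1) {
                List<List<Integer>> next = new ArrayList<>();
                for (int i = 0; i + 1 < lists.size(); i += 2) {
                    next.add(merge2(lists.get(i), lists.get(i + 1)));
                }
                if (lists.size() % 2 == 1) {      // odd list out advances unmerged
                    next.add(lists.get(lists.size() - 1));
                }
                lists = next;
            }
            return lists.isEmpty() ? List.of() : lists.get(0);
        }

        public static void main(String[] args) {
            System.out.println(treeMerge(List.of(
                    List.of(1, 5), List.of(2, 6), List.of(3, 7), List.of(4, 8))));
            // prints [1, 2, 3, 4, 5, 6, 7, 8]
        }
    }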