Hadoop combiner sort phase

Michael Mior · Oct 19, 2011 · Viewed 7.5k times

When running a MapReduce job with a specified combiner, is the combiner also run during the sort phase? I understand that the combiner is run on mapper output for each spill, but it seems like it would also be beneficial to run it during the intermediate steps of the merge sort. I'm assuming here that at some stage of the sort, mapper output for equal keys is held in memory together.

If this doesn't currently happen, is there a particular reason, or is it just something that hasn't been implemented?

Thanks in advance!

Answer

Thomas Jungblut · Oct 19, 2011

Combiners are there to save network bandwidth.

The map output gets sorted directly:

    sorter.sort(MapOutputBuffer.this, kvstart, endPosition, reporter);

This happens right after the actual mapping is done. While iterating through the buffer, it checks whether a combiner has been set. If so, it combines the records; if not, it spills them directly to disk.

The important parts are in MapTask, if you'd like to see for yourself:

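    // sort the in-memory buffer before spilling
    sorter.sort(MapOutputBuffer.this, kvstart, endPosition, reporter);
    // some fields
    for (int i = 0; i < partitions; ++i) {
        // check whether a combiner is configured
        if (combinerRunner == null) {
            // no combiner: spill the sorted records directly
        } else {
            // combiner set: combine the records before they are written
            combinerRunner.combine(kvIter, combineCollector);
        }
    }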

This is the right stage to save disk space and network bandwidth, because the output very likely has to be transferred. During the merge/shuffle/sort phase it is not beneficial, because by then you have to crunch more data than when the combiner runs at map finish time.
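To make the effect of combining at spill time concrete, here is a minimal sketch in plain Java. It is not Hadoop code: `SpillSketch`, `sortAndSpill`, and the word-count style records are hypothetical names chosen for illustration, and an in-memory list stands in for the map output buffer. It only mimics the order of operations described above: sort first, then either spill as-is or combine adjacent records that share a key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Hypothetical stand-in for the map-side sort-and-spill step; not Hadoop code.
class SpillSketch {

    // Sorts the buffered records by key, then either "spills" them unchanged
    // (combiner == null) or merges adjacent records with the same key first.
    static List<Map.Entry<String, Integer>> sortAndSpill(
            List<Map.Entry<String, Integer>> buffer,
            BinaryOperator<Integer> combiner) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(buffer);
        sorted.sort(Map.Entry.comparingByKey());

        if (combiner == null) {
            return sorted; // no combiner configured: spill directly
        }

        List<Map.Entry<String, Integer>> spill = new ArrayList<>();
        for (Map.Entry<String, Integer> record : sorted) {
            int last = spill.size() - 1;
            if (last >= 0 && spill.get(last).getKey().equals(record.getKey())) {
                // Same key as the previous record: fold the value into it.
                Integer merged = combiner.apply(spill.get(last).getValue(), record.getValue());
                spill.set(last, new SimpleEntry<>(record.getKey(), merged));
            } else {
                spill.add(new SimpleEntry<>(record.getKey(), record.getValue()));
            }
        }
        return spill;
    }

    public static void main(String[] args) {
        // Six (word, 1) records, as a word-count mapper might emit them.
        List<Map.Entry<String, Integer>> buffer = new ArrayList<>();
        for (String word : "b a b c a b".split(" ")) {
            buffer.add(new SimpleEntry<>(word, 1));
        }
        System.out.println(sortAndSpill(buffer, null).size());  // 6
        System.out.println(sortAndSpill(buffer, Integer::sum)); // [a=2, b=3, c=1]
    }
}
```

In this toy run the spill shrinks from six records to three, which is the saving in disk space and transferred data that the combiner is there for.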

Note that the sort phase shown in the web interface is misleading: it is purely merging.