Number of reducers in Hadoop

Mohit Jain · Jul 4, 2016 · Viewed 15k times

While learning Hadoop, I found the number of reducers very confusing. Different sources claim:

1) The number of reducers is the same as the number of partitions.

2) The number of reducers is 0.95 or 1.75 multiplied by (no. of nodes) * (max containers per node).

3) The number of reducers is set by mapred.reduce.tasks.

4) The number of reducers should be chosen so that each reducer processes a multiple of the block size, runs for between 5 and 15 minutes, and creates the fewest output files possible.

I am very confused: do we explicitly set the number of reducers, or is it decided by the MapReduce framework itself?

How is the number of reducers calculated?

Answer

ViKiG · Jul 4, 2016

1 - The number of reducers is the same as the number of partitions - false. A single reducer might work on one or more partitions, but each partition is processed entirely by the one reducer it is assigned to.
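For context on how keys land in partitions, Hadoop's default HashPartitioner assigns each key to one of the configured reduce tasks by hashing. A minimal sketch of that logic, written without the Hadoop classes, looks like this:

```java
// Sketch of Hadoop's default HashPartitioner logic (no Hadoop dependencies).
// Every record with the same key lands in the same partition, and each
// partition is consumed entirely by a single reduce task.
public class PartitionSketch {
    static int getPartition(String key, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then take
        // the remainder to pick a partition in [0, numReduceTasks).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(key + " -> partition " + getPartition(key, reducers));
        }
    }
}
```

Note that with this scheme the number of partitions equals the requested number of reduce tasks, but nothing stops a partition from being empty or from holding many keys.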

2 - That is just a theoretical upper bound on the number of reducers you can configure for a Hadoop cluster. It also depends heavily on the kind of data you are processing, which determines how much heavy lifting the reducers are burdened with.
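As a worked example of the 0.95 / 1.75 heuristic from the question (the 10-node, 8-containers-per-node cluster here is hypothetical):

```java
// Worked example of the sizing heuristic:
// reducers ≈ factor * (nodes * max containers per node).
// 0.95 lets all reduces launch in a single wave; 1.75 lets faster
// nodes run a second wave of reduces for better load balancing.
public class ReducerHeuristic {
    static int suggestedReducers(double factor, int nodes, int containersPerNode) {
        return (int) (factor * nodes * containersPerNode);
    }

    public static void main(String[] args) {
        int nodes = 10, containersPerNode = 8;  // hypothetical cluster
        System.out.println("0.95 wave: " + suggestedReducers(0.95, nodes, containersPerNode)); // 76
        System.out.println("1.75 wave: " + suggestedReducers(1.75, nodes, containersPerNode)); // 140
    }
}
```

Either value is only a starting point; the actual count you configure should be tuned to the job's data volume.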

3 - The mapred-site.xml configuration is just a suggestion to YARN. Internally, the ResourceManager runs its own scheduling algorithm, optimizing things on the go, so that value is not necessarily the number of reducer tasks running at any given time.
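For completeness, the per-job request can also be made on the command line through GenericOptionsParser rather than in mapred-site.xml; a sketch (the jar name, driver class, paths, and the value 10 are all placeholders):

```shell
# Request 10 reduce tasks for a single job.
# mapreduce.job.reduces is the newer name for mapred.reduce.tasks;
# this works when the driver implements the Tool interface.
hadoop jar my-job.jar MyDriver -D mapreduce.job.reduces=10 /input /output
```

Setting it per job is generally preferable to a cluster-wide default, since a sensible reducer count varies with each job's data.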

4 - This one seems a bit unrealistic. My block size might be 128 MB, and I can't always have 128*5 as my minimum number of reducers. I believe that one is false as well.

There is no fixed number of reducer tasks that can be configured or calculated in general; it depends on how many of the cluster's resources are actually available to allocate at that moment.