How do I decide when to use a map-side join or a reduce-side join when writing MapReduce code in Java?

hadoop mapreduce hadoop-streaming

jkalyanc · Apr 19, 2015 · Viewed 20.4k times · Source

Answer

A map-side join performs the join before the data reaches the map function, so the inputs must satisfy strict prerequisites. Both methods have pros and cons: a map-side join is more efficient than a reduce-side join because it avoids the sort and shuffle phase, but it requires the input data to be in a strict format.

Prerequisites:

The data must be partitioned and sorted in a particular way:
Each input dataset must be divided into the same number of partitions.
Each input dataset must be sorted by the same key (the join key).
All the records for a particular key must reside in the same partition.
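
As a rough illustration of why these prerequisites matter, here is a minimal plain-Java sketch of the merge step that a map-side join relies on. It is not Hadoop API code: the class name `MapSideMergeJoinSketch`, the tab-separated `key<TAB>value` record layout, and the sample data are all assumptions made for this example. Because partition p of one input holds exactly the same keys as partition p of the other, and both are sorted by key, each pair of partitions can be joined in a single pass without any shuffle.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; it only models the idea, it does not use Hadoop.
final class MapSideMergeJoinSketch {

    // Records are assumed to be "key<TAB>value" strings.
    private static String key(String record) {
        return record.split("\t", 2)[0];
    }

    private static String value(String record) {
        return record.split("\t", 2)[1];
    }

    // Merge-joins one partition of each input. Both lists must already be sorted by key.
    public static List<String> joinPartition(List<String> left, List<String> right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            String k = key(left.get(i));
            int cmp = k.compareTo(key(right.get(j)));
            if (cmp < 0) {
                i++;                       // left key has no match yet, advance left
            } else if (cmp > 0) {
                j++;                       // right key has no match yet, advance right
            } else {
                // Find the end of the run of right records that share this key.
                int jEnd = j;
                while (jEnd < right.size() && key(right.get(jEnd)).equals(k)) {
                    jEnd++;
                }
                // Pair every left record for this key with every right record in the run.
                while (i < left.size() && key(left.get(i)).equals(k)) {
                    for (int r = j; r < jEnd; r++) {
                        out.add(k + "\t" + value(left.get(i)) + "\t" + value(right.get(r)));
                    }
                    i++;
                }
                j = jEnd;
            }
        }
        return out;
    }

    // Joins partition p of the left input with partition p of the right input.
    public static List<String> join(List<List<String>> left, List<List<String>> right) {
        if (left.size() != right.size()) {
            throw new IllegalArgumentException("inputs must have the same number of partitions");
        }
        List<String> out = new ArrayList<>();
        for (int p = 0; p < left.size(); p++) {
            out.addAll(joinPartition(left.get(p), right.get(p)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> customers = List.of(
                List.of("1\tAlice", "3\tCarol"),
                List.of("2\tBob"));
        List<List<String>> orders = List.of(
                List.of("1\tbook", "1\tpen", "3\tlamp"),
                List.of("2\tmug"));
        join(customers, orders).forEach(System.out::println);
    }
}
```

If either input breaks one of the prerequisites (a different partition count, unsorted records, or a key split across partitions), this single-pass merge silently misses matches, which is why a reduce-side join is the fallback in that case.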

A reduce-side join, also called a repartitioned join or repartitioned sort-merge join, is the most commonly used join type. It has to go through the sort and shuffle phase, which incurs network overhead. A reduce-side join uses a few terms that are worth knowing: data source, tag, and group key.

Data source refers to the source files of the data, probably exported from an RDBMS.
Tag is used to mark every record with the name of its source, so that the source can be identified at any point in the map or reduce phase. The reducer needs it to tell which data source each record came from.
Group key refers to the column used as the join key between the two data sources.

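Because the join happens on the reduce side, the map phase must prepare the data so that it can be joined in the reduce phase: each mapper tags its records with their data source and emits them under the group key, and each reducer then combines the records from the different sources that share a key.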
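
The tagging idea can be sketched in plain Java as follows. This is only a simulation under stated assumptions, not Hadoop API code: the class name `ReduceSideJoinSketch`, the tags `CUST` and `ORD`, the tab-separated `key<TAB>value` record layout, and the use of a `TreeMap` to stand in for the sort and shuffle phase are all hypothetical choices made for this example. It performs an inner join, so keys that appear in only one data source produce no output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it only models the idea, it does not use Hadoop.
final class ReduceSideJoinSketch {

    static final String CUSTOMER_TAG = "CUST"; // assumed tag for the first data source
    static final String ORDER_TAG = "ORD";     // assumed tag for the second data source

    // "Map" step: tag every record with its source and emit it under its group key.
    // The shared map stands in for sort and shuffle, which groups values by key.
    public static void mapAndShuffle(String tag, List<String> records,
                                     Map<String, List<String>> shuffled) {
        for (String record : records) {
            String[] kv = record.split("\t", 2);
            shuffled.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(tag + ":" + kv[1]);
        }
    }

    // "Reduce" step: split the values for one key by tag, then pair them up.
    public static List<String> reduce(String key, List<String> taggedValues) {
        List<String> customers = new ArrayList<>();
        List<String> orders = new ArrayList<>();
        for (String tagged : taggedValues) {
            String[] parts = tagged.split(":", 2);
            if (parts[0].equals(CUSTOMER_TAG)) {
                customers.add(parts[1]);
            } else if (parts[0].equals(ORDER_TAG)) {
                orders.add(parts[1]);
            }
        }
        List<String> out = new ArrayList<>();
        for (String customer : customers) {
            for (String order : orders) {
                out.add(key + "\t" + customer + "\t" + order);
            }
        }
        return out;
    }

    // Runs the simulated map, shuffle, and reduce steps over both data sources.
    public static List<String> join(List<String> customers, List<String> orders) {
        Map<String, List<String>> shuffled = new TreeMap<>(); // keys come out sorted
        mapAndShuffle(CUSTOMER_TAG, customers, shuffled);
        mapAndShuffle(ORDER_TAG, orders, shuffled);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : shuffled.entrySet()) {
            out.addAll(reduce(entry.getKey(), entry.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> customers = List.of("2\tBob", "1\tAlice");
        List<String> orders = List.of("1\tbook", "2\tmug", "1\tpen", "4\tlamp");
        join(customers, orders).forEach(System.out::println);
    }
}
```

Unlike a map-side join, the inputs here need no particular order or partitioning, because the shuffle brings all records for a key together. The cost is that every record crosses the network.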

For more information check this link: http://hadoopinterviews.com/map-side-join-reduce-side-join/
