Multiple table join in hive

chhaya vishwakarma picture chhaya vishwakarma · Mar 13, 2015 · Viewed 46.2k times · Source

I have migrated Teradata tables' data into hive .

Now I have to build summary tables on top of imported data. Summary table needs to be built from five source tables

If I go with joins I'll need to join five tables is it possible in hive ? or should I break the query in five parts? what should be advisable approach for this problem?

Please suggest

Answer

StephenBoesch picture StephenBoesch · Mar 13, 2015

Five way joins in hive are of course possible and also (naturally) likely slow to very slow.

You should consider co-partitioning the tables on

  • identical partition columns
  • identical number of partitions

Other options include hints. For example consider if one of the tables were large and the others small. You may then be able to use streamtble hint

Assuming a is large:

SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val, d.val, e.val 
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1) join d on (d.key = c.key) join e on (e.key = d.key)

Adapted from : https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins :

All five tables are joined in a single map/reduce job and the values for a particular value of the key for tables b, c,d, and e are buffered in the memory in the reducers. Then for each row retrieved from a, the join is computed with the buffered rows. If the STREAMTABLE hint is omitted, Hive streams the rightmost table in the join.

Another hint is the mapjoin that is useful to cache small tables in memory.

Assuming a is large and b,c,d,e are small enough to fit in memory of each mapper:

 SELECT /*+ MAPJOIN(b,c,d,e) */  a.val, b.val, c.val, d.val, e.val 
 FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1) 
 join d on (d.key = c.key) join e on (e.key = d.key)