I have migrated Teradata tables' data into hive .
Now I have to build summary tables on top of imported data. Summary table needs to be built from five source tables
If I go with joins I'll need to join five tables is it possible in hive ? or should I break the query in five parts? what should be advisable approach for this problem?
Please suggest
Five way joins in hive are of course possible and also (naturally) likely slow to very slow.
You should consider co-partitioning the tables on
Other options include hints. For example consider if one of the tables were large and the others small. You may then be able to use streamtble hint
Assuming a is large:
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val, d.val, e.val
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1) join d on (d.key = c.key) join e on (e.key = d.key)
Adapted from : https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins :
All five tables are joined in a single map/reduce job and the values for a particular value of the key for tables b, c,d, and e are buffered in the memory in the reducers. Then for each row retrieved from a, the join is computed with the buffered rows. If the STREAMTABLE hint is omitted, Hive streams the rightmost table in the join.
Another hint is the mapjoin that is useful to cache small tables in memory.
Assuming a is large and b,c,d,e are small enough to fit in memory of each mapper:
SELECT /*+ MAPJOIN(b,c,d,e) */ a.val, b.val, c.val, d.val, e.val
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
join d on (d.key = c.key) join e on (e.key = d.key)