Hive - Efficient join of two tables

maia · Nov 25, 2013 · Viewed 33.3k times

I am joining two large tables in Hive (one is over 1 billion rows, one is about 100 million rows) like so:

create table joinedTable as select t1.id, ... from t1 join t2 ON (t1.id = t2.id);

I have bucketed the two tables in the same way, clustering by id into 100 buckets for each, but the query is still taking a long time.
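For context, the bucketed DDL looks roughly like this (a minimal sketch; the val column and any types other than id are placeholders, not the real schema):

    -- placeholder schema: only the id column matters for the bucketing
    create table t1 (id bigint, val string)
    clustered by (id) into 100 buckets;

    create table t2 (id bigint, val string)
    clustered by (id) into 100 buckets;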

Any suggestions on how to speed this up?

Answer

dimamah · Nov 26, 2013

As I see it, the answer is a bit more complicated than what @Adrian Lange offered.

First, you must understand a very important difference between a bucket join and a Sort-Merge Bucket Join (SMBJ):

To perform a bucket join, "the amount of buckets in one table must be a multiple of the amount of buckets in the other table", as stated before, and in addition hive.optimize.bucketmapjoin must be set to true.
When issuing a join, Hive will convert it into a bucket join if the above condition holds, BUT pay attention that Hive will not enforce the bucketing! This means that creating the table as bucketed isn't enough for the data to actually be bucketed into the specified amount of buckets, as Hive doesn't enforce this unless hive.enforce.bucketing is set to true (which means that the amount of buckets is actually set by the amount of reducers in the final stage of the query inserting data into the table).
On the performance side, please note that when using a bucket join a single task reads the "smaller" table into the distributed cache before the mappers access it and do the join. This stage would probably be very, very long and ineffective when your table has ~100m rows!
Afterwards the join will be done the same as in a regular join done in the reducers.
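As a hedged sketch of what that implies in practice, assuming t1/t2 were created with the bucketed DDL from the question and a hypothetical unbucketed staging table t2_staging holds the raw data:

    -- make inserts actually write 100 bucket files (reducers = bucket count)
    set hive.enforce.bucketing=true;
    insert overwrite table t2 select id, val from t2_staging;

    -- allow Hive to convert the join into a bucket map join
    set hive.optimize.bucketmapjoin=true;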

To perform an SMBJ, both tables must have the exact same amount of buckets, on the same columns, and be sorted by these columns, in addition to setting hive.optimize.bucketmapjoin.sortedmerge to true.
As in the previous optimization, Hive doesn't enforce the bucketing and the sorting but rather assumes you made sure that the tables are actually bucketed and sorted (not only by definition, but by setting hive.enforce.sorting or by manually sorting the data while inserting it). This is very important, as skipping it may lead to wrong results in both cases.
On the performance side, this optimization is way more efficient for the following reasons (a configuration sketch follows the list):

  1. Each mapper reads both buckets and there is no single-task contention for distributed cache loading
  2. The join being performed is a sort-merge join, as the data is already sorted, which is far more efficient.
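For illustration, a sketch of the sorted/bucketed DDL and settings an SMBJ would need (t1_smb, t2_smb and the val column are hypothetical names; both tables need the identical definition):

    create table t1_smb (id bigint, val string)
    clustered by (id) sorted by (id) into 100 buckets;

    create table t2_smb (id bigint, val string)
    clustered by (id) sorted by (id) into 100 buckets;

    -- make inserts actually bucket and sort the data
    set hive.enforce.bucketing=true;
    set hive.enforce.sorting=true;

    set hive.optimize.bucketmapjoin=true;
    set hive.optimize.bucketmapjoin.sortedmerge=true;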

Please note the following considerations:

  • In both cases, set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
    should be executed
  • In both cases, a /*+ MAPJOIN(b) */ hint should be applied in the query (right after the select, where b is the smaller table)
  • How many buckets?
    This should be viewed from this angle: the consideration should be applied strictly to the bigger table, as it has more impact from this direction, and later the configuration will be applied to the smaller table as a must. I think that, as a rule of thumb, each bucket should contain between 1 and 3 blocks, probably somewhere near 2 blocks. So if your block size is 256MB, it seems reasonable to me to have ~512MB of data in each bucket of the bigger table, so this becomes a simple division issue (a worked example and the full hinted query follow this list).
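Putting both considerations together, the final query would look roughly like this (table and column names are the hypothetical ones from the sketches above, and the bucket-count arithmetic in the comments is only an illustration):

    set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

    -- b is the smaller (~100m row) table, hinted right after the select
    select /*+ MAPJOIN(b) */ a.id, b.val
    from t1_smb a
    join t2_smb b on (a.id = b.id);

    -- bucket count illustration: with 256MB blocks and ~512MB per bucket,
    -- a bigger table of ~1TB would suggest 1TB / 512MB = 2048 buckets
    -- (and the smaller table would then use the same count for an SMBJ)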

Also, don't forget that these optimizations alone won't always guarantee a faster query time.
Let's say you choose to do an SMBJ: this adds the cost of sorting the 2 tables prior to running the join, so the more times you run your query, the less you are "paying" for this sorting stage.

Sometimes a simple join will lead to the best performance, and none of the above optimizations will help; you will have to optimize the regular join process either at the application/logical level or by tuning MapReduce / Hive settings like memory usage, parallelism, etc.