I am joining two large tables in Hive (one is over 1 billion rows, one is about 100 million rows) like so:
```
create table joinedTable as select t1.id, ... from t1 join t2 ON (t1.id = t2.id);
```
I have bucketed the two tables in the same way, clustering by id into 100 buckets for each, but the query is still taking a long time.
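For reference, the two tables were created roughly like this (the `val` column is just a placeholder for the real columns):

```
create table t1 (id bigint, val string)
clustered by (id) into 100 buckets;

create table t2 (id bigint, val string)
clustered by (id) into 100 buckets;
```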
Any suggestions on how to speed this up?
As I see it, the answer is a bit more complicated than what @Adrian Lange offered.
First you must understand a very important difference between BucketJoin and Sort-Merge Bucket Join (SMBJ):
To perform a bucket join, "the amount of buckets in one table must be a multiple of the amount of buckets in the other table," as stated before, and in addition `hive.optimize.bucketmapjoin` must be set to true.
When you issue a join, Hive will convert it into a bucket join if the above condition holds, BUT pay attention: Hive will not enforce the bucketing! This means that creating the table as bucketed isn't enough for the data to actually be bucketed into the specified number of buckets, because Hive doesn't enforce this unless `hive.enforce.bucketing` is set to true (when it is, the number of reducers in the final stage of the query inserting data into the table is set to the number of buckets).
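To make that concrete, these are the two properties referred to above; note that `hive.enforce.bucketing` matters at the time the tables are populated, not at join time (a minimal sketch, not a full script):

```
-- at load time: make Hive actually write the declared number of buckets
set hive.enforce.bucketing = true;

-- at query time: allow Hive to convert the join into a bucket map join
set hive.optimize.bucketmapjoin = true;
```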
On the performance side, please note that when using a bucket join, a single task reads the "smaller" table into the distributed cache before the mappers access it and do the join. This stage is likely to be very long and ineffective when your table has ~100m rows! Afterwards, the join itself is carried out in the mappers, just like a regular map join.
To perform a SMBJ, both tables must have the exact same number of buckets, on the same columns, and be sorted by these columns; in addition, `hive.optimize.bucketmapjoin.sortedmerge` must be set to true.
As with the previous optimization, Hive doesn't enforce the bucketing and the sorting; rather, it assumes you made sure the tables are actually bucketed and sorted (not only by definition, but by setting `hive.enforce.sorting` or by manually sorting the data while inserting it). This is very important, because in both cases a mismatch may lead to wrong results.
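For example, a sorted-bucketed version of the tables plus the relevant settings might look roughly like this (column list is illustrative; the `enforce` flags matter when the data is inserted):

```
-- both tables must be bucketed AND sorted on the join key, with the same bucket count
create table t1_sorted (id bigint, val string)
clustered by (id) sorted by (id) into 100 buckets;

-- at load time: enforce bucketing and sorting of the inserted data
set hive.enforce.bucketing = true;
set hive.enforce.sorting = true;

-- at query time: enable the sort-merge bucket join
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
```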
On the performance side, this optimization is considerably more efficient, for the following reasons:

- Each mapper reads the matching bucket of the other table directly, so there is no single task loading the "smaller" table into the distributed cache.
- Because the data in each bucket is already sorted on the join key, the join is performed as a merge join rather than by building hash tables.
Please note the following considerations:

- `set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;` should be executed.
- A `/*+ MAPJOIN(b) */` hint should be applied in the query, right after the `select`, where `b` is the smaller table (see the sketch below).
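Putting both considerations together with the original query, a hinted version might look roughly like this (here `t2` is assumed to be the smaller ~100m-row table, and `t2.val` stands in for whatever columns you actually select):

```
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

-- t2 is the smaller table, so it goes in the MAPJOIN hint
create table joinedTable as
select /*+ MAPJOIN(t2) */ t1.id, t2.val
from t1 join t2 on (t1.id = t2.id);
```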
Also, don't forget that these optimizations alone won't always guarantee a faster query time.
Let's say you choose to do a SMBJ: this adds the cost of sorting the two tables prior to running the join, so the more times you run your query, the less you are "paying" for this sorting stage.

Sometimes a simple join will lead to the best performance, and none of the above optimizations will help; you will have to optimize the regular join process either at the application/logical level or by tuning MapReduce / Hive settings such as memory usage and parallelism.
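For completeness, these are examples of the kind of MapReduce / Hive knobs meant here (values are placeholders, not recommendations):

```
set hive.exec.parallel = true;                         -- run independent stages in parallel
set hive.exec.reducers.bytes.per.reducer = 256000000;  -- controls reducer parallelism
set mapreduce.map.memory.mb = 4096;                    -- per-mapper container memory
set mapreduce.reduce.memory.mb = 4096;                 -- per-reducer container memory
```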