Can I cluster by/bucket a table created via "CREATE TABLE AS SELECT....." in Hive?

Andrew picture Andrew · Jul 22, 2014 · Viewed 8.3k times · Source

I am trying to create a table in Hive

CREATE TABLE BUCKET_TABLE AS 
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll
CLUSTERED BY (key) INTO 1000 BUCKETS;

This syntax is failing - but I am not sure if it is even possible to do this combined statement. Any ideas?

Answer

Nebulastic picture Nebulastic · Aug 1, 2017

Came across this question and saw there was no answer provided. I looked further and found the answer in the Hive documentation.

This will never work, because of the following restrictions on CTAS:

  1. The target table cannot be a partitioned table.
  2. The target table cannot be an external table.
  3. The target table cannot be a list bucketing table.

Source: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableAsSelect%28CTAS

Furthermore https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
...
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
...
[AS select_statement];

Clustering requires the column to be defined and then the cfg goes to the As select_statement Therefore at this time it is not possible.

Optionally, you can ALTER the table and add buckets, but this does not change existing data.

CREATE TABLE BUCKET_TABLE 
STORED AS ORC AS 
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll limit 0;
ALTER TABLE BUCKET_TABLE CLUSTERED BY (key) INTO 1000 BUCKETS;
ALTER TABLE BUCKET_TABLE SET TBLPROPERTIES ('transactional'='true');
INSERT INTO BUCKET_TABLE 
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll;