I am trying to create a table in Hive
CREATE TABLE BUCKET_TABLE AS
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll
CLUSTERED BY (key) INTO 1000 BUCKETS;
This syntax is failing - but I am not sure if it is even possible to do this combined statement. Any ideas?
Came across this question and saw there was no answer provided. I looked further and found the answer in the Hive documentation.
This will never work, because of the following restrictions on CTAS:
Furthermore https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
...
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
...
[AS select_statement];
Clustering requires the column to be defined and then the cfg goes to the As select_statement Therefore at this time it is not possible.
Optionally, you can ALTER the table and add buckets, but this does not change existing data.
CREATE TABLE BUCKET_TABLE
STORED AS ORC AS
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll limit 0;
ALTER TABLE BUCKET_TABLE CLUSTERED BY (key) INTO 1000 BUCKETS;
ALTER TABLE BUCKET_TABLE SET TBLPROPERTIES ('transactional'='true');
INSERT INTO BUCKET_TABLE
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll;