I have data saved as Parquet files in Azure Blob storage. The data is partitioned by year, month, day, and hour, like:
cont/data/year=2017/month=02/day=01/
I want to create an external table in Hive using the following CREATE statement, which I wrote using this reference.
CREATE EXTERNAL TABLE table_name (uid string, title string, value string)
PARTITIONED BY (year int, month int, day int) STORED AS PARQUET
LOCATION 'wasb://cont@storage_name.blob.core.windows.net/data';
This creates the table, but it returns no rows when queried. I tried the same CREATE statement without the PARTITIONED BY
clause, and that works. So it looks like the issue is with the partitioning.
What am I missing?
After you create the partitioned table, run the following to add the directories as partitions:
MSCK REPAIR TABLE table_name;
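Afterwards, you can check which partitions the metastore now tracks (using the table name from the question):

```
SHOW PARTITIONS table_name;
```

If the repair worked, this lists one entry per `year=...​/month=...​/day=...` directory found under the table's location.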
If you have a large number of partitions, you might need to set hive.msck.repair.batch.size:
When there is a large number of untracked partitions, MSCK REPAIR TABLE can be run batch-wise to avoid an OOME (Out of Memory Error). Given a batch size via the property hive.msck.repair.batch.size, it runs in batches internally. The default value of the property is zero, which means it processes all partitions at once.
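For example, to have the repair run in batches of 3,000 partitions (the batch size here is illustrative; tune it to your metastore's capacity):

```
SET hive.msck.repair.batch.size=3000;
MSCK REPAIR TABLE table_name;
```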
Written by the OP:
This will probably fix your issue; however, if the data is very large, it won't work. See the relevant issue here.
As a workaround, you can add partitions to the Hive metastore one by one, like:
ALTER TABLE table_name ADD PARTITION (year=2016, month=10, day=11, hour=11);
We wrote a simple script to automate this ALTER statement, and it seems to work for now.
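A script like that could look roughly as follows. This is a minimal sketch, not the OP's actual script: it just generates one ALTER statement per hour in a date range, which you then feed to `hive -f` or beeline. The table name and date range are illustrative; `ADD IF NOT EXISTS` makes re-runs safe.

```python
from datetime import datetime, timedelta

def alter_statements(table, start, end):
    """Yield one ALTER TABLE ... ADD PARTITION statement per hour in [start, end].

    IF NOT EXISTS makes each statement idempotent, so the generated
    script can be re-run without errors on already-added partitions.
    """
    t = start
    while t <= end:
        yield (
            f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION "
            f"(year={t.year}, month={t.month}, day={t.day}, hour={t.hour});"
        )
        t += timedelta(hours=1)

if __name__ == "__main__":
    # Print statements for the first two hours of 2017-02-01;
    # redirect the output to a file and run it with `hive -f`.
    for stmt in alter_statements("table_name",
                                 datetime(2017, 2, 1, 0),
                                 datetime(2017, 2, 1, 1)):
        print(stmt)
```

Note that Hive also accepts several PARTITION specs in a single ALTER TABLE statement, which cuts down on metastore round-trips if you batch them.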