I'm trying to split parquet/snappy files created by Hive INSERT OVERWRITE TABLE... on the dfs.block.size boundary, because Impala issues a warning when a file in a partition is larger than the block size.
Impala logs warnings like the following:
Parquet files should not be split into multiple hdfs-blocks. file=hdfs://<SERVER>/<PATH>/<PARTITION>/000000_0 (1 of 7 similar)
Code:
CREATE TABLE <TABLE_NAME>(<FIELDS>)
PARTITIONED BY (
year SMALLINT,
month TINYINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\037'
STORED AS PARQUET TBLPROPERTIES ("parquet.compression"="SNAPPY");
As for the INSERT, the HQL script is:
SET dfs.block.size=134217728;
SET hive.exec.reducers.bytes.per.reducer=134217728;
SET hive.merge.mapfiles=true;
SET hive.merge.size.per.task=134217728;
SET hive.merge.smallfiles.avgsize=67108864;
SET hive.exec.compress.output=true;
SET mapred.max.split.size=134217728;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
INSERT OVERWRITE TABLE <TABLE_NAME>
PARTITION (year=<YEAR>, month=<MONTH>)
SELECT <FIELDS>
from <ANOTHER_TABLE> where year=<YEAR> and month=<MONTH>;
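A related writer-side knob, separate from dfs.block.size, is the Parquet row group size. A minimal sketch, assuming the parquet.block.size property is honoured by this Hive/Parquet version, would be to add it to the same script:
-- Parquet row group size in bytes (assumed property name; controls row groups, not the output file size)
SET parquet.block.size=134217728;
INSERT OVERWRITE TABLE <TABLE_NAME>
PARTITION (year=<YEAR>, month=<MONTH>)
SELECT <FIELDS>
from <ANOTHER_TABLE> where year=<YEAR> and month=<MONTH>;
Note that this would only keep individual row groups at or below one HDFS block; it does not cap the size of the output file itself.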
The issue is that the file sizes are all over the place:
partition 1: 1 file:  size = 163.9 M
partition 2: 2 files: size = 207.4 M, 128.0 M
partition 3: 3 files: size = 166.3 M, 153.5 M, 162.6 M
partition 4: 3 files: size = 151.4 M, 150.7 M, 45.2 M
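For reference, the per-partition file sizes above can be listed from Impala with something like the following, assuming the SHOW FILES statement is available in this Impala version:
-- lists each file and its size for one partition
SHOW FILES IN <TABLE_NAME> PARTITION (year=<YEAR>, month=<MONTH>);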
The issue is the same no matter whether the dfs.block.size setting (and the other settings above) is increased to 256M, 512M, or 1G (for different data sets).
Is there a way (or a set of settings) to make sure that the output parquet/snappy files are split just below the HDFS block size?
There is no way to close a file once it grows to the size of a single HDFS block and start a new one. That would go against how HDFS typically works: having files that span many blocks.
The right solution is for Impala to schedule its tasks where the blocks are local instead of complaining that the file spans more than one block. This was completed recently as IMPALA-1881 and will be released in Impala 2.3.