I am running a few tests on the storage formats available with Hive, using Parquet and ORC as the major options. I included ORC once with default compression and once with Snappy.
I have read many documents stating that Parquet is better than ORC in time/space complexity, but my tests show the opposite of what those documents claim.
Here are some details of my data.
Table A (Text file format): 2.5 GB
Table B (ORC): 652 MB
Table C (ORC with Snappy): 802 MB
Table D (Parquet): 1.9 GB
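In case it matters, the tables were created along these lines (a sketch; the table names and the CTAS-from-text approach are placeholders rather than my exact DDL):

    -- Hypothetical DDL: each table is built from the text-format table.
    CREATE TABLE table_b_orc STORED AS ORC
    AS SELECT * FROM table_a_text;

    CREATE TABLE table_c_orc_snappy STORED AS ORC
    TBLPROPERTIES ("orc.compress" = "SNAPPY")
    AS SELECT * FROM table_a_text;

    CREATE TABLE table_d_parquet STORED AS PARQUET
    AS SELECT * FROM table_a_text;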
Parquet gave the worst compression for my table. My tests on the above tables yielded the following results (cumulative CPU time, in seconds).
    Operation                              Text     Parquet   ORC      ORC + Snappy
    Row count                              123.33   204.92    119.99   107.05
    Sum of a column                        127.85   255.20    120.48    98.27
    Average of a column                    128.79   211.73    165.50   135.45
    Select 4 columns in a range (WHERE)     72.48   136.40     96.63    82.05
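For context, the queries behind these numbers had the following shape (a sketch; the column names c1 through c4 and the range bounds are placeholders, not my actual schema):

    SELECT COUNT(*) FROM table_b_orc;            -- row count
    SELECT SUM(c1) FROM table_b_orc;             -- sum of a column
    SELECT AVG(c1) FROM table_b_orc;             -- average of a column
    SELECT c1, c2, c3, c4 FROM table_b_orc
    WHERE c1 BETWEEN 1000 AND 2000;              -- 4 columns in a range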
Does that mean ORC is faster than Parquet? Or is there something I can do to get better query response times and compression ratios from Parquet?
Thanks!
I would say that both of these formats have their own advantages.
Parquet might be better if you have highly nested data, because it stores its elements as a tree, as Google Dremel does (see here).
Apache ORC might be better if your file structure is flat.
And as far as I know, Parquet does not support indexes yet. ORC comes with a lightweight index and, since Hive 0.14, an additional Bloom filter, which might improve query response time, especially for sum operations.
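For example, the Bloom filter is enabled per column through table properties (a sketch; orders_orc and customer_id are placeholder names):

    CREATE TABLE orders_orc (id BIGINT, customer_id BIGINT, amount DOUBLE)
    STORED AS ORC
    TBLPROPERTIES (
      "orc.create.index" = "true",                 -- lightweight row-group index (on by default)
      "orc.bloom.filter.columns" = "customer_id",  -- columns to build Bloom filters for
      "orc.bloom.filter.fpp" = "0.05"              -- false-positive probability
    );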
The Parquet default compression is SNAPPY. Do Tables A, B, C, and D hold the same dataset? If so, something looks shady when it only compresses to 1.9 GB.
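If the datasets really are identical, it may be worth setting the Parquet codec explicitly to rule out an uncompressed write (a sketch; the table names are placeholders):

    SET parquet.compression=SNAPPY;   -- session-level default for Parquet writes
    -- or per table:
    CREATE TABLE table_d_parquet
    STORED AS PARQUET
    TBLPROPERTIES ("parquet.compression" = "SNAPPY")
    AS SELECT * FROM table_a_text;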