How do I combine or merge small ORC files into a larger ORC file?

Chris C · Apr 26, 2018 · Viewed 7.6k times

Most questions and answers on SO and the web discuss using Hive to combine a bunch of small ORC files into one larger file. However, my ORC files are log files separated by day, and I need to keep the days separate. I only want to "roll up" the ORC files within each day (each day is a directory in HDFS).

I will most likely need to write the solution in Java. I have come across OrcFileMergeOperator, which may be what I need, but it is too early to tell.

What is the best approach to solving this issue?

Answer

leftjoin · Apr 26, 2018

You do not need to re-invent the wheel.

ALTER TABLE table_name [PARTITION partition_spec] CONCATENATE has been available since Hive 0.14.0 for merging small ORC files into larger ones. The merge happens at the stripe level, which avoids decompressing and decoding the data, so it is fast. I'd suggest creating an external table partitioned by day (each partition is a directory), then merging the partitions by specifying PARTITION (day_column) as the partition spec.
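As a rough sketch of that suggestion in HiveQL: the table name logs_orc, the column msg, the partition column day, and the HDFS paths below are all hypothetical placeholders, not taken from the question. It assumes each day's directory is registered as its own partition and that CONCATENATE is then run one partition at a time with an explicit day value. Whether CONCATENATE is permitted on an external table may depend on the Hive version, so check this against your installation before relying on it.

```sql
-- Hypothetical table, column, and path names; adjust to the real log schema.
CREATE EXTERNAL TABLE logs_orc (
  msg STRING                      -- placeholder for the actual ORC columns
)
PARTITIONED BY (day STRING)       -- one partition per daily directory
STORED AS ORC
LOCATION '/data/logs';

-- Register one day's existing directory as a partition.
ALTER TABLE logs_orc ADD IF NOT EXISTS PARTITION (day = '2018-04-26')
LOCATION '/data/logs/2018-04-26';

-- Merge the small ORC files of that single day; other days are untouched.
ALTER TABLE logs_orc PARTITION (day = '2018-04-26') CONCATENATE;
```

Because the partition spec names a single day, files are only merged within that day's directory, which keeps the days separate as the question requires. The last two statements would be repeated for each day that needs rolling up.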

See here: LanguageManual+ORC