Most questions and answers on SO and the web discuss using Hive to combine a bunch of small ORC files into a larger one. However, my ORC files are log files separated by day, and I need to keep the days separate. I only want to "roll up" the ORC files within each day (each day is a directory in HDFS).
I will most likely need to write the solution in Java. I have come across OrcFileMergeOperator, which may be what I need, but it is still too early to tell.
What is the best approach to solving this issue?
You do not need to re-invent the wheel.
ALTER TABLE table_name [PARTITION partition_spec] CONCATENATE
can be used to merge small ORC files into a larger file since Hive 0.14.0.
The merge happens at the stripe level, which avoids decompressing and decoding the data, so it is fast. I'd suggest creating an external table partitioned by day (partitions are directories), then merging each day separately by specifying PARTITION (day_column=<value>)
as the partition spec, one statement per day.
See here: LanguageManual+ORC
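Since you mentioned Java, here is a minimal sketch of how the per-day statements could be generated. The table name logs, the partition column day_column, and the sample dates are placeholders I made up for illustration, and I am assuming the partition column is a string. The class only builds the statement text; it does not talk to Hive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: builds one ALTER TABLE ... CONCATENATE statement per day.
// Table name, column name and dates are hypothetical placeholders.
class ConcatenateStatements {

    // Builds the statement for a single day partition.
    // Assumes the values are trusted; no escaping is done here.
    public static String forDay(String table, String partitionColumn, String day) {
        return "ALTER TABLE " + table
                + " PARTITION (" + partitionColumn + "='" + day + "') CONCATENATE";
    }

    // Builds one statement per day, in the order the days are given.
    public static List<String> forDays(String table, String partitionColumn, List<String> days) {
        List<String> statements = new ArrayList<>();
        for (String day : days) {
            statements.add(forDay(table, partitionColumn, day));
        }
        return statements;
    }

    public static void main(String[] args) {
        List<String> days = Arrays.asList("2015-06-01", "2015-06-02");
        for (String sql : forDays("logs", "day_column", days)) {
            System.out.println(sql);
        }
    }
}
```

Each generated string could then be run from the Hive CLI or, if you want to stay in Java, passed to java.sql.Statement.execute over a Hive JDBC connection. Because every statement names exactly one partition, files from different days are never merged together.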