Concat Avro files using avro-tools

54l3d · Jan 18, 2016 · Viewed 8.5k times

I'm trying to merge several Avro files into one big file. The problem is that the concat command does not accept a wildcard:

hadoop jar avro-tools.jar concat /input/part* /output/bigfile.avro

I get:

Exception in thread "main" java.io.FileNotFoundException: File does not exist: /input/part*

I tried wrapping the path in double quotes ("") and single quotes (''), but it made no difference.

Answer

Clément MATHIEU · Jan 20, 2016

I quickly checked Avro's source code (1.7.7) and it seems that concat does not support glob patterns: basically, it calls FileSystem.open() on each argument except the last one, which is the output file.
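This also explains why quoting changes nothing. The sketch below illustrates default POSIX shell behaviour (no nullglob or failglob set): a pattern that matches no local file is handed to the program as one literal word, exactly like a quoted pattern. Since /input/part* is an HDFS path with no local match, concat receives the literal string in every case. The scratch directory and the count_args helper are made up for this illustration.

```shell
# Hypothetical helper: print how many arguments a command would receive.
count_args() { echo "$#"; }

# Scratch directory so the demo does not depend on files already on disk.
demo_dir=$(mktemp -d)
touch "$demo_dir/part-0.avro" "$demo_dir/part-1.avro"

# Local files match: the shell expands the glob into two arguments.
matched=$(count_args "$demo_dir"/part*)
# Nothing matches locally: the pattern is passed through as one literal word.
unmatched=$(count_args "$demo_dir"/missing*)
# Quoted pattern: never expanded by the shell, also one literal word.
quoted=$(count_args "$demo_dir/part*")

echo "matched=$matched unmatched=$unmatched quoted=$quoted"
rm -r "$demo_dir"
```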

It means that you have to explicitly provide all the file names as arguments. It is cumbersome, but the following commands should do what you want:

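# The quoted glob is expanded by "hadoop fs -ls"; awk keeps the last field (the path) of each line.
IN=$(hadoop fs -ls '/input/part*' | awk '{printf "%s ", $NF}')
# ${IN} is left unquoted on purpose so that each path becomes its own argument.
hadoop jar avro-tools.jar concat ${IN} /output/bigfile.avro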
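A slightly more defensive variant of the awk step, as a sketch: if the listing happens to contain a "Found N items" header or directory entries, taking the last field of every line would pass bogus names to concat. Filtering on the permissions column (regular files start with "-") avoids that. The function name list_avro_paths and the sample listing are invented for illustration, and paths are assumed to contain no whitespace.

```shell
# Hypothetical helper: read `hadoop fs -ls` output on stdin and print the
# paths (last field) of regular-file entries on one line, space-separated.
# Lines whose first field does not start with "-" (headers, directories)
# are skipped. Assumes paths contain no whitespace.
list_avro_paths() {
  awk '$1 ~ /^-/ { printf "%s ", $NF }'
}

# Made-up listing, used here instead of a real cluster.
sample_listing='Found 3 items
-rw-r--r--   3 user group       1024 2016-01-18 10:00 /input/part-00000.avro
-rw-r--r--   3 user group       2048 2016-01-18 10:00 /input/part-00001.avro
drwxr-xr-x   - user group          0 2016-01-18 10:00 /input/part-tmp'

IN=$(printf '%s\n' "$sample_listing" | list_avro_paths)
echo "$IN"
```

Against a real cluster, the sample would be replaced by the actual listing, for example IN=$(hadoop fs -ls '/input/part*' | list_avro_paths), followed by the same concat call as above.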

Support for glob patterns would be a nice addition to this command.