I have many files in HDFS, each of them a ZIP archive with a single CSV file inside. I'm trying to uncompress the files so I can run a streaming job on them.
I tried:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapred.reduce.tasks=0 \
-mapper /bin/zcat -reducer /bin/cat \
-input /path/to/files/ \
-output /path/to/output
However, I get an error: subprocess failed with code 1
I also tried running it on a single file and got the same error.
Any advice?
The root cause of the problem is that Hadoop prints a lot of informational text to the stream before you receive the actual data.
For example, hdfs dfs -cat hdfs://hdm1.gphd.local:8020/hive/gphd/warehouse/my.db/my/part-m-00000.gz | zcat | wc -l will NOT work either; it fails with a "gzip: stdin: not in gzip format" error.
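If you want to check that this extra text is what trips up zcat, a quick look at the stream (just a sketch, assuming standard coreutils on the client; the path is the same example file) shows where the real data starts; a gzip stream begins with the magic bytes 1f 8b, so anything printed before them is the extra text:

# Inspect the first 256 bytes of the stream; gzip data begins at the 1f 8b magic bytes.
hdfs dfs -cat hdfs://hdm1.gphd.local:8020/hive/gphd/warehouse/my.db/my/part-m-00000.gz | head -c 256 | od -c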
Therefore you should skip this "unnecessary" info. In my case I had to skip 86 lines.
So my one-line command (for counting the records) is: hdfs dfs -cat hdfs://hdm1.gphd.local:8020/hive/gphd/warehouse/my.db/my/part-m-00000.gz | tail -n +86 | zcat | wc -l
Note: this is a workaround (not a real solution) and it is very ugly because of the hard-coded "86", but it works fine :)
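If the hard-coded "86" bothers you, a possible refinement (only a sketch, assuming GNU grep and tail on the client, and that the first occurrence of the gzip magic bytes 1f 8b really marks the start of the payload) is to find the byte offset of the magic bytes and skip everything before it:

# Find the byte offset of the first gzip magic sequence (1f 8b) in the stream.
OFFSET=$(hdfs dfs -cat hdfs://hdm1.gphd.local:8020/hive/gphd/warehouse/my.db/my/part-m-00000.gz | LC_ALL=C grep -a -b -o -m1 $'\x1f\x8b' | cut -d: -f1)
# Skip everything before that offset, then decompress and count records as before.
hdfs dfs -cat hdfs://hdm1.gphd.local:8020/hive/gphd/warehouse/my.db/my/part-m-00000.gz | tail -c +$((OFFSET + 1)) | zcat | wc -l

This streams the file twice, but it avoids counting the preamble lines by hand.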