I have a bunch of .gz files in a folder in HDFS. I want to unzip all of these .gz files to a new folder in HDFS. How should I do this?
I can think of three ways to achieve this.
Using Linux command line
The following command worked for me:
hadoop fs -cat /tmp/Links.txt.gz | gzip -d | hadoop fs -put - /tmp/unzipped/Links.txt
My gzipped file is Links.txt.gz
The output gets stored in /tmp/unzipped/Links.txt
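To apply the same pipeline to every .gz file in a folder, a loop along these lines might work. This is only a minimal sketch: the function names unzipped_path and unzip_all and the source folder /tmp/zipped are made up for illustration, and the listing is parsed by taking the last column of hadoop fs -ls, which breaks on paths that contain spaces.

```shell
#!/usr/bin/env bash
# Sketch: decompress every .gz file in one HDFS folder into another.
# Function names and example paths are illustrative assumptions.

# Derive the output path for one input:
#   /tmp/zipped/Links.txt.gz + /tmp/unzipped -> /tmp/unzipped/Links.txt
unzipped_path() {
  local src="$1" dest_dir="$2"
  local name="${src##*/}"                          # drop the directory part
  printf '%s/%s\n' "${dest_dir%/}" "${name%.gz}"   # drop the .gz suffix
}

# Stream each .gz file through gzip -d, like the single-file command above.
unzip_all() {
  local src_dir="$1" dest_dir="$2" src
  hadoop fs -mkdir -p "$dest_dir"
  # The last column of the listing is the path; keep only the .gz entries.
  hadoop fs -ls "${src_dir%/}/" | awk '{print $NF}' | grep '\.gz$' |
  while read -r src; do
    # </dev/null keeps hadoop from touching the list of remaining paths.
    hadoop fs -cat "$src" </dev/null | gzip -d |
      hadoop fs -put - "$(unzipped_path "$src" "$dest_dir")"
  done
}
```

After sourcing the file, it could be called as: unzip_all /tmp/zipped /tmp/unzipped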
Using Java program
In the book Hadoop: The Definitive Guide, there is a section on codecs. That section includes a program that decompresses a file using CompressionCodecFactory. I am reproducing that code as is:
package com.myorg.hadooptests;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

public class FileDecompressor {

    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        Path inputPath = new Path(uri);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.err.println("No codec found for " + uri);
            System.exit(1);
        }

        String outputUri =
                CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(new Path(outputUri));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}
This code takes the gz file path as input.
You can execute this as:
FileDecompressor <gzipped file name>
For example, when I ran it on my gzipped file:
FileDecompressor /tmp/Links.txt.gz
I got the unzipped file at location: /tmp/Links.txt
The program stores the unzipped file in the same folder as the input, so you need to modify the code to take two input parameters: <input file path> and <output folder>.
Once you get this program working, you can write a Shell/Perl/Python script to call this program for each of the inputs you have.
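One possible shape for such a shell script is sketched below. It uses the unmodified program and moves each result into the output folder afterwards, rather than changing the Java code. It assumes the compiled class is on HADOOP_CLASSPATH so that it can be started as hadoop <class name>; that setup, the function names sibling_output and decompress_into, and the second file name in the usage line are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: run FileDecompressor on each input, then move the result.
# Assumes the compiled class is reachable through HADOOP_CLASSPATH.

DECOMPRESSOR_CLASS="com.myorg.hadooptests.FileDecompressor"

# FileDecompressor writes next to its input:
#   /tmp/Links.txt.gz -> /tmp/Links.txt
sibling_output() {
  printf '%s\n' "${1%.gz}"
}

# Usage: decompress_into <output folder> <gz file>...
decompress_into() {
  local dest_dir="$1" src out
  shift
  hadoop fs -mkdir -p "$dest_dir"
  for src in "$@"; do
    # Stop at the first failure so a missing output is not moved.
    hadoop "$DECOMPRESSOR_CLASS" "$src" || return 1
    out="$(sibling_output "$src")"
    hadoop fs -mv "$out" "${dest_dir%/}/${out##*/}"
  done
}
```

After sourcing the file, it could be called as: decompress_into /tmp/unzipped /tmp/Links.txt.gz /tmp/Other.txt.gz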
Using Pig script
You can write a simple Pig script to achieve this.
I wrote the following script, which works:
A = LOAD '/tmp/Links.txt.gz' USING PigStorage();
Store A into '/tmp/tmp_unzipped/' USING PigStorage();
mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/
When you run this script, the unzipped contents are stored in a temporary folder, /tmp/tmp_unzipped. This folder will contain:
/tmp/tmp_unzipped/_SUCCESS
/tmp/tmp_unzipped/part-m-00000
The part-m-00000 file contains the unzipped contents.
Hence, we need to explicitly rename it using the following commands and finally delete the /tmp/tmp_unzipped folder:
mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/
So, if you use this Pig script, you just need to take care of parameterizing the file name (Links.txt.gz and Links.txt).
Again, once you get this script working, you can write a Shell/Perl/Python script to call this Pig script for each of the inputs you have.
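A rough sketch of that parameterization, using Pig's parameter substitution (-param on the command line, $name inside the script), might look like this. The script file name, the parameter names input, tmpdir and output, and the function names are all made up for illustration, and the sketch has not been run against a real cluster.

```shell
#!/usr/bin/env bash
# Sketch: a parameterized copy of the Pig script plus a wrapper to run it.
# Parameter names (input, tmpdir, output) are illustrative assumptions.

# Write the Pig script to the given local path. The quoted 'EOF' keeps
# the shell from expanding $input etc.; Pig substitutes them at run time.
write_pig_script() {
  cat > "$1" <<'EOF'
A = LOAD '$input' USING PigStorage();
STORE A INTO '$tmpdir' USING PigStorage();
mv $tmpdir/part-m-00000 $output
rm $tmpdir
EOF
}

# Usage: pig_unzip <pig script> <gz file> <output folder>
pig_unzip() {
  local script="$1" src="$2" dest_dir="$3"
  local name="${src##*/}"
  pig -param input="$src" \
      -param tmpdir="/tmp/tmp_unzipped" \
      -param output="${dest_dir%/}/${name%.gz}" \
      -f "$script"
}
```

After sourcing the file, it could be used as: write_pig_script unzip.pig, followed by pig_unzip unzip.pig /tmp/Links.txt.gz /tmp/unzipped for each input.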