How to unzip .gz files into a new directory in Hadoop?

Monica · Jan 3, 2016 · Viewed 43.7k times

I have a bunch of .gz files in a folder in HDFS. I want to unzip all of these .gz files to a new folder in HDFS. How should I do this?

Answer

Manjunath Ballur · Jan 3, 2016

I can think of 3 different ways to achieve this.

  1. Using the Linux command line

    The following command worked for me:

    hadoop fs -cat /tmp/Links.txt.gz | gzip -d | hadoop fs -put - /tmp/unzipped/Links.txt
    

    My gzipped file is Links.txt.gz
    The output gets stored in /tmp/unzipped/Links.txt
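
    Since the question is about a whole folder of .gz files, the same pipeline can be driven by a small shell loop. Below is a minimal sketch, assuming the source and target HDFS folders are /tmp/gz_files and /tmp/unzipped (both placeholders) and that the file names contain no spaces:

    #!/bin/bash
    SRC=/tmp/gz_files      # HDFS folder holding the .gz files (placeholder)
    DST=/tmp/unzipped      # HDFS folder for the unzipped output (placeholder)

    hadoop fs -mkdir -p "$DST"

    # The last column of 'hadoop fs -ls' is the full file path; the grep drops the
    # "Found N items" header line.
    for f in $(hadoop fs -ls "$SRC/*.gz" | awk '{print $NF}' | grep '\.gz$'); do
        name=$(basename "$f" .gz)
        hadoop fs -cat "$f" | gzip -d | hadoop fs -put - "$DST/$name"
    done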

  2. Using a Java program

    In the book Hadoop: The Definitive Guide, there is a section on Codecs. That section contains a program that decompresses a file using a codec inferred by CompressionCodecFactory. I am reproducing that code here:

    package com.myorg.hadooptests;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URI;
    
    public class FileDecompressor {
        public static void main(String[] args) throws Exception {
            String uri = args[0];
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            Path inputPath = new Path(uri);
            // Infer the compression codec (e.g. GzipCodec for .gz) from the file name extension
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            CompressionCodec codec = factory.getCodec(inputPath);
            if (codec == null) {
                System.err.println("No codec found for " + uri);
                System.exit(1);
            }
            // The output path is the input path with the codec's suffix (.gz) stripped
            String outputUri =
                    CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
            InputStream in = null;
            OutputStream out = null;
            try {
                in = codec.createInputStream(fs.open(inputPath));
                out = fs.create(new Path(outputUri));
                IOUtils.copyBytes(in, out, conf);
            } finally {
                IOUtils.closeStream(in);
                IOUtils.closeStream(out);
            }
        }
    }
    

    This code takes the path of the .gz file as input.
    You can execute it as:

    FileDecompressor <gzipped file name>
    

    For example, when I executed it for my gzipped file:

    FileDecompressor /tmp/Links.txt.gz
    

    I got the unzipped file at location: /tmp/Links.txt

    Note that it stores the unzipped file in the same folder as the input. To write the output elsewhere, you need to modify this code to take 2 input parameters: <input file path> and <output folder>.

    Once you get this program working, you can write a Shell/Perl/Python script to call it for each of your input files; one possible shell wrapper is sketched below.
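
    For example, a wrapper along these lines could drive the (unmodified) program over a whole folder and then move each result into the target folder. This is only a sketch: the jar name hadooptests.jar and the paths /tmp/gz_files and /tmp/unzipped are placeholders.

    #!/bin/bash
    SRC=/tmp/gz_files      # HDFS folder holding the .gz files (placeholder)
    DST=/tmp/unzipped      # HDFS folder for the unzipped output (placeholder)

    hadoop fs -mkdir -p "$DST"

    for f in $(hadoop fs -ls "$SRC/*.gz" | awk '{print $NF}' | grep '\.gz$'); do
        # The program writes the decompressed file next to the input ...
        hadoop jar hadooptests.jar com.myorg.hadooptests.FileDecompressor "$f"
        # ... so move that file (same path minus the .gz suffix) into the target folder.
        hadoop fs -mv "${f%.gz}" "$DST/"
    done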

  3. Using a Pig script

    You can write a simple Pig script to achieve this.

    I wrote the following script, which works:

    A = LOAD '/tmp/Links.txt.gz' USING PigStorage();
    STORE A INTO '/tmp/tmp_unzipped/' USING PigStorage();
    mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
    rm /tmp/tmp_unzipped/
    

    After the STORE statement runs, the unzipped contents are written to a temporary folder, /tmp/tmp_unzipped. This folder will contain:

    /tmp/tmp_unzipped/_SUCCESS
    /tmp/tmp_unzipped/part-m-00000
    

    The part-m-00000 contains the unzipped file.

    Hence, the last two lines of the script explicitly rename it and then delete the /tmp/tmp_unzipped folder:

    mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
    rm /tmp/tmp_unzipped/
    

    So, if you use this Pig script, you just need to take care of parameterizing the file names (Links.txt.gz and Links.txt).

    Again, once you get this script working, you can write a Shell/Perl/Python script to call this Pig script for each of your input files; one possible wrapper is sketched below.
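
    For example, if the LOAD path and the final mv target are parameterized as $input_file and $output_file (Pig substitutes these when invoked with -param), a wrapper along these lines could run the script, saved here as unzip.pig, once per file. All names and paths below are placeholders:

    #!/bin/bash
    SRC=/tmp/gz_files      # HDFS folder holding the .gz files (placeholder)
    DST=/tmp/unzipped      # HDFS folder for the unzipped output (placeholder)

    for f in $(hadoop fs -ls "$SRC/*.gz" | awk '{print $NF}' | grep '\.gz$'); do
        name=$(basename "$f" .gz)
        pig -param input_file="$f" -param output_file="$DST/$name" unzip.pig
    done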