How to copy and extract .gz files using python

bflance · Oct 26, 2014 · Viewed 18.8k times

I am just starting to learn python and have a question.

How do I create a script to do the following? (I'll show how I do it in bash.)

  1. Copy <file>.gz from remote server1 to local storage.

    cp /dumps/server1/file1.gz /local/

  2. Then extract that file locally.

    gunzip /local/file1.gz

  3. Then copy the extracted file to remote server2 (for archiving and deduplication purposes).

    cp /local/file1.dump /dedupmount

  4. Delete the local copy of the .gz file to free space on the "temporary" storage.

    rm -rf /local/file1.gz

I need to run all of that in a loop over all files. All files and directories are NFS-mounted on the same server.

A for loop goes through the /dumps/ folder and looks for .gz files. Each .gz file will first be copied to the /local directory and then extracted there. Once extracted, the unzipped .dump file will be copied to the /dedupmount folder for archiving.

I'm just banging my head against the wall trying to figure out how to write this.

Answer

John1024 · Oct 26, 2014

Python Solution

While the shell code might be shorter, the whole process can be done natively in Python. The key points of the Python solution are:

  • With the gzip module, gzipped files are as easy to read as normal files.

  • To obtain the list of source files, the glob module is used. It is modeled after the shell glob feature.

  • To manipulate paths, use the os.path module. It provides an OS-independent interface to the file system.
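For example, the two os.path helpers used in the solution behave like this (the paths are the OP's own):

```python
import os.path

# Example path from the question.
src = "/dumps/server1/file1.gz"

base = os.path.basename(src)                   # "file1.gz"
dest = os.path.join("/dedupmount", base[:-3])  # ".gz" (3 chars) stripped
```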

Here is sample code:

import gzip
import glob
import os.path
source_dir = "/dumps/server1"
dest_dir = "/dedupmount"

for src_name in glob.glob(os.path.join(source_dir, '*.gz')):
    base = os.path.basename(src_name)              # e.g. "file1.gz"
    dest_name = os.path.join(dest_dir, base[:-3])  # strip the ".gz" suffix
    with gzip.open(src_name, 'rb') as infile:
        with open(dest_name, 'wb') as outfile:
            for line in infile:
                outfile.write(line)

This code reads from server1 and writes to server2. There is no need for a local copy unless you want one.

In this code, all decompression is done by the CPU on the local machine.
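For large dumps, a variant worth knowing: shutil.copyfileobj streams the data in fixed-size binary chunks instead of scanning for newlines. A minimal sketch of the same loop as a helper function (the function name and chunk size are illustrative, not from the answer):

```python
import glob
import gzip
import os.path
import shutil

def gunzip_all(source_dir, dest_dir, chunk_size=64 * 1024):
    """Decompress every .gz file in source_dir into dest_dir."""
    for src_name in glob.glob(os.path.join(source_dir, "*.gz")):
        base = os.path.basename(src_name)
        dest_name = os.path.join(dest_dir, base[:-3])  # strip ".gz"
        with gzip.open(src_name, "rb") as infile:
            with open(dest_name, "wb") as outfile:
                # Copy fixed-size binary chunks; no newline scanning.
                shutil.copyfileobj(infile, outfile, chunk_size)

# e.g. gunzip_all("/dumps/server1", "/dedupmount")
```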

Shell code

For comparison, here is the equivalent shell code:

for src in /dumps/server1/*.gz
do
    base=${src##*/}
    dest="/dedupmount/${base%.gz}"
    zcat "$src" >"$dest"
done

Three-Step Python Code

This slightly more complex approach implements the OP's three-step algorithm, which uses a temporary file on the local machine:

import gzip
import glob
import os.path
import shutil

source_dir = "/dumps/server1"
dest_dir = "/dedupmount"
tmpfile = "/tmp/delete.me"

for src_name in glob.glob(os.path.join(source_dir, '*.gz')):
    base = os.path.basename(src_name)
    dest_name = os.path.join(dest_dir, base[:-3])
    shutil.copyfile(src_name, tmpfile)   # step 1: copy to local temporary storage
    with gzip.open(tmpfile, 'rb') as infile:
        with open(dest_name, 'wb') as outfile:
            for line in infile:          # steps 2-3: extract to the destination
                outfile.write(line)

This copies the source file to a temporary file on the local machine, tmpfile, and then gunzips it from there to the destination file. tmpfile will be overwritten with every invocation of this script.

Temporary files can be a security issue. To avoid this, place the temporary file in a directory that is writable only by the user who runs this script.
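One way to get such a private temporary file from Python is the standard tempfile module: NamedTemporaryFile creates a file with owner-only permissions and deletes it automatically when the with block exits, which also covers the OP's step 4 (removing the local copy). A sketch under those assumptions, with an illustrative helper name:

```python
import gzip
import shutil
import tempfile

def copy_and_extract(src_name, dest_name):
    """Copy src_name to a private temp file, then gunzip it to dest_name."""
    with tempfile.NamedTemporaryFile() as tmp:   # owner-only, auto-deleted
        with open(src_name, "rb") as src:
            shutil.copyfileobj(src, tmp)         # step 1: copy to local temp storage
        tmp.flush()
        tmp.seek(0)
        with gzip.open(tmp, "rb") as infile:     # gzip.open accepts a file object
            with open(dest_name, "wb") as outfile:
                shutil.copyfileobj(infile, outfile)  # steps 2-3: extract to dest
    # step 4: the temp file is gone as soon as the with block exits
```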