Several Python libraries can extract archive files, such as gzip, zipfile, rarfile, tarfile and patool. I found patool especially useful because it is cross-format: it can extract almost any type of archive, including the most popular ones such as ZIP, GZIP, TAR and RAR.
Extracting an archive file with patool is as easy as this:
patoolib.extract_archive("Archive.zip", outdir="Folder1")
where "Archive.zip" is the path of the archive file and "Folder1" is the path of the directory where the extracted file will be stored.
The extraction works fine. The problem is that if I run the same code again for the exact same archive file, an identical extracted file is stored in the same folder under a slightly different name (filename on the first run, filename1 on the second, filename11 on the third, and so on).
Instead, I need the code to overwrite the extracted file if a file with the same name already exists in the directory.
The extract_archive function looks quite minimal: besides these two parameters it only has a verbosity parameter and a program parameter, which specifies the program you want to extract archives with.
Edits:
Nizam Mohamed's answer documented that the extract_archive function actually overwrites the output. I found that to be only partially true: the function overwrites ZIP files, but not GZ files, which is what I am after. For GZ files, the function still generates new files.
Padraic Cunningham's answer suggested using the master source. So I downloaded that code and replaced my old patool library scripts with the scripts from the link. Here is the result:
os.listdir()
Out[11]: ['a.gz']
patoolib.extract_archive("a.gz",verbosity=1,outdir=".")
patool: Extracting a.gz ...
patool: ... a.gz extracted to `.'.
Out[12]: '.'
patoolib.extract_archive("a.gz",verbosity=1,outdir=".")
patool: Extracting a.gz ...
patool: ... a.gz extracted to `.'.
Out[13]: '.'
patoolib.extract_archive("a.gz",verbosity=1,outdir=".")
patool: Extracting a.gz ...
patool: ... a.gz extracted to `.'.
Out[14]: '.'
os.listdir()
Out[15]: ['a', 'a.gz', 'a1', 'a2']
So, again, the extract_archive function creates a new file every time it is executed. Note that the file archived inside a.gz actually has a name different from a.
As you've stated, patoolib is intended to be a generic archive tool.
Various archive types can be created, extracted, tested, listed, compared, searched and repacked with patool. The advantage of patool is its simplicity in handling archive files without having to remember a myriad of programs and options.
Generic Extract Behaviour vs Specific Extract Behaviour
The problem here is that extract_archive does not expose a way to change the default behaviour of the underlying archive tool.
For a .zip extension, patoolib will use unzip. unzip gives you the desired overwrite behaviour if you pass it -o on its command line, i.e. unzip -o ...
However, this is a command line option specific to unzip, and it differs for each archive utility.
For example, tar offers an overwrite option but has no short equivalent the way unzip does: tar --overwrite works, but tar -o does not have the intended effect.
To fix this issue you could make a feature request to the author, or use an alternative library. Unfortunately, patoolib's generic design means that every extract utility function would have to be extended to pass the underlying extractor's own overwrite option.
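As one illustration of the "alternative library" route for the GZ case in the question, here is a minimal sketch that uses only the Python standard library (gzip, shutil, os). The function name gunzip_overwrite is made up for this example, and it assumes the output file should be named after the archive minus its final extension (a.gz becomes a), as in the listing in the question. It only handles single-file .gz archives, not tar.gz.

```python
import gzip
import os
import shutil


def gunzip_overwrite(archive, outdir="."):
    """Decompress a single .gz file into outdir, replacing any existing output."""
    # Assumption: the output name is the archive name without its final extension.
    name = os.path.splitext(os.path.basename(archive))[0]
    os.makedirs(outdir, exist_ok=True)
    target = os.path.join(outdir, name)
    # Mode "wb" truncates an existing file, so repeated runs overwrite it
    # instead of creating a, a1, a2, ...
    with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

Calling gunzip_overwrite("a.gz", outdir=".") several times should leave a single file a next to a.gz.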
Example Changes to patoolib
In patoolib.programs.unzip
def extract_zip(archive, compression, cmd, verbosity, outdir, overwrite=False):
    """Extract a ZIP archive."""
    cmdlist = [cmd]
    if verbosity > 1:
        cmdlist.append('-v')
    if overwrite:
        cmdlist.append('-o')
    cmdlist.extend(['--', archive, '-d', outdir])
    return cmdlist
In patoolib.programs.tar
def extract_tar(archive, compression, cmd, verbosity, outdir, overwrite=False):
    """Extract a TAR archive."""
    cmdlist = [cmd, '--extract']
    if overwrite:
        cmdlist.append('--overwrite')
    add_tar_opts(cmdlist, compression, verbosity)
    cmdlist.extend(["--file", archive, '--directory', outdir])
    return cmdlist
Updating every program is not a trivial change, because each program is different!
Monkey patching overwrite behavior
So you've decided not to improve the patoolib source code... We can override the behaviour of extract_archive so that it first looks for an existing output directory, removes it, and then calls the original extract_archive. Be aware that this deletes everything in outdir, so do not point it at a directory that holds other files (or the archive itself).
You could include this code in your modules; if many modules require it, perhaps put it in __init__.py:
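import os
import shutil
import patoolib

# Keep a reference to the original function. After patching, a call to
# patoolib.extract_archive inside the wrapper would recurse forever.
_original_extract_archive = patoolib.extract_archive

def overwrite_then_extract_archive(archive, verbosity=0, outdir=None, program=None):
    # Remove any existing output directory, then recreate it empty.
    if outdir and os.path.exists(outdir):
        shutil.rmtree(outdir)
        os.makedirs(outdir)
    return _original_extract_archive(archive, verbosity=verbosity, outdir=outdir, program=program)

patoolib.extract_archive = overwrite_then_extract_archive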
Now when we call extract_archive() we have the functionality of overwrite_then_extract_archive().
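If removing the whole output directory is too drastic (for example with outdir="." as in the question), a narrower variant is to delete only the file that the extraction is expected to produce. The sketch below is hypothetical: extract_replacing_output is a name invented here, the extract argument is any callable with an (archive, outdir=...) signature such as patoolib.extract_archive, and it assumes a single-file archive whose output is named after the archive minus its extension (a.gz becomes a), which matches the listing in the question.

```python
import os


def extract_replacing_output(extract, archive, outdir="."):
    """Remove the expected output file, then call extract(archive, outdir=outdir)."""
    # Assumption: a single-file archive such as a.gz extracts to a file named a.
    name = os.path.splitext(os.path.basename(archive))[0]
    target = os.path.join(outdir, name)
    # Only delete a regular file, and never the archive itself.
    if os.path.isfile(target) and os.path.abspath(target) != os.path.abspath(archive):
        os.remove(target)
    return extract(archive, outdir=outdir)
```

Usage would look like extract_replacing_output(patoolib.extract_archive, "a.gz", outdir="."). Because the old a is gone before the extractor runs, there should be no name clash to trigger a1, a2 and so on.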