Checking tarfile integrity in Python

Kaurin picture Kaurin · Apr 15, 2013 · Viewed 8.6k times · Source

I'm working on converting my backup script from shell to Python. One of the features of my old script was to check the created tarfile for integrity by doing: gzip -t .

This seems to be a bit tricky in Python.

It seems that the only way to do this, is by reading each of the compressed TarInfo objects within the tarfile.

Is there a way to check a tarfile for integrity, without extracting to disk, or keeping it in memory (in it's entirety)?

Good people on #python on freenode suggested that I should read each TarInfo object chunk-by-chunk, discarding each chunk read.

I must admit that I have no idea how to do this, seeing that I just started Python.

Imagine that I have a tarfile of 30GB which contains files ranging from 1kb to 10GB...

This is the solution that I started writing:

try:
    tardude = tarfile.open("zero.tar.gz")
except:
    print "There was an error opening tarfile. The file might be corrupt or missing."

for member_info in tardude.getmembers():
    try:
        check = tardude.extractfile(member_info.name)
    except:
        print "File: %r is corrupt." % member_info.name

tardude.close()

This code is far from finished. I would not dare running this on a huge 30GB tar archive, because at one point, check would be an object of 10+GB (If i have such huge files within the tar archive)

Bonus: I tried manually corrupting zero.tar.gz (hex editor - edit a few bytes midfile). The first except does not catch IOError... Here is the output:

Traceback (most recent call last):
  File "./test.py", line 31, in <module>
    for member_info in tardude.getmembers():
  File "/usr/lib/python2.7/tarfile.py", line 1805, in getmembers
    self._load()        # all members, we first have to
  File "/usr/lib/python2.7/tarfile.py", line 2380, in _load
    tarinfo = self.next()
  File "/usr/lib/python2.7/tarfile.py", line 2315, in next
    self.fileobj.seek(self.offset)
  File "/usr/lib/python2.7/gzip.py", line 429, in seek
    self.read(1024)
  File "/usr/lib/python2.7/gzip.py", line 256, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 320, in _read
    self._read_eof()
  File "/usr/lib/python2.7/gzip.py", line 342, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0xe5384b87 != 0xdfe91e1L

Answer

ZachP picture ZachP · Aug 31, 2015

Just a minor improvement on Aya's answer to make things a little more idiomatic (although I'm removing some of the error checking to make the mechanics more visible):

BLOCK_SIZE = 1024

with tarfile.open("zero.tar.gz") as tardude:
    for member in tardude.getmembers():
        with tardude.extractfile(member.name) as target:
            for chunk in iter(lambda: target.read(BLOCK_SIZE), b''):
                pass

This really just removes the while 1: (sometimes considered a minor code smell) and the if not data: check. Also note that the use of with restricts this to Python 2.7+