Strange "BadZipfile: Bad CRC-32" problem

Marc Abramowitz picture Marc Abramowitz · Apr 11, 2011 · Viewed 14.7k times · Source

This code is simplification of code in a Django app that receives an uploaded zip file via HTTP multi-part POST and does read-only processing of the data inside:

#!/usr/bin/env python

import csv, sys, StringIO, traceback, zipfile
try:
    import io
except ImportError:
    sys.stderr.write('Could not import the `io` module.\n')

def get_zip_file(filename, method):
    if method == 'direct':
        return zipfile.ZipFile(filename)
    elif method == 'StringIO':
        data = file(filename).read()
        return zipfile.ZipFile(StringIO.StringIO(data))
    elif method == 'BytesIO':
        data = file(filename).read()
        return zipfile.ZipFile(io.BytesIO(data))


def process_zip_file(filename, method, open_defaults_file):
    zip_file    = get_zip_file(filename, method)
    items_file  = zip_file.open('items.csv')
    csv_file    = csv.DictReader(items_file)

    try:
        for idx, row in enumerate(csv_file):
            image_filename = row['image1']

            if open_defaults_file:
                z = zip_file.open('defaults.csv')
                z.close()

        sys.stdout.write('Processed %d items.\n' % idx)
    except zipfile.BadZipfile:
        sys.stderr.write('Processing failed on item %d\n\n%s' 
                         % (idx, traceback.format_exc()))


process_zip_file(sys.argv[1], sys.argv[2], int(sys.argv[3]))

Pretty simple. We open the zip file and one or two CSV files inside the zip file.

What's weird is that if I run this with a large zip file (~13 MB) and have it instantiate the ZipFile from a StringIO.StringIO or a io.BytesIO (Perhaps anything other than a plain filename? I had similar problems in the Django app when trying to create a ZipFile from a TemporaryUploadedFile or even a file object created by calling os.tmpfile() and shutil.copyfileobj()) and have it open TWO csv files rather than just one, then it fails towards the end of processing. Here's the output that I see on a Linux system:

$ ./test_zip_file.py ~/data.zip direct 1
Processed 250 items.

$ ./test_zip_file.py ~/data.zip StringIO 1
Processing failed on item 242

Traceback (most recent call last):
  File "./test_zip_file.py", line 26, in process_zip_file
    for idx, row in enumerate(csv_file):
  File ".../python2.7/csv.py", line 104, in next
    row = self.reader.next()
  File ".../python2.7/zipfile.py", line 523, in readline
    return io.BufferedIOBase.readline(self, limit)
  File ".../python2.7/zipfile.py", line 561, in peek
    chunk = self.read(n)
  File ".../python2.7/zipfile.py", line 581, in read
    data = self.read1(n - len(buf))
  File ".../python2.7/zipfile.py", line 641, in read1
    self._update_crc(data, eof=eof)
  File ".../python2.7/zipfile.py", line 596, in _update_crc
    raise BadZipfile("Bad CRC-32 for file %r" % self.name)
BadZipfile: Bad CRC-32 for file 'items.csv'

$ ./test_zip_file.py ~/data.zip BytesIO 1
Processing failed on item 242

Traceback (most recent call last):
  File "./test_zip_file.py", line 26, in process_zip_file
    for idx, row in enumerate(csv_file):
  File ".../python2.7/csv.py", line 104, in next
    row = self.reader.next()
  File ".../python2.7/zipfile.py", line 523, in readline
    return io.BufferedIOBase.readline(self, limit)
  File ".../python2.7/zipfile.py", line 561, in peek
    chunk = self.read(n)
  File ".../python2.7/zipfile.py", line 581, in read
    data = self.read1(n - len(buf))
  File ".../python2.7/zipfile.py", line 641, in read1
    self._update_crc(data, eof=eof)
  File ".../python2.7/zipfile.py", line 596, in _update_crc
    raise BadZipfile("Bad CRC-32 for file %r" % self.name)
BadZipfile: Bad CRC-32 for file 'items.csv'

$ ./test_zip_file.py ~/data.zip StringIO 0
Processed 250 items.

$ ./test_zip_file.py ~/data.zip BytesIO 0
Processed 250 items.

Incidentally, the code fails under the same conditions but in a different way on my OS X system. Instead of the BadZipfile exception, it seems to read corrupted data and gets very confused.

This all suggests to me that I am doing something in this code that you are not supposed to do -- e.g.: call zipfile.open on a file while already having another file within the same zip file object open? This doesn't seem to be a problem when using ZipFile(filename), but perhaps it's problematic when passing ZipFile a file-like object, because of some implementation details in the zipfile module?

Perhaps I missed something in the zipfile docs? Or maybe it's not documented yet? Or (least likely), a bug in the zipfile module?

Answer

Marc Abramowitz picture Marc Abramowitz · Apr 11, 2011

I might have just found the problem and the solution, but unfortunately I had to replace Python's zipfile module with a hacked one of my own (called myzipfile here).

$ diff -u ~/run/lib/python2.7/zipfile.py myzipfile.py
--- /home/msabramo/run/lib/python2.7/zipfile.py 2010-12-22 17:02:34.000000000 -0800
+++ myzipfile.py        2011-04-11 11:51:59.000000000 -0700
@@ -5,6 +5,7 @@
 import binascii, cStringIO, stat
 import io
 import re
+import copy

 try:
     import zlib # We may need its compression method
@@ -877,7 +878,7 @@
         # Only open a new file for instances where we were not
         # given a file object in the constructor
         if self._filePassed:
-            zef_file = self.fp
+            zef_file = copy.copy(self.fp)
         else:
             zef_file = open(self.filename, 'rb')

The problem in the standard zipfile module is that when passed a file object (not a filename), it uses that same passed-in file object for every call to the open method. This means that tell and seek are getting called on the same file and so trying to open multiple files within the zip file is causing the file position to be shared and so multiple open calls result in them stepping all over each other. In contrast, when passed a filename, open opens a new file object. My solution is for the case when a file object is passed in, instead of using that file object directly, I create a copy of it.

This change to zipfile fixes the problems I was seeing:

$ ./test_zip_file.py ~/data.zip StringIO 1
Processed 250 items.

$ ./test_zip_file.py ~/data.zip BytesIO 1
Processed 250 items.

$ ./test_zip_file.py ~/data.zip direct 1
Processed 250 items.

but I don't know if it has other negative impacts on zipfile...

EDIT: I just found a mention of this in the Python docs that I had somehow overlooked before. At http://docs.python.org/library/zipfile.html#zipfile.ZipFile.open, it says:

Note: If the ZipFile was created by passing in a file-like object as the first argument to the constructor, then the object returned by open() shares the ZipFile’s file pointer. Under these circumstances, the object returned by open() should not be used after any additional operations are performed on the ZipFile object. If the ZipFile was created by passing in a string (the filename) as the first argument to the constructor, then open() will create a new file object that will be held by the ZipExtFile, allowing it to operate independently of the ZipFile.