This code is simplification of code in a Django app that receives an uploaded zip file via HTTP multi-part POST and does read-only processing of the data inside:
#!/usr/bin/env python
import csv, sys, StringIO, traceback, zipfile
try:
import io
except ImportError:
sys.stderr.write('Could not import the `io` module.\n')
def get_zip_file(filename, method):
if method == 'direct':
return zipfile.ZipFile(filename)
elif method == 'StringIO':
data = file(filename).read()
return zipfile.ZipFile(StringIO.StringIO(data))
elif method == 'BytesIO':
data = file(filename).read()
return zipfile.ZipFile(io.BytesIO(data))
def process_zip_file(filename, method, open_defaults_file):
zip_file = get_zip_file(filename, method)
items_file = zip_file.open('items.csv')
csv_file = csv.DictReader(items_file)
try:
for idx, row in enumerate(csv_file):
image_filename = row['image1']
if open_defaults_file:
z = zip_file.open('defaults.csv')
z.close()
sys.stdout.write('Processed %d items.\n' % idx)
except zipfile.BadZipfile:
sys.stderr.write('Processing failed on item %d\n\n%s'
% (idx, traceback.format_exc()))
process_zip_file(sys.argv[1], sys.argv[2], int(sys.argv[3]))
Pretty simple. We open the zip file and one or two CSV files inside the zip file.
What's weird is that if I run this with a large zip file (~13 MB) and have it instantiate the ZipFile
from a StringIO.StringIO
or a io.BytesIO
(Perhaps anything other than a plain filename? I had similar problems in the Django app when trying to create a ZipFile
from a TemporaryUploadedFile
or even a file object created by calling os.tmpfile()
and shutil.copyfileobj()
) and have it open TWO csv files rather than just one, then it fails towards the end of processing. Here's the output that I see on a Linux system:
$ ./test_zip_file.py ~/data.zip direct 1
Processed 250 items.
$ ./test_zip_file.py ~/data.zip StringIO 1
Processing failed on item 242
Traceback (most recent call last):
File "./test_zip_file.py", line 26, in process_zip_file
for idx, row in enumerate(csv_file):
File ".../python2.7/csv.py", line 104, in next
row = self.reader.next()
File ".../python2.7/zipfile.py", line 523, in readline
return io.BufferedIOBase.readline(self, limit)
File ".../python2.7/zipfile.py", line 561, in peek
chunk = self.read(n)
File ".../python2.7/zipfile.py", line 581, in read
data = self.read1(n - len(buf))
File ".../python2.7/zipfile.py", line 641, in read1
self._update_crc(data, eof=eof)
File ".../python2.7/zipfile.py", line 596, in _update_crc
raise BadZipfile("Bad CRC-32 for file %r" % self.name)
BadZipfile: Bad CRC-32 for file 'items.csv'
$ ./test_zip_file.py ~/data.zip BytesIO 1
Processing failed on item 242
Traceback (most recent call last):
File "./test_zip_file.py", line 26, in process_zip_file
for idx, row in enumerate(csv_file):
File ".../python2.7/csv.py", line 104, in next
row = self.reader.next()
File ".../python2.7/zipfile.py", line 523, in readline
return io.BufferedIOBase.readline(self, limit)
File ".../python2.7/zipfile.py", line 561, in peek
chunk = self.read(n)
File ".../python2.7/zipfile.py", line 581, in read
data = self.read1(n - len(buf))
File ".../python2.7/zipfile.py", line 641, in read1
self._update_crc(data, eof=eof)
File ".../python2.7/zipfile.py", line 596, in _update_crc
raise BadZipfile("Bad CRC-32 for file %r" % self.name)
BadZipfile: Bad CRC-32 for file 'items.csv'
$ ./test_zip_file.py ~/data.zip StringIO 0
Processed 250 items.
$ ./test_zip_file.py ~/data.zip BytesIO 0
Processed 250 items.
Incidentally, the code fails under the same conditions but in a different way on my OS X system. Instead of the BadZipfile
exception, it seems to read corrupted data and gets very confused.
This all suggests to me that I am doing something in this code that you are not supposed to do -- e.g.: call zipfile.open
on a file while already having another file within the same zip file object open? This doesn't seem to be a problem when using ZipFile(filename)
, but perhaps it's problematic when passing ZipFile
a file-like object, because of some implementation details in the zipfile
module?
Perhaps I missed something in the zipfile
docs? Or maybe it's not documented yet? Or (least likely), a bug in the zipfile
module?
I might have just found the problem and the solution, but unfortunately I had to replace Python's zipfile
module with a hacked one of my own (called myzipfile
here).
$ diff -u ~/run/lib/python2.7/zipfile.py myzipfile.py
--- /home/msabramo/run/lib/python2.7/zipfile.py 2010-12-22 17:02:34.000000000 -0800
+++ myzipfile.py 2011-04-11 11:51:59.000000000 -0700
@@ -5,6 +5,7 @@
import binascii, cStringIO, stat
import io
import re
+import copy
try:
import zlib # We may need its compression method
@@ -877,7 +878,7 @@
# Only open a new file for instances where we were not
# given a file object in the constructor
if self._filePassed:
- zef_file = self.fp
+ zef_file = copy.copy(self.fp)
else:
zef_file = open(self.filename, 'rb')
The problem in the standard zipfile
module is that when passed a file object (not a filename), it uses that same passed-in file object for every call to the open
method. This means that tell
and seek
are getting called on the same file and so trying to open multiple files within the zip file is causing the file position to be shared and so multiple open
calls result in them stepping all over each other. In contrast, when passed a filename, open
opens a new file object. My solution is for the case when a file object is passed in, instead of using that file object directly, I create a copy of it.
This change to zipfile
fixes the problems I was seeing:
$ ./test_zip_file.py ~/data.zip StringIO 1
Processed 250 items.
$ ./test_zip_file.py ~/data.zip BytesIO 1
Processed 250 items.
$ ./test_zip_file.py ~/data.zip direct 1
Processed 250 items.
but I don't know if it has other negative impacts on zipfile
...
EDIT: I just found a mention of this in the Python docs that I had somehow overlooked before. At http://docs.python.org/library/zipfile.html#zipfile.ZipFile.open, it says:
Note: If the ZipFile was created by passing in a file-like object as the first argument to the constructor, then the object returned by
open()
shares the ZipFile’s file pointer. Under these circumstances, the object returned byopen()
should not be used after any additional operations are performed on the ZipFile object. If the ZipFile was created by passing in a string (the filename) as the first argument to the constructor, thenopen()
will create a new file object that will be held by the ZipExtFile, allowing it to operate independently of the ZipFile.