Cleaner way to read/gunzip a huge file in python

LittleBobbyTables · Feb 1, 2013

So I have some fairly gigantic .gz files - we're talking 10 to 20 GB each when decompressed.

I need to loop through each line of them, so I'm using the standard:

import gzip
f = gzip.open(path+myFile, 'r')
for line in f.readlines():
    #(yadda yadda)
f.close()

However, both the open() and close() commands take AGES, using up 98% of the memory+CPU. So much so that the program exits and prints Killed to the terminal. Maybe it is loading the entire extracted file into memory?

I'm now using something like:

from subprocess import call
f = open(path+'myfile.txt', 'w')
call(['gunzip', '-c', path+myFile], stdout=f)
#do some looping through the file
f.close()
#then delete extracted file

This works. But is there a cleaner way?

Answer

abarnert · Feb 1, 2013

I'm 99% sure that your problem is not in the gzip.open(), but in the readlines().

As the documentation explains:

f.readlines() returns a list containing all the lines of data in the file.

Obviously, that requires reading and decompressing the entire file, and building up an absolutely gigantic list.

Most likely, it's actually the malloc calls to allocate all that memory that are taking forever. And then, at the end of this scope (assuming you're using CPython), it has to GC that whole gigantic list, which will also take forever.

You almost never want to use readlines. Unless you're using a very old Python, just do this:

for line in f:

A file is an iterable full of lines, just like the list returned by readlines—except that it's not actually a list, it generates more lines on the fly by reading out of a buffer. So, at any given time, you'll only have one line and a couple of buffers on the order of 10MB each, instead of a 25GB list. And the reading and decompressing will be spread out over the lifetime of the loop, instead of done all at once.
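For reference, here is a minimal sketch of the lazy-iteration pattern on Python 3. The helper name iter_gzip_lines and the UTF-8 encoding are assumptions for illustration, not anything from the question; use mode "rb" instead of "rt" if you want raw bytes lines.

```python
import gzip


def iter_gzip_lines(gz_path, encoding="utf-8"):
    """Yield lines from a gzip file one at a time (hypothetical helper).

    Only the current line plus gzip's internal buffers are held in memory.
    """
    # "rt" opens in text mode, so each line comes back as a decoded str.
    with gzip.open(gz_path, "rt", encoding=encoding) as f:
        for line in f:  # the file object hands out lines lazily
            yield line.rstrip("\n")
```

Usage would look like `for line in iter_gzip_lines(path + myFile): ...`. The with block also takes care of closing the file once the loop finishes.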

From a quick test with a 3.5GB gzip file: gzip.open() is effectively instant, for line in f: pass takes a few seconds, and f.close() is effectively instant. But if I do for line in f.readlines(): pass, it takes… well, I'm not sure how long, because after about a minute my system went into swap thrashing hell and I had to force-kill the interpreter to get it to respond to anything…
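If you want to see the difference without thrashing your machine, something like the following sketch compares peak Python-level allocations on a small generated file using tracemalloc. All the function names and the sample size are made up for illustration, and this is not the test I ran above; the absolute numbers will vary by Python version and platform.

```python
import gzip
import os
import tempfile
import tracemalloc


def write_sample(path, n_lines=100_000):
    """Write a small gzip file of n_lines short lines (illustrative data)."""
    with gzip.open(path, "wb") as f:
        for i in range(n_lines):
            f.write(b"line %d of some sample data\n" % i)


def with_readlines(path):
    with gzip.open(path, "rb") as f:
        for line in f.readlines():  # builds the full list of lines first
            pass


def with_iteration(path):
    with gzip.open(path, "rb") as f:
        for line in f:  # one line at a time
            pass


def peak_bytes(func, path):
    """Return the peak traced memory (in bytes) while running func(path)."""
    tracemalloc.start()
    try:
        func(path)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def compare(n_lines=100_000):
    """Return (peak with readlines, peak with plain iteration)."""
    fd, path = tempfile.mkstemp(suffix=".gz")
    os.close(fd)
    try:
        write_sample(path, n_lines)
        return peak_bytes(with_readlines, path), peak_bytes(with_iteration, path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    readlines_peak, iteration_peak = compare()
    print("readlines peak bytes:", readlines_peak)
    print("iteration peak bytes:", iteration_peak)
```

The readlines peak should grow with the file size, while the iteration peak should stay roughly flat no matter how big the file gets.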


Since this has come up a dozen more times since this answer, I wrote this blog post which explains a bit more.