How to free memory after opening a file in Python

Pierre Mourlanne · Sep 14, 2012 · Viewed 13.5k times

I'm opening a 3 GB file in Python to read strings. I then store this data in a dictionary. My next goal is to build a graph using this dictionary so I'm closely monitoring memory usage.

It seems to me that Python loads the whole 3 GB file into memory and I can't get rid of it. My code looks like this:

import time

with open(filename) as data:
    accounts = dict()

    for line in data:
        username = line.split()[1]
        IP = line.split()[0]

        try:
            accounts[username].add(IP)
        except KeyError:
            accounts[username] = set()
            accounts[username].add(IP)

print "The accounts will be deleted from memory in 5 seconds"
time.sleep(5)
accounts.clear()

print "The accounts have been deleted from memory"
time.sleep(5)

print "End of script"

The last lines are there so that I could monitor memory usage. The script uses a bit more than 3 GB in memory. Clearing the dictionary frees around 300 MB. When the script ends, the rest of the memory is freed.
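As an aside, the try/except-KeyError pattern above can be written more compactly with `collections.defaultdict`; the memory behavior is the same, this is just a sketch (the sample lines below are hypothetical, assuming the "IP username" order used in the code):

```python
from collections import defaultdict

# missing keys automatically start as empty sets
accounts = defaultdict(set)

# hypothetical sample lines in the "IP username" order used above
sample = ["1.2.3.4 alice", "5.6.7.8 bob", "1.2.3.4 alice"]

for line in sample:
    IP, username = line.split()[:2]
    accounts[username].add(IP)

print(sorted(accounts))  # ['alice', 'bob']
```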

I'm using Ubuntu, and I've monitored memory usage with both System Monitor and the free command in a terminal.

What I don't understand is why Python needs so much memory after I've cleared the dictionary. Is the file still stored in memory? If so, how can I get rid of it? Or is it a problem with my OS not seeing the freed memory?

EDIT: I've tried forcing a gc.collect() after clearing the dictionary, to no avail.

EDIT2: I'm running Python 2.7.3 on Ubuntu 12.04 LTS.

EDIT3: I realize I forgot to mention something quite important. My real problem is not that my OS does not "get back" the memory used by Python. It's that, later on, Python does not seem to reuse that memory; it just asks the OS for more.

Answer

Jonathan Vanasco · Sep 14, 2012

This really doesn't make sense to me either, and I wanted to figure out how and why it happens (I thought that's how this should work too!). I replicated it on my machine, though with a smaller file.

I saw two distinct problems here:

  1. Why is Python reading the file into memory? (With lazy line reading, it shouldn't, right?)
  2. Why isn't Python freeing the memory back to the system?

I'm not knowledgeable about Python internals, so I just did a lot of web searching. All of this could be completely off the mark. (I barely develop anymore; I've been on the business side of tech for the past few years.)

Lazy line reading...

I looked around and found this post:

http://www.peterbe.com/plog/blogitem-040312-1

It's from a much earlier version of Python, but this line resonated with me:

readlines() reads in the whole file at once and splits it by line.

Then I saw this effbot post, also old:

http://effbot.org/zone/readline-performance.htm

The key takeaway was this:

For example, if you have enough memory, you can slurp the entire file into memory, using the readlines method.

and this:

In Python 2.2 and later, you can loop over the file object itself. This works pretty much like readlines(N) under the covers, but looks much better

Looking at Python's docs for xreadlines [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.xreadlines ]:

This method returns the same thing as iter(f). Deprecated since version 2.3: Use for line in file instead.

It made me think that perhaps some slurping is going on.

So if we look at readlines [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.readlines ]...

Read until EOF using readline() and return a list containing the lines thus read.

...it sort of seems like that's what's happening here.

readline, however, looked like what we wanted [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.readline ]:

Read one entire line from the file

So I tried switching this to readline, and the process never grew above 40 MB (it was growing to 200 MB, the size of the log file, before):

accounts = dict()
data = open(filename)
# call readline() repeatedly; it returns '' at EOF
for line in iter(data.readline, ''):
    info = line.split("LOG:")
    if len(info) == 2:
        (a, b) = info
        try:
            accounts[a].add(True)
        except KeyError:
            accounts[a] = set()
            accounts[a].add(True)
data.close()

My guess is that we're not really lazy-reading the file with the for x in data construct, although all the docs and Stack Overflow comments suggest that we are. readline() consumed significantly less memory for me, and readlines() consumed approximately the same amount of memory as for line in data.
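For reference, here is a small self-contained comparison of the three reading styles discussed above. It uses an in-memory file so it runs anywhere (modern Python 3 syntax, whereas the code above is 2.7):

```python
import io

text = "1.2.3.4 alice\n5.6.7.8 bob\n9.9.9.9 alice\n"

# readlines(): materializes every line in one list up front
with io.StringIO(text) as f:
    all_at_once = f.readlines()

# iterating the file object: yields one line at a time
with io.StringIO(text) as f:
    iterated = [line for line in f]

# readline() in a loop: also one line at a time; '' signals EOF
with io.StringIO(text) as f:
    one_by_one = list(iter(f.readline, ""))

print(all_at_once == iterated == one_by_one)  # True: same lines either way
```

All three produce the same lines; the difference the answer is chasing is only in how much of the file is resident at once.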

Freeing memory...

In terms of freeing up memory, I'm not very familiar with Python's internals, but I recall from back when I worked with mod_perl: if I opened a file that was 500 MB, that Apache child grew to that size. If I freed the memory, it would only be free within that child; garbage-collected memory was never returned to the OS until the process exited.

So I poked around on that idea and found a few links suggesting this might be what's happening:

http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm

If you create a large object and delete it again, Python has probably released the memory, but the memory allocators involved don’t necessarily return the memory to the operating system, so it may look as if the Python process uses a lot more virtual memory than it actually uses.

That entry is fairly old, and afterwards I found a bunch of accepted patches to Python suggesting the behavior had changed and that memory could now be returned to the OS (most of those patches were submitted, and apparently approved, around 2005).

Then I found this posting, http://objectmix.com/python/17293-python-memory-handling.html -- note comment #4:

- Patch #1123430: Python's small-object allocator now returns an arena to the system free() when all memory within an arena becomes unused again. Prior to Python 2.5, arenas (256KB chunks of memory) were never freed. Some applications will see a drop in virtual memory size now, especially long-running applications that, from time to time, temporarily use a large number of small objects. Note that when Python returns an arena to the platform C's free(), there's no guarantee that the platform C library will in turn return that memory to the operating system. The effect of the patch is to stop making that impossible, and in tests it appears to be effective at least on Microsoft C and gcc-based systems. Thanks to Evan Jones for hard work and patience.

So with 2.4 under linux (as you tested) you will indeed not always get the used memory back, with respect to lots of small objects being collected.

The difference therefore (I think) you see between doing an f.read() and an f.readlines() is that the former reads in the whole file as one large string object (i.e. not a small object), while the latter returns a list of lines where each line is a python object.

If the 'for line in data:' construct is essentially wrapping readlines() and not readline(), maybe this has something to do with it? Perhaps it's not a problem of having a single 3 GB object, but instead of having millions of 30k objects.