HTTP download of a very big file

user186473 · Oct 8, 2009 · Viewed 8.9k times

I'm working on a web application in Python/Twisted.

I want the user to be able to download a very big file (> 100 MB). I don't want to load the whole file into the server's memory, of course.

On the server side, I have this idea:

...
request.setHeader('Content-Type', 'text/plain')
fp = open(fileName, 'rb')
try:
    r = None
    while r != '':
        r = fp.read(1024)
        request.write(r)
finally:
    fp.close()
    request.finish()

I expected this to work, but I have a problem. I'm testing with Firefox, and it seems the browser makes me wait until the file has completely downloaded before it shows the open/save dialog box.

I expected the dialog box immediately, and then the progress bar in action...

Maybe I have to add something to the HTTP headers... Something like the size of the file?

Answer

Jean-Paul Calderone · Nov 1, 2009

Two big problems with the sample code you posted are that it is non-cooperative and it loads the entire file into memory before sending it.

while r != '':
    r = fp.read(1024)
    request.write(r)

Remember that Twisted uses cooperative multitasking to achieve any sort of concurrency. So the first problem with this snippet is that it is a while loop over the contents of an entire file (which you say is large). This means the entire file will be read into memory and written to the response before anything else can happen in the process. In this case, it happens that "anything" also includes pushing the bytes from the in-memory buffer onto the network, so your code will also hold the entire file in memory at once and only start to get rid of it when this loop completes.

So, as a general rule, you shouldn't write code for use in a Twisted-based application that uses a loop like this to do a big job. Instead, you need to do each small piece of the big job in a way that will cooperate with the event loop. For sending a file over the network, the best way to approach this is with producers and consumers. These are two related APIs for moving large amounts of data around using buffer-empty events to do it efficiently and without wasting unreasonable amounts of memory.
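To make the difference concrete, here is a toy model of the two approaches. It uses only the standard library: ToyTransport, ToyFileProducer, send_with_loop and run_toy_event_loop are made-up names for illustration, not Twisted APIs, and the "event loop" is reduced to a flush() call that stands for the reactor getting a turn to push buffered bytes onto the network. Treat it as a sketch of the idea rather than of how Twisted is implemented.

```python
import io


class ToyTransport:
    """Toy stand-in for a network transport (not a Twisted class).

    write() only buffers. Bytes leave the buffer when flush() runs,
    which models the event loop getting a chance to send them.
    """

    def __init__(self):
        self.buffered = 0        # bytes currently waiting in memory
        self.peak_buffered = 0   # most bytes ever held at once
        self.sent = 0            # bytes that have "reached the network"

    def write(self, data):
        self.buffered += len(data)
        self.peak_buffered = max(self.peak_buffered, self.buffered)

    def flush(self):
        self.sent += self.buffered
        self.buffered = 0


def send_with_loop(fp, transport, chunk_size=1024):
    """The pattern from the question: the loop never yields."""
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            break
        transport.write(chunk)
    # The event loop only gets a turn once the whole file is buffered.
    transport.flush()


class ToyFileProducer:
    """Toy pull producer: writes one chunk each time it is asked."""

    def __init__(self, fp, transport, chunk_size=1024):
        self.fp = fp
        self.transport = transport
        self.chunk_size = chunk_size
        self.done = False

    def resume_producing(self):
        chunk = self.fp.read(self.chunk_size)
        if chunk:
            self.transport.write(chunk)
        else:
            self.done = True


def run_toy_event_loop(producer, transport):
    """Ask for a chunk only when the buffer is empty, then drain it."""
    while not producer.done:
        producer.resume_producing()
        transport.flush()


data = b"x" * (100 * 1024)

blocking = ToyTransport()
send_with_loop(io.BytesIO(data), blocking)

cooperative = ToyTransport()
run_toy_event_loop(ToyFileProducer(io.BytesIO(data), cooperative), cooperative)

print(blocking.peak_buffered)     # 102400: the whole "file" sat in memory
print(cooperative.peak_buffered)  # 1024: never more than one chunk
```

Both versions send the same bytes. The only difference is who decides when the next chunk is read: the loop, or the buffer-empty event.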

You can find some documentation of these APIs here:

http://twistedmatrix.com/projects/core/documentation/howto/producers.html

Fortunately, for this very common case, there is also a producer written already that you can use, rather than implementing your own:

http://twistedmatrix.com/documents/current/api/twisted.protocols.basic.FileSender.html

You probably want to use it sort of like this:

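from twisted.protocols.basic import FileSender
from twisted.python.log import err
from twisted.web.resource import Resource
from twisted.web.server import NOT_DONE_YET

class Something(Resource):
    ...

    def render_GET(self, request):
        request.setHeader('Content-Type', 'text/plain')
        # fileName is the same path used in the question's code.
        fp = open(fileName, 'rb')
        # FileSender is a producer: it writes one chunk each time the
        # request asks for more, so the whole file is never held in memory.
        d = FileSender().beginFileTransfer(fp, request)
        def cbFinished(ignored):
            fp.close()
            request.finish()
        # Log any failure, then close the file and finish the response.
        d.addErrback(err).addCallback(cbFinished)
        # Tell twisted.web that the response will be completed later.
        return NOT_DONE_YET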

You can read more about NOT_DONE_YET and other related ideas in the "Twisted Web in 60 Seconds" series on my blog, http://jcalderone.livejournal.com/50562.html (see the "asynchronous responses" entries in particular).
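
On the question about headers: sending the file size as Content-Length lets the browser show how much of the download remains, and a Content-Disposition header of type attachment asks it to offer the save dialog instead of displaying the content. A minimal sketch of a helper that builds these headers follows. download_headers is a hypothetical name, the default content type is an assumption, and the filename handling assumes a plain ASCII name without quotes.

```python
import os


def download_headers(path, content_type="application/octet-stream"):
    """Build response headers for a file download (hypothetical helper)."""
    size = os.path.getsize(path)    # file size in bytes, read from disk metadata
    name = os.path.basename(path)   # assumed to be plain ASCII without quotes
    return {
        "Content-Type": content_type,
        "Content-Length": str(size),
        "Content-Disposition": 'attachment; filename="%s"' % name,
    }


# Inside render_GET, before starting the transfer, each pair could be
# applied to the response like this:
#
#     for key, value in download_headers(fileName).items():
#         request.setHeader(key, value)
```

These headers are a complement to the producer-based transfer, not a substitute for it: they describe the response, while the producer controls how it is sent.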