I have code for reading an url like this:
from urllib2 import Request, urlopen
req = Request(url)
for key, val in headers.items():
req.add_header(key, val)
res = urlopen(req, timeout = timeout)
# This line blocks
content = res.read()
The timeout works for the urlopen() call. But then the code gets to the res.read() call where I want to read the response data and the timeout isn't applied there. So the read call may hang almost forever waiting for data from the server. The only solution I've found is to use a signal to interrupt the read() which is not suitable for me since I'm using threads.
What other options are there? Is there a HTTP library for Python that handles read timeouts? I've looked at httplib2 and requests and they seem to suffer the same issue as above. I don't want to write my own nonblocking network code using the socket module because I think there should already be a library for this.
Update: None of the solutions below are doing it for me. You can see for yourself that setting the socket or urlopen timeout has no effect when downloading a large file:
from urllib2 import urlopen
url = 'http://iso.linuxquestions.org/download/388/7163/http/se.releases.ubuntu.com/ubuntu-12.04.3-desktop-i386.iso'
c = urlopen(url)
c.read()
At least on Windows with Python 2.7.3, the timeouts are being completely ignored.
It's not possible for any library to do this without using some kind of asynchronous timer through threads or otherwise. The reason is that the timeout
parameter used in httplib
, urllib2
and other libraries sets the timeout
on the underlying socket
. And what this actually does is explained in the documentation.
SO_RCVTIMEO
Sets the timeout value that specifies the maximum amount of time an input function waits until it completes. It accepts a timeval structure with the number of seconds and microseconds specifying the limit on how long to wait for an input operation to complete. If a receive operation has blocked for this much time without receiving additional data, it shall return with a partial count or errno set to [EAGAIN] or [EWOULDBLOCK] if no data is received.
The bolded part is key. A socket.timeout
is only raised if not a single byte has been received for the duration of the timeout
window. In other words, this is a timeout
between received bytes.
A simple function using threading.Timer
could be as follows.
import httplib
import socket
import threading
def download(host, path, timeout = 10):
content = None
http = httplib.HTTPConnection(host)
http.request('GET', path)
response = http.getresponse()
timer = threading.Timer(timeout, http.sock.shutdown, [socket.SHUT_RD])
timer.start()
try:
content = response.read()
except httplib.IncompleteRead:
pass
timer.cancel() # cancel on triggered Timer is safe
http.close()
return content
>>> host = 'releases.ubuntu.com'
>>> content = download(host, '/15.04/ubuntu-15.04-desktop-amd64.iso', 1)
>>> print content is None
True
>>> content = download(host, '/15.04/MD5SUMS', 1)
>>> print content is None
False
Other than checking for None
, it's also possible to catch the httplib.IncompleteRead
exception not inside the function, but outside of it. The latter case will not work though if the HTTP request doesn't have a Content-Length
header.