Handling IncompleteRead, URLError

from __future__ · Aug 13, 2012

This is part of a web-mining script.

def printer(q,missing):
    while 1:
        tmpurl=q.get()
        try:
            image=urllib2.urlopen(tmpurl).read()
        except httplib.HTTPException:
            missing.put(tmpurl)
            continue
        wf=open(tmpurl[-35:]+".jpg","wb")
        wf.write(image)
        wf.close()

q is a Queue() of URLs, and missing is an initially empty Queue() that collects the URLs that raise errors.

It runs in parallel on 10 threads.

Every time I run it, I get this:

  File "C:\Python27\lib\socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "C:\Python27\lib\httplib.py", line 541, in read
    return self._read_chunked(amt)
  File "C:\Python27\lib\httplib.py", line 592, in _read_chunked
    value.append(self._safe_read(amt))
  File "C:\Python27\lib\httplib.py", line 649, in _safe_read
    raise IncompleteRead(''.join(s), amt)
IncompleteRead: IncompleteRead(5274 bytes read, 2918 more expected)

But I do have the except clause... I also tried catching other exceptions, such as

httplib.IncompleteRead
urllib2.URLError

and even a very long timeout:

image=urllib2.urlopen(tmpurl,timeout=999999).read()

But none of this works.

How can I catch IncompleteRead and URLError?

Answer

Michael Leonard · Oct 21, 2015

I think the correct answer to this question depends on what you consider an "error-raising URL".

Methods of catching multiple exceptions

If you think any URL which raises an exception should be added to the missing queue then you can do:

try:
    image=urllib2.urlopen(tmpurl).read()
except (httplib.HTTPException, httplib.IncompleteRead, urllib2.URLError):
    missing.put(tmpurl)
    continue

This will catch any of those three exceptions and add that url to the missing queue. More simply you could do:

try:
    image=urllib2.urlopen(tmpurl).read()
except:
    missing.put(tmpurl)
    continue

This catches any exception, but a bare except is not considered Pythonic: it also swallows KeyboardInterrupt and SystemExit, and it can hide unrelated errors in your code.
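To make the risk concrete, here is a minimal sketch that needs no network access. The names fetch, bare, narrow and undefined_helper are hypothetical and not from the original script; fetch stands in for urllib2.urlopen(tmpurl).read() and deliberately contains a programming mistake.

```python
def fetch(url):
    # Stand-in for urllib2.urlopen(url).read(); contains a deliberate bug.
    return undefined_helper(url)  # NameError: a coding mistake, not a network error


def bare(url, missing):
    try:
        return fetch(url)
    except:  # swallows everything, including the NameError above
        missing.append(url)


def narrow(url, missing):
    try:
        return fetch(url)
    except IOError:  # urllib2.URLError derives from IOError
        missing.append(url)


missing_bare, missing_narrow = [], []
bare("http://example.com/a.jpg", missing_bare)

try:
    narrow("http://example.com/a.jpg", missing_narrow)
    bug_surfaced = False
except NameError:
    bug_surfaced = True
```

With the bare except, the URL is recorded as missing even though the real problem is a bug in the code. With the narrower clause, the NameError propagates and the bug is visible.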

If by "error-raising URL" you mean any URL that raises an httplib.HTTPException error but you'd still like to keep processing if the other errors are received then you can do:

try:
    image=urllib2.urlopen(tmpurl).read()
except httplib.HTTPException:
    missing.put(tmpurl)
    continue
except (httplib.IncompleteRead, urllib2.URLError):
    continue

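This adds the URL to the missing queue only when an httplib.HTTPException is raised, and skips URLs that raise urllib2.URLError without crashing your script. Note that httplib.IncompleteRead is a subclass of httplib.HTTPException, so as written it is caught by the first clause and also ends up in missing; put the more specific except clause first if you want it skipped instead.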
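How these clauses interact depends on the exception class hierarchy, which can be checked directly. The following is a minimal sketch, not part of the original script: classify is a hypothetical helper, and the import shim is an assumption added so the snippet also runs on Python 3, where httplib is named http.client and URLError lives in urllib.error.

```python
try:
    import httplib                      # Python 2, as in the question
    import urllib2
    URLError = urllib2.URLError
except ImportError:
    import http.client as httplib       # Python 3 name for the same module
    from urllib.error import URLError


def classify(exc):
    """Return which except clause handles exc, most specific clause first."""
    try:
        raise exc
    except httplib.IncompleteRead:
        return "incomplete"
    except httplib.HTTPException:
        return "http"
    except URLError:
        return "url"


# IncompleteRead derives from HTTPException, so clause order matters.
is_subclass = issubclass(httplib.IncompleteRead, httplib.HTTPException)

kinds = [
    classify(httplib.IncompleteRead(b"partial", 10)),
    classify(httplib.BadStatusLine("garbage")),
    classify(URLError("no route to host")),
]
```

Because Python tries except clauses from top to bottom, a clause for httplib.HTTPException placed first would handle an IncompleteRead before any later, more specific clause is reached.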

Iterating over a Queue

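As an aside, while 1 loops are always a bit concerning to me. You should be able to loop over the Queue contents using the following pattern, which calls q.get repeatedly until it returns the sentinel value "STOP" (you need to put that sentinel on the queue yourself after the last URL), though you're free to continue doing it your way:

for tmpurl in iter(q.get, "STOP"):
    # rest of your code goes here
    pass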
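The two-argument form of iter can be tried without any threads or network access. This is a small sketch under assumptions: the example.com URLs and the STOP constant are placeholders, and the import shim is only there so the snippet runs on both Python 2 and Python 3.

```python
try:
    from Queue import Queue    # Python 2
except ImportError:
    from queue import Queue    # Python 3

STOP = "STOP"  # hypothetical sentinel; any value that is never a real URL works

q = Queue()
for url in ["http://example.com/a.jpg", "http://example.com/b.jpg"]:
    q.put(url)
q.put(STOP)  # tells the consumer there is nothing left to process

seen = []
# iter(callable, sentinel) calls q.get() until it returns the sentinel.
for tmpurl in iter(q.get, STOP):
    seen.append(tmpurl)
```

With 10 worker threads, each thread consumes one sentinel when it stops, so the producer would need to put 10 of them on the queue.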

Safely working with files

As another aside, unless it's absolutely necessary to do otherwise, you should use context managers to open and modify files. So your three file-operation lines would become:

with open(tmpurl[-35:]+".jpg","wb") as wf:
    wf.write(image)

The context manager takes care of closing the file, and will do so even if an exception occurs while writing to the file.
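The closing behaviour can be verified with a short experiment. This is only a sketch: the temporary path and the simulated IOError are assumptions standing in for a real failure while writing an image.

```python
import os
import tempfile

# Write into a fresh temporary directory so nothing real is overwritten.
path = os.path.join(tempfile.mkdtemp(), "demo.jpg")

try:
    with open(path, "wb") as wf:
        wf.write(b"partial image bytes")
        raise IOError("simulated failure mid-write")
except IOError:
    pass

# The with block closed the file even though the body raised.
closed_after_error = wf.closed
```

The name wf is still bound after the block, which is why its closed attribute can be inspected afterwards.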