Here's a piece of a web-mining script:
    import httplib
    import urllib2

    def printer(q, missing):
        while 1:
            tmpurl = q.get()
            try:
                image = urllib2.urlopen(tmpurl).read()
            except httplib.HTTPException:
                missing.put(tmpurl)
                continue
            # save the image, naming it after the last 35 characters of the URL
            wf = open(tmpurl[-35:] + ".jpg", "wb")
            wf.write(image)
            wf.close()
`q` is a `Queue()` of URLs, and `missing` is an empty queue for gathering the error-raising URLs. The script runs `printer` in parallel across 10 threads.
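For reference, the threads are started roughly like this (the `urls` list and the thread count here are just illustrative):

    import threading
    from Queue import Queue

    q = Queue()
    missing = Queue()
    for url in urls:  # urls: some iterable of image URLs (illustrative)
        q.put(url)

    threads = [threading.Thread(target=printer, args=(q, missing))
               for _ in range(10)]
    for t in threads:
        t.setDaemon(True)  # so the threads don't block interpreter exit
        t.start()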
Every time I run it, I get this traceback:
File "C:\Python27\lib\socket.py", line 351, in read
data = self._sock.recv(rbufsize)
File "C:\Python27\lib\httplib.py", line 541, in read
return self._read_chunked(amt)
File "C:\Python27\lib\httplib.py", line 592, in _read_chunked
value.append(self._safe_read(amt))
File "C:\Python27\lib\httplib.py", line 649, in _safe_read
raise IncompleteRead(''.join(s), amt)
IncompleteRead: IncompleteRead(5274 bytes read, 2918 more expected)
But I do use the `except` statement ...

I tried catching other exceptions, like `httplib.IncompleteRead` and `urllib2.URLError`, and even

    image = urllib2.urlopen(tmpurl, timeout=999999).read()

but none of it works.

How can I catch `IncompleteRead` and `URLError`?
I think the correct answer to this question depends on what you consider an "error-raising URL".
If you think any URL that raises an exception should be added to the `missing` queue, then you can do:
    try:
        image = urllib2.urlopen(tmpurl).read()
    except (httplib.HTTPException, httplib.IncompleteRead, urllib2.URLError):
        missing.put(tmpurl)
        continue
This will catch any of those three exceptions and add the URL to the `missing` queue. More simply, you could do:
    try:
        image = urllib2.urlopen(tmpurl).read()
    except:
        missing.put(tmpurl)
        continue
This catches any exception at all, but a bare `except` is not considered Pythonic and could hide other bugs in your code.
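If you do want a broad catch, a middle-ground sketch is to log what actually went wrong before recording the URL (the logging choice here is just an illustration):

    import sys
    import traceback

    try:
        image = urllib2.urlopen(tmpurl).read()
    except Exception:
        traceback.print_exc(file=sys.stderr)  # record the real cause
        missing.put(tmpurl)
        continue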
If by "error-raising URL" you mean any URL that raises an httplib.HTTPException
error but you'd still like to keep processing if the other errors are received then you can do:
    try:
        image = urllib2.urlopen(tmpurl).read()
    except httplib.HTTPException:
        missing.put(tmpurl)
        continue
    except (httplib.IncompleteRead, urllib2.URLError):
        continue
This will only add the URL to the `missing` queue if it raises an `httplib.HTTPException`, but will otherwise catch `httplib.IncompleteRead` and `urllib2.URLError` and keep your script from crashing.
As an aside, `while 1` loops are always a bit concerning to me. You should be able to loop through the Queue contents using the following pattern, though you're free to continue doing it your way:
    for tmpurl in iter(q.get, "STOP"):
        # rest of your code goes here
        pass
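Note that the two-argument form of `iter` calls `q.get` repeatedly until it returns the sentinel, so whatever fills the queue needs to enqueue one "STOP" entry per worker thread, along these lines (the thread count is illustrative):

    NUM_THREADS = 10  # match however many printer threads you start
    for _ in range(NUM_THREADS):
        q.put("STOP")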
As another aside, unless it's absolutely necessary to do otherwise, you should use context managers to open and modify files. So your three file-operation lines would become:
    with open(tmpurl[-35:] + ".jpg", "wb") as wf:
        wf.write(image)
The context manager takes care of closing the file, and will do so even if an exception occurs while writing to the file.
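Putting the pieces together, a sketch of the whole worker might look like this (still assuming one "STOP" sentinel is queued per thread):

    import httplib
    import urllib2

    def printer(q, missing):
        for tmpurl in iter(q.get, "STOP"):
            try:
                image = urllib2.urlopen(tmpurl).read()
            except (httplib.HTTPException, httplib.IncompleteRead, urllib2.URLError):
                missing.put(tmpurl)
                continue
            with open(tmpurl[-35:] + ".jpg", "wb") as wf:
                wf.write(image)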