Using grequests to make several thousand GET requests to SourceForge fails with "Max retries exceeded with url"

crf · Feb 24, 2014

I am very new to all of this. I need to obtain data on several thousand SourceForge projects for a paper I am writing. The data is all freely available in JSON format at the URL http://sourceforge.net/api/project/name/[project name]/json. I have a list of several thousand of these URLs (ulist) and I am using the following code:

import grequests
rs = (grequests.get(u) for u in ulist)
answers = grequests.map(rs)

With this code I can fetch the data for any 200 or so projects at a time (for example, rs = (grequests.get(u) for u in ulist[0:199]) works), but as soon as I go over that, every request fails with:

ConnectionError: HTTPConnectionPool(host='sourceforge.net', port=80): Max retries exceeded with url: /api/project/name/p2p-fs/json (Caused by <class 'socket.gaierror'>: [Errno 8] nodename nor servname provided, or not known)
<Greenlet at 0x109b790f0: <bound method AsyncRequest.send of <grequests.AsyncRequest object at 0x10999ef50>>(stream=False)> failed with ConnectionError

After that I cannot make any more requests until I quit Python; as soon as I restart it, I can make another 200 requests.

I've tried using grequests.map(rs, size=200), but this seems to have no effect.

Answer

Virgil · Apr 3, 2014

In my case, the cause was not rate limiting by the destination server but something much simpler: I didn't explicitly close the responses, so each one kept its socket open, and the Python process ran out of file handles.
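To see how many file handles a process is allowed to hold open, the limit can be queried from Python. This is only a minimal sketch, assuming a Unix-like system (the standard-library resource module is not available on Windows); open_file_limits is a hypothetical helper name, not part of grequests.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limits()
# Every socket that is still open counts against the soft limit.
print(f"soft limit: {soft}, hard limit: {hard}")
```

If the soft limit is in the low hundreds, a failure after roughly 200 unclosed responses would be consistent with running out of file handles.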

My solution was to make the following two changes (I don't know for sure which one fixed the issue; in theory either of them should):

  • Set stream=False in grequests.get:

     rs = (grequests.get(u, stream=False) for u in urls)
    
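  • Explicitly call response.close() after reading response.content:

     responses = grequests.map(rs)
     for response in responses:
         if response is None:  # grequests.map returns None for a request that failed
             continue
         make_use_of(response.content)  # make_use_of stands in for your own processing
         response.close()  # release the underlying socket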
    

Note: simply destroying the response object (assigning None to it, then calling gc.collect()) was not enough; this did not close the file handles.
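The close-after-read step can also be wrapped in a small helper so that close() runs even if the processing code raises. This is a sketch under assumptions rather than a definitive implementation: consume_and_close and handle are hypothetical names, and skipping None entries assumes that grequests.map returns None for requests that failed.

```python
def consume_and_close(responses, handle):
    """Pass each response body to `handle`, then close the response.

    `responses` is any iterable of response-like objects that have a
    `.content` attribute and a `.close()` method; None entries are skipped.
    Returns the number of responses that were handled successfully.
    """
    processed = 0
    for response in responses:
        if response is None:  # assumed to mean the request failed
            continue
        try:
            handle(response.content)
            processed += 1
        finally:
            # Runs even if handle() raises, so the socket is always released.
            response.close()
    return processed
```

With the names used above, it would be called as consume_and_close(grequests.map(rs), make_use_of).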