urllib2 HTTP error 429

Florin Stingaciu · Nov 3, 2012 · Viewed 17k times

So I have a list of subreddits and I'm using urllib2 to open them. As I go through them, urllib2 eventually fails with:

urllib2.HTTPError: HTTP Error 429: Unknown

Doing some research I found that reddit limits the number of requests to their servers by IP:

Make no more than one request every two seconds. There's some allowance for bursts of requests, but keep it sane. In general, keep it to no more than 30 requests in a minute.

So I figured I'd use time.sleep() to limit my requests to one page every 10 seconds. This fails just the same.
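In Python 3 terms, the sleep-based throttling idea can be sketched as a small decorator that guarantees a minimum gap between calls (a generic sketch, not reddit-specific; all names here are mine, and the interval is shortened for illustration):

```python
import time

def throttled(min_interval):
    """Return a decorator enforcing at least `min_interval` seconds between calls."""
    last_call = [0.0]  # mutable cell shared across calls

    def decorate(func):
        def wrapper(*args, **kwargs):
            wait = min_interval - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
            last_call[0] = time.monotonic()
            return func(*args, **kwargs)
        return wrapper
    return decorate

@throttled(0.1)  # would be 10 seconds in the real script; shortened here
def fetch(name):
    # placeholder for the real urllib2.urlopen() call
    return 'would fetch /r/' + name

results = [fetch(n) for n in ('science', 'scifi')]
```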

The quote above is taken from the reddit API page, but I am not using the reddit API. At this point I'm thinking one of two things: either that limit applies only to the reddit API, or urllib2 has a limit of its own.

Does anyone know which of these it is, or how I could get around this issue?

Answer

Anonymous Coward · Nov 3, 2012

From https://github.com/reddit/reddit/wiki/API:

Many default User-Agents (like "Python/urllib" or "Java") are drastically limited to encourage unique and descriptive user-agent strings.

This applies to regular requests as well, not just the API. You need to supply your own User-Agent header when making the request.

# TODO: change this to a descriptive user-agent string of your own
hdr = { 'User-Agent' : 'super happy flair bot by /u/spladug' }
req = urllib2.Request(url, headers=hdr)  # attach the header to the request
html = urllib2.urlopen(req).read()
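If you're on Python 3, urllib2 was merged into urllib.request, and the same fix applies; a sketch (the URL and user-agent string are placeholders):

```python
import urllib.request

url = 'https://www.reddit.com/r/python'  # placeholder URL
hdr = {'User-Agent': 'my descriptive bot 0.1 by /u/yourname'}  # placeholder UA string
req = urllib.request.Request(url, headers=hdr)

# The header is attached before the request is sent
# (Request stores header keys in capitalized form):
print(req.get_header('User-agent'))  # → my descriptive bot 0.1 by /u/yourname

# html = urllib.request.urlopen(req).read()  # uncomment to actually fetch
```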

However, this will create a new connection for every request. I suggest using a library that can reuse connections, such as httplib or requests. It will put less stress on the server and speed up your requests:

import httplib
import time

lst = """
science
scifi
"""

hdr = { 'User-Agent' : 'super happy flair bot by /u/spladug' }
conn = httplib.HTTPConnection('www.reddit.com')
for name in lst.split():
    conn.request('GET', '/r/' + name, headers=hdr)  # reuse the same connection
    print conn.getresponse().read()
    time.sleep(2)  # stay within reddit's one-request-per-two-seconds limit
conn.close()
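On Python 3, httplib became http.client; a port of the loop above, wrapped in a function so the network call stays optional (the subreddit names, header, and 2-second delay are as in the original; HTTPS is assumed since reddit redirects plain HTTP):

```python
import http.client
import time

HDR = {'User-Agent': 'super happy flair bot by /u/spladug'}

def fetch_subreddits(names, delay=2):
    """Fetch /r/<name> for each name over one reused HTTPS connection."""
    conn = http.client.HTTPSConnection('www.reddit.com')
    pages = []
    for name in names:
        conn.request('GET', '/r/' + name, headers=HDR)
        pages.append(conn.getresponse().read())
        time.sleep(delay)  # stay within the one-request-per-two-seconds limit
    conn.close()
    return pages

# pages = fetch_subreddits(['science', 'scifi'])  # uncomment to actually fetch
```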