Login to website using urllib2 - Python 2.7

tommo · Dec 18, 2012 · Viewed 54.4k times

Okay, so I am using this for a reddit bot, but I want to be able to figure out HOW to log in to any website. If that makes sense....

I realise that different websites use different login forms etc. So how do I figure out how to adapt it to each website? I'm assuming I need to look for something in the HTML file, but I have no idea what.

I do NOT want to use Mechanize or any other library (which is what all the other answers are about on here and don't actually help me to learn what is happening), as I want to learn by myself how exactly it all works.

The urllib2 documentation really isn't helping me.

Thanks.

Answer

RocketDonkey · Dec 19, 2012

I'll preface this by saying I haven't logged in this way for a while, so I could be missing some of the more 'accepted' ways to do it.

I'm not sure if this is what you're after, but without a library like mechanize or a more robust framework like selenium, in the basic case you just look at the form itself and seek out the inputs. For instance, looking at www.reddit.com, and then viewing the source of the rendered page, you will find this form:

<form method="post" action="https://ssl.reddit.com/post/login" id="login_login-main"
  class="login-form login-form-side">
    <input type="hidden" name="op" value="login-main" />
    <input name="user" placeholder="username" type="text" maxlength="20" tabindex="1" />
    <input name="passwd" placeholder="password" type="password" tabindex="1" />

    <div class="status"></div>

    <div id="remember-me">
      <input type="checkbox" name="rem" id="rem-login-main" tabindex="1" />
      <label for="rem-login-main">remember me</label>
      <a class="recover-password" href="/password">reset password</a>
    </div>

    <div class="submit">
      <button class="btn" type="submit" tabindex="1">login</button>
    </div>

    <div class="clear"></div>
</form>

Here we see a few inputs - op, user, passwd and rem. Also, notice the action attribute - that is the URL to which the form will be posted, and will therefore be our target. So now the last step is packing the parameters into a payload and sending it as a POST request to the action URL (rem is the 'remember me' checkbox, and an unchecked checkbox is not submitted with a form, so the payload leaves it out). Below, we also create a new opener that handles cookies and adds our headers, giving us a slightly more robust opener to execute the requests:

import cookielib
import urllib
import urllib2


# Store the cookies and create an opener that will hold them
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# Add our headers
opener.addheaders = [('User-agent', 'RedditTesting')]

# Install our opener (note that this changes the global opener to the one
# we just made, but you can also just call opener.open() if you want)
urllib2.install_opener(opener)

# The action/target from the form
authentication_url = 'https://ssl.reddit.com/post/login'

# Input parameters we are going to send
payload = {
  'op': 'login-main',
  'user': '<username>',
  'passwd': '<password>'
  }

# Use urllib to encode the payload
data = urllib.urlencode(payload)

# Build our Request object (supplying 'data' makes it a POST)
req = urllib2.Request(authentication_url, data)

# Make the request and read the response
resp = urllib2.urlopen(req)
contents = resp.read()
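
If you'd rather not read the page source by eye every time, the "look at the form and seek out the inputs" step can be scripted with nothing but the standard library. This is only a rough sketch: FormInspector and inspect_forms are names I made up for illustration, SAMPLE is a trimmed copy of the reddit form above, and the try/except import is just there so the same code runs on 2.7 and 3.

```python
try:
    from HTMLParser import HTMLParser      # Python 2.7
except ImportError:
    from html.parser import HTMLParser     # Python 3


class FormInspector(HTMLParser):
    """Record each form's method, action and named fields (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.forms = []
        self._current = None  # the form we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form':
            self._current = {
                'method': (attrs.get('method') or 'get').lower(),
                'action': attrs.get('action'),
                'fields': [],
            }
            self.forms.append(self._current)
        elif (tag in ('input', 'select', 'textarea')
              and self._current is not None and attrs.get('name')):
            # An <input> without a type attribute defaults to a text field
            kind = attrs.get('type', 'text') if tag == 'input' else tag
            self._current['fields'].append(
                (attrs['name'], kind, attrs.get('value')))

    def handle_endtag(self, tag):
        if tag == 'form':
            self._current = None


def inspect_forms(html):
    """Return a list of dicts describing every form found in the HTML."""
    parser = FormInspector()
    parser.feed(html)
    parser.close()
    return parser.forms


# Trimmed copy of the reddit login form shown earlier
SAMPLE = '''
<form method="post" action="https://ssl.reddit.com/post/login" id="login_login-main">
    <input type="hidden" name="op" value="login-main" />
    <input name="user" placeholder="username" type="text" />
    <input name="passwd" placeholder="password" type="password" />
    <input type="checkbox" name="rem" id="rem-login-main" />
    <button class="btn" type="submit">login</button>
</form>
'''

forms = inspect_forms(SAMPLE)
for form in forms:
    print('%s %s' % (form['method'].upper(), form['action']))
    for field in form['fields']:
        print('  %s (%s) value=%r' % field)
```

For the sample it lists the POST target plus op, user, passwd and rem, which are the same things you would pick out by hand. On a real site you would feed it the HTML returned by opener.open(url).read() instead of SAMPLE.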

Note that this can get much more complicated - you can also do this with Gmail, for instance, but you need to pull in parameters that change on every request (such as the GALX parameter), which means fetching the login page first and reading their current values out of the form. Again, not sure if this is what you wanted, but hope it helps.
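
For the changing-parameter case, one possible approach is to collect every hidden input from the login page and merge your credentials on top. Treat this as a sketch under assumptions rather than a working Gmail login: HiddenInputs and build_payload are made-up helper names, and the LOGIN_PAGE markup, its example.com URL, the Email/Passwd field names and the token value are all invented for the example.

```python
try:
    from HTMLParser import HTMLParser      # Python 2.7
    from urllib import urlencode
except ImportError:
    from html.parser import HTMLParser     # Python 3
    from urllib.parse import urlencode


class HiddenInputs(HTMLParser):
    """Collect name/value pairs of every hidden input (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.values = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get('type') or '').lower() == 'hidden'
        if tag == 'input' and is_hidden and attrs.get('name'):
            self.values[attrs['name']] = attrs.get('value') or ''


def build_payload(login_page_html, credentials):
    """Start from the page's hidden fields, then add the credentials."""
    parser = HiddenInputs()
    parser.feed(login_page_html)
    parser.close()
    payload = dict(parser.values)
    payload.update(credentials)
    return payload


# Invented login page; a real one would be fetched with the cookie-enabled opener
LOGIN_PAGE = '''
<form method="post" action="https://example.com/login">
  <input type="hidden" name="GALX" value="made-up-token-123" />
  <input type="text" name="Email" />
  <input type="password" name="Passwd" />
</form>
'''

payload = build_payload(LOGIN_PAGE, {'Email': '<username>', 'Passwd': '<password>'})

# Sorting the items only keeps the encoded string stable between runs
data = urlencode(sorted(payload.items()))
print(data)
```

In a real run you would GET the login page through the same opener that holds the CookieJar (so any cookies set alongside the token are kept), build the payload from that response, and then POST data to the form's action URL exactly as in the reddit example.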