Mechanize in Python - Redirect is not working after submit

Adam Smith picture Adam Smith · Jul 31, 2012 · Viewed 8.2k times · Source

I just started using mechanize in Python and I'm having some problems with it already. I've looked around on StackOverflow and on Google and I've seen people say that the documentation is great and that it should be easy to get it working, but I think I don't know how to look for that documentation since all I can find is code examples which don't really teach me how to do the particular things I'm trying to do. If anyone could point me to such documentation, I'd be glad to read it myself and solve my problem.

For the actual problem, I'm trying to log in to a website by sending my username and password information in a form. When the information is correct, I'm usually redirected, but it doesn't work in mechanize.

This is the part that I don't get, because if I immediately print the html content of the page after calling submit, the page displays a variable that shows that the authentication is valid. If I change the password to an incorrect one, the html shows a message "Invalid credentials", as it would if I were browsing the site normally.

Here's a code sample of how I'm doing it. Keep in mind that it might be totally wrong as I'm only trying to apply what I found in examples:

import mechanize
import cookielib

# Start Browser
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()

br.set_cookiejar(cj)

br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

br.open('http://www.complexejuliequilles.com/')


for l in br.links(url_regex='secure'):
    br.follow_link(l)

br.select_form('form1')

br.form['fldUsername'] = 'myUsername'
br.form['fldPassword'] = 'myPassword'
br.submit()

In this particular example, I open http://www.complexejuliequilles.com, then I follow a link at the bottom that has the text "Administration", where I enter my credentials in the form, then I submit it. Normally, I would be redirected to the first page I was on, but with more buttons that are only usable by administrators. I want to click one of those links to fill out another form to add a list of users of which I have their email addresses, names, etc.

Is there something simple I'm missing? I think I get the basics, but I don't know the library enough to find what is going wrong with the redirect.

Answer

H.D. picture H.D. · Jul 31, 2012

http://wwwsearch.sourceforge.net/mechanize/documentation.html

Avoid using "_http" directly. The first underscore in a name tells us that the developer was thinking on it as something private, and you probably don't need it.

In [20]: mechanize.HTTPRefreshProcessor is mechanize._http.HTTPRefreshProcessor
Out[20]: True

There are some stuff you put before opening the URL that you don't really need. For example: mechanize.Browser() isn't urllib, it already manages the cookies for you. You should not avoid robots.txt. You can follow some more "convention over configuration" by seeing before which handlers are default:

mechanize.Browser().handlers

You probably have mechanize.HTTPRedirectHandler in that list (I do), if not:

br.set_handle_redirect(mechanize.HTTPRedirectHandler)

The for loop is strange, it seems like you're changing its iterator (links inside an open URL) inside the loop (browser opens another URL). I first thought you wanted to click recursively while there's a "secure" URL match. An error would depend on how the links() generator is implemented (probably it follows a fixed br.response() instance), but I think you just want to follow the first link that match:

In [50]: br.follow_link(url_regex="secure") # No loops

I don't know what kind of redirecting/refreshing you need. JavaScript changing window.location.href? If so, mechanize won't do it, unless you parse the JavaScript yourself.

You can get the "raw" information about the last open URL this way:

last_response = br.response() # This is returned by br.open(...) too
http_header_dict = last_response.info().dict
html_string_list = last_response.readlines()
html_data = "".join(html_string_list)

Even if it's a JavaScript, you can get the redirection URL by locating it in the html_data, using html_data.find(), regular expressions, BeautifulSoup, etc..

PEP8 note: Avoid using isolated "l" (lower "L") as variable, it might be mistakenly seem as "one" or "I" (upper "i") depending on the used font and context. You should use "L" or other name instead.