I'm trying to automate the download of the Panel Study of Income Dynamics files available on this web page using R. Clicking on any of those files takes the user through to this login/authentication page. After authentication, it's easy to download the files with a web browser. Unfortunately, the httr code below does not appear to maintain the authentication. I have tried inspecting the headers in Chrome for the Login.aspx page (as described here), but the session still isn't kept even when I believe I'm passing in all the correct values. I don't care whether it's done with httr, RCurl, or something else; I'd just like something that works inside R, so that users of this script don't have to download the files manually or with a completely separate program. One of my attempts is below, but it doesn't work. Any help would be appreciated. Thanks!! :D
require(httr)
values <-
list(
"ctl00$ContentPlaceHolder3$Login1$UserName" = "[email protected]" ,
"ctl00$ContentPlaceHolder3$Login1$Password" = "somepassword" ,
"ctl00$ContentPlaceHolder3$Login1$LoginButton" = "Log In" ,
"_LASTFOCUS" = "" ,
"_EVENTTARGET" = "" ,
"_EVENTARGUMENT" = ""
)
POST( "http://simba.isr.umich.edu/u/Login.aspx?redir=http%3a%2f%2fsimba.isr.umich.edu%2fZips%2fZipMain.aspx" , body = values )
resp <- GET( "http://simba.isr.umich.edu/Zips/GetFile.aspx" , query = list( file = "1053" ) )
Besides storing the cookie after authentication (see my comment above), there was another problem with your solution: the ASP.NET site sets a __VIEWSTATE key-value pair, delivered as a hidden form field in the login page, which has to be preserved in your queries. If you check, you could not even log in with your example (the result of the POST command holds info about how to log in, just check it out).
An outline of a possible solution:
Load the RCurl package:
> library(RCurl)
Set some handy curl options:
> curl = getCurlHandle()
> curlSetOpt(cookiejar = 'cookies.txt', followlocation = TRUE, autoreferer = TRUE, curl = curl)
Load the page for the first time to capture VIEWSTATE:
> html <- getURL('http://simba.isr.umich.edu/u/Login.aspx', curl = curl)
Extract VIEWSTATE with a regular expression or any other tool:
> viewstate <- as.character(sub('.*id="__VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))
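One caveat with the sub() call above: when the pattern does not match, sub() returns its input unchanged, so viewstate would silently end up holding the whole page. A minimal sketch of a stricter extractor follows. The function name extract_hidden_field is made up for illustration, and it assumes the id attribute comes before the value attribute inside the tag, as in the regular expression above.

```r
# Hypothetical helper (not part of RCurl): pull the value of one ASP.NET
# hidden <input> out of a page's HTML, given as a single string.
# Assumes id="..." precedes value="..." inside the same tag.
extract_hidden_field <- function(html, field) {
  pattern <- sprintf('id="%s"[^>]*value="([^"]*)"', field)
  # regexec() returns the full match plus capture groups;
  # the result is empty when nothing matched
  match <- regmatches(html, regexec(pattern, html))[[1]]
  if (length(match) < 2) NA_character_ else match[2]
}
```

It could then replace the line above with viewstate <- extract_hidden_field(html, "__VIEWSTATE"), followed by a check such as stopifnot(!is.na(viewstate)). Some ASP.NET pages also emit an __EVENTVALIDATION field that can be read the same way; whether this login page needs it is not established here.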
Set the parameters as your username, password and the VIEWSTATE:
> params <- list(
'ctl00$ContentPlaceHolder3$Login1$UserName' = '<USERNAME>',
'ctl00$ContentPlaceHolder3$Login1$Password' = '<PASSWORD>',
'ctl00$ContentPlaceHolder3$Login1$LoginButton' = 'Log In',
'__VIEWSTATE' = viewstate
)
Log in at last:
> html = postForm('http://simba.isr.umich.edu/u/Login.aspx', .params = params, curl = curl)
Congrats, now you are logged in and curl holds the cookie verifying that!
Verify if you are logged in:
> grepl('Logout', html)
[1] TRUE
So you can go ahead and download any file - just be sure to pass curl = curl in all your queries.
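As a rough sketch of that last step, the helper below fetches one file through the authenticated handle and writes it to disk. The name download_psid_file and the fetch argument are made up for illustration; the GetFile.aspx URL and the file id are taken from the question, and this has not been verified against the live site.

```r
# Hypothetical helper: download one PSID file using an authenticated handle.
# `fetch` defaults to RCurl::getBinaryURL; it is a parameter only so the
# function can be tried out without network access.
download_psid_file <- function(file_id, destfile, curl,
                               fetch = RCurl::getBinaryURL) {
  url <- paste0("http://simba.isr.umich.edu/Zips/GetFile.aspx?file=", file_id)
  bytes <- fetch(url, curl = curl)  # raw vector holding the response body
  writeBin(bytes, destfile)         # write the bytes to disk unchanged
  invisible(destfile)
}
```

For example, download_psid_file("1053", "psid_1053.zip", curl = curl) would save the file requested in the question. The .zip extension is a guess based on the Zips path in the URL.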