This is my first post on SO, so be gentle. I'm not even sure if this belongs here, but here goes.
I want to access some information on one of my personal accounts. The website is poorly written and requires me to manually input the date I want the information for. It is truly a pain. I have been looking for an excuse to learn more Perl so I thought this would be a great opportunity. My plan was to write a Perl script that would login to my account and query the information for me. However, I got stuck pretty quickly.
my $ua = LWP::UserAgent->new;
my $url = url 'https://account.web.site';
my $res = $ua->request(GET $url);
The resulting web page basically says that my web browser is not supported. I tried a number of different values for
$ua->agent("");
but nothing nothings seems to work. Google-ing around suggests this method, but it also says that perl is used for malicious reasons on web sites. Do web sites block this method? Is what I am trying to do even possible? Is there a different language that would be more appropriate? Is what I'm trying to do even legal or even a good idea? Maybe I should just abandon my efforts.
Note that to prevent giving away any private information, the code I wrote here is not the exact code I am using. I hope that was pretty obvious, though.
EDIT: In FireFox, I disabled JavaScript and CSS. I logged in just fine without the "Incompatible browser" error. It doesn't seem to be JavaScript issue.
We have to make one assumption, the web-server will return the same output if given the same input. With this assumption we inescapably come to the conclusion we're not giving it the same input. There are two browsers, or http clients in this scenario: the one that is giving you the result you want (ex., Firefox, IE, Chrome, or Safari), and the one that is not giving you the result you want (ex., LWP, wget, or cURL).
Before, continuing firstly make sure the simple UserAgents are the same, you can do this by browsing to whatsmyuseragent.com and setting the UserAgent string in the header of the other browser to whatever that website returns. You can also use Firefox's Web Developer's Toolbar to disable CSS, and JavaScript, Java, and meta-redirects: this will help you track down the problem by killing off the really simple stuff.
Now with Firefox you can use FireBug to analyze the REQUEST
that is sent. You can do this under the NET
tab in FireBug, different browsers should have tools that can do what FireBug does with FireFox; however, if you don't know the tool in question you can still use tshark or wireshark as described below. It is important to note that tshark and wireshark will always be more accurate because they work at a lower level which at least in my experience leaves less room for error. For example, you'll see things like meta-redirects the browser is doing which sometimes FireBug can lose track of.
After you understand the first web-request that works, do your best to set the second web-request to that of the first. By this I mean setting the request-headers properly and other request elements. If this still doesn't work you have to know what the second browser is doing to see what is wrong.
In order to troubleshoot this, we must have a total understanding of the requests from both browsers. The second browser is usually tricker, these are often libraries and non-interactive command line browsers that lack the ability to check the request. If they have the ability to dump the request you might still opt to simply check them anyway. To do this I suggest the wireshark and tshark suite. Immediately, you should be warned that because these operate below the browser. By default, you'll see the actual network (IP) packets, and data-link frames. You can filter out what you need specifically with a command like this.
sudo tshark -i <interface> -f tcp -R "http.request" -V |
perl -ne'print if /^Hypertext/../^Frame/'
This will capture all of the TCP packets, display-filter only the http.requests
, then perl filter for only layer 4 HTTP stuff. You might want to add to the display filter to only grab a single web server too -R "http.request and http.host == ''"
You're going to want to check everything to see if the two requests are in line, cookies, GET url, user-agent, etc. Make sure the site doesn't do something goofy.
Updated Jan 23 2010: Based on the new information I would suggest setting Accept
, and Accept-Language
, Accept-Charset
and Accept-Encoding
. You can do that with through $ua->default_headers()
. If what you demand is a lot more functionality out of your useragent, you can always subclass it. I took this aproach for my GData API, you can find my example on of a UserAgent subclass on github.