How can I get the ultimate URL without fetching the pages using Perl and LWP?

planetp · Mar 18, 2010

I'm doing some web scraping using Perl's LWP. I need to process a set of URLs, some of which may redirect (1 or more times).

How can I get the ultimate URL, with all redirects resolved, using the HEAD method?

Answer

Tony Miller · Mar 18, 2010

If you use the fully featured LWP::UserAgent, the response it returns is an instance of HTTP::Response, which in turn holds an HTTP::Request as an attribute. Note that this is NOT necessarily the same HTTP::Request that you created from the original URL in your set of URLs. The HTTP::Response documentation says this about the method that retrieves the request instance from the response instance:

$r->request( $request )

This is used to get/set the request attribute. The request attribute is a reference to the request that caused this response. It does not have to be the same request passed to the $ua->request() method, because there might have been redirects and authorization retries in between.

Once you have the request object, you can use the uri method to get the URI. If redirects were used, the URI is the result of following the chain of redirects.
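To make that relationship concrete without touching the network, here is a minimal sketch that builds a one-hop redirect chain by hand. The example.com URLs are made up for illustration, and the objects are linked manually here; in real use LWP::UserAgent sets request and previous for you while following redirects. The redirects method on HTTP::Response returns the earlier responses in the chain, oldest first.

```perl
use strict;
use warnings;

use HTTP::Request;
use HTTP::Response;

# Hypothetical URLs, used only for illustration.
my $first_req  = HTTP::Request->new(HEAD => 'http://example.com/old');
my $second_req = HTTP::Request->new(HEAD => 'http://example.com/new');

# The 301 response to the first request.
my $redirect = HTTP::Response->new(301, 'Moved Permanently',
    [ Location => 'http://example.com/new' ]);
$redirect->request($first_req);

# The final 200 response, linked back to the earlier hop via previous().
my $final = HTTP::Response->new(200, 'OK');
$final->request($second_req);
$final->previous($redirect);

# The request attached to the final response holds the resolved URI.
my $final_uri = $final->request->uri;

# redirects() walks the previous() chain, oldest response first.
my @hops = map { $_->request->uri } $final->redirects;

print "Final URI = $final_uri\n";
print "Hops      = @hops\n";
```

The final URI comes from the last response's request, while the original URL is only reachable through the earlier responses in the chain.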

Here's a Perl script, tested and verified, that gives you the skeleton of what you need:

#!/usr/bin/perl

use strict;
use warnings;

use LWP::UserAgent;
use HTTP::Request;

my $ua;  # Instance of LWP::UserAgent
my $req; # Instance of (original) request
my $res; # Instance of HTTP::Response returned via request method

$ua = LWP::UserAgent->new;
$ua->agent("$0/0.1 " . $ua->agent);

$req = HTTP::Request->new(HEAD => 'http://www.ecu.edu/wllc');
$req->header('Accept' => 'text/html');

$res = $ua->request($req);

if ($res->is_success) {
    # Using chained method invocation; you probably want to test
    # whether $res->request is defined first.
    # This is the inline version of
    # my $finalrequest = $res->request();
    # print "Final URI = " . $finalrequest->uri() . "\n";
    print "Final URI = " . $res->request()->uri() . "\n";
} else {
    print "Error: " . $res->status_line . "\n";
}
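Since the question involves a set of URLs, the same idea can be wrapped in a small helper. This is only a sketch: resolve_final_url is a hypothetical name, and the max_redirect and timeout values are arbitrary assumptions, not recommendations. It uses the head convenience method of LWP::UserAgent, which follows redirects for HEAD requests by default.

```perl
use strict;
use warnings;

use LWP::UserAgent;

# Hypothetical helper: returns the final URI as a string,
# or nothing if the last response in the chain was not a success.
sub resolve_final_url {
    my ($ua, $url) = @_;

    # head() sends a HEAD request and follows redirects,
    # just as request() does in the script above.
    my $res = $ua->head($url);
    return unless $res->is_success;

    # The request attached to the final response carries the resolved URI.
    return $res->request->uri->as_string;
}

# max_redirect caps how many hops are followed (assumed value).
my $ua = LWP::UserAgent->new(max_redirect => 5, timeout => 10);
```

A call would then look like my $final = resolve_final_url($ua, 'http://www.ecu.edu/wllc'); repeated for each URL in the set. Keep in mind that some servers answer HEAD differently from GET, so a failure here does not always mean the page is unreachable.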