How can I extract URL and link text from HTML in Perl?

anon · Oct 31, 2008

I previously asked how to do this in Groovy. However, now I'm rewriting my app in Perl because of all the CPAN libraries.

If the page contained these links:

<a href="http://www.google.com">Google</a>

<a href="http://www.apple.com">Apple</a>

The output would be:

Google, http://www.google.com
Apple, http://www.apple.com

What is the best way to do this in Perl?

Answer

Andy Lester · Oct 31, 2008

Please look at using the WWW::Mechanize module for this. It will fetch your web pages for you, and then give you easy-to-work-with lists of the links on each page, with the URL and link text available for every link.

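use strict;
use warnings;
use WWW::Mechanize;

my $some_url = shift @ARGV or die "usage: $0 URL\n";   # page to scan (assumed to come from the command line)
my $mech = WWW::Mechanize->new();
$mech->get( $some_url );
my @links = $mech->links();                            # list of WWW::Mechanize::Link objects
for my $link ( @links ) {
    # text may be undef for links that carry no text, so fall back to ''
    printf "%s, %s\n", $link->text() // '', $link->url;
}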

Pretty simple, and if you're looking to navigate to other URLs on that page, it's even simpler.
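As a rough sketch of the navigation case, the snippet below follows a link by its text and then lists the links on the page it lands on. The helper name links_after_following and its two parameters are made up for illustration; follow_link, links, text and url_abs are the WWW::Mechanize and WWW::Mechanize::Link methods, and the exact behaviour when a link is missing depends on the autocheck setting.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical helper: fetch $start_url, follow the first link whose text
# is exactly $link_text, and return "text, url" strings for every link on
# the page we end up on.
sub links_after_following {
    my ( $start_url, $link_text ) = @_;

    my $mech = WWW::Mechanize->new();
    $mech->get( $start_url );

    # If no link matches, follow_link returns undef (or dies itself when
    # autocheck is enabled), so the explicit check covers both cases.
    $mech->follow_link( text => $link_text )
        or die "No link with text '$link_text' on $start_url\n";

    # url_abs resolves relative hrefs against the current page's base URL.
    return map { sprintf '%s, %s', $_->text() // '', $_->url_abs } $mech->links();
}
```

Calling links_after_following( $some_url, 'Apple' ) would load the page, click through the link labelled "Apple", and hand back one string per link found there. follow_link also accepts other criteria, such as url_regex or n, if matching on the exact text is too strict.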

Mech is basically a browser in an object.