how to scrape links with phantomjs

john mangual picture john mangual · Dec 19, 2012 · Viewed 26.9k times · Source

Can PhantomJS be used an an alternative to BeautifulSoup?

I am trying to search on Etsy and visit all the links in term. In Python, I know how to do this (with BeautifulSoup) but today I want to see if I can do the same with PhantomJS. I'm not getting very far.

This script should search "hello kitty" on Etsy and return all the of products <a class="listing-thumb" href=...></a> and print them in the console. Ideally I'd visit them later on and get the information I need. Right now it just freezes. Any ideas?

var page = require('webpage').create();
var url = 'http://www.etsy.com/search?q=hello%20kitty';

page.open(url, function(status){
    // list all the a.href links in the hello kitty etsy page
    var link = page.evaluate(function() {
        return document.querySelectorAll('a.listing-thumb');
    });
    for(var i = 0; i < link.length; i++){ console.log(link[i].href); }
    phantom.exit();
});

I have toyed with using CasperJS, which may be better designed for this.

Answer

NiKo picture NiKo · Dec 19, 2012

PhantomJS evaluate() cannot serialize and return complex objects like HTMLElements or NodeLists, so you have to map them to serializable things before:

var page = require('webpage').create();
var url = 'http://www.etsy.com/search?q=hello%20kitty';

page.open(url, function(status) {
    // list all the a.href links in the hello kitty etsy page
    var links = page.evaluate(function() {
        return [].map.call(document.querySelectorAll('a.listing-thumb'), function(link) {
            return link.getAttribute('href');
        });
    });
    console.log(links.join('\n'));
    phantom.exit();
});

Note: here we use [].map.call() in order to treat a NodeList as a standard Array.