How do you spider with PhantomJS?

John Murch · Nov 16, 2011 · Viewed 14.9k times

I am trying to use PhantomJS to spider an entire domain. I want to start at the root domain, e.g. www.domain.com, pull all links (a.href), and then keep a queue: fetch each new link and add any links it contains to the queue if they haven't already been crawled or queued.

Ideas, Help?

Thanks in advance!

Answer

nrabinowitz · Dec 6, 2011

You might be interested in checking out Pjscrape (disclaimer: this is my project), an open-source scraping library built on top of PhantomJS. It has built-in support for spidering pages and scraping information from each page as it goes. You could spider an entire site, following every anchor link, with a short script like this:

pjs.addSuite({
    url: 'http://www.example.com/your_start_page.html',
    moreUrls: function() {
        // get all URLs from anchor links,
        // restricted to the current domain by default
        return _pjs.getAnchorUrls('a');
    },
    scraper: function() {
        // scrapers can use jQuery
        return $('h1').first().text();
    }
});

By default this will skip pages already spidered and only follow links on the current domain, though these can both be changed in your settings.
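
If you would rather see the queue logic from the question spelled out, here is a minimal sketch in plain JavaScript. It is not how Pjscrape is implemented; `crawl`, `hostOf`, `getLinks`, and `maxPages` are hypothetical names chosen for illustration. The link extractor is passed in as a function, so in PhantomJS you would have to supply your own version that opens each page and collects its a.href values. The sketch also assumes `getLinks` returns absolute URLs synchronously, which a real PhantomJS page load does not do, so treat it as an outline of the bookkeeping only.

```javascript
// Extract the host part of an absolute http(s) URL, or null if there is none.
function hostOf(url) {
    var m = /^https?:\/\/([^\/?#]+)/i.exec(url);
    return m ? m[1].toLowerCase() : null;
}

// Breadth-first crawl starting at startUrl.
// getLinks(url) is a caller-supplied (hypothetical) function that returns
// an array of absolute URLs found on that page.
// maxPages caps the number of pages visited so the loop always terminates.
function crawl(startUrl, getLinks, maxPages) {
    var startHost = hostOf(startUrl);
    var queue = [startUrl];          // URLs waiting to be fetched
    var seen = Object.create(null);  // URLs already crawled or queued
    var visited = [];                // URLs in the order they were fetched
    seen[startUrl] = true;

    while (queue.length > 0 && visited.length < maxPages) {
        var url = queue.shift();
        visited.push(url);

        var links = getLinks(url) || [];
        for (var i = 0; i < links.length; i++) {
            var link = links[i];
            // Only queue links on the starting host that are new to us.
            if (hostOf(link) === startHost && !seen[link]) {
                seen[link] = true;
                queue.push(link);
            }
        }
    }
    return visited;
}
```

Marking a URL as seen when it is queued, rather than when it is fetched, is what keeps the same link from being added twice while it is still waiting in the queue.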