I am trying to use PhantomJS to spider an entire domain. I want to start at the root, e.g. www.domain.com, pull all of its links (a.href), and then maintain a queue: fetch each new link, and add newly discovered links to the queue if they haven't already been crawled or queued.
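Here is a rough sketch of the queue-based loop I have in mind, using PhantomJS's standard webpage module (untested; www.domain.com stands in for the real site):

var webpage = require('webpage');
var startUrl = 'http://www.domain.com/';
var queue = [startUrl];
var seen = {};
seen[startUrl] = true;

function crawlNext() {
    if (queue.length === 0) {
        phantom.exit();
        return;
    }
    var url = queue.shift();
    var page = webpage.create();
    page.open(url, function(status) {
        if (status === 'success') {
            // collect every absolute anchor href on the page
            var links = page.evaluate(function() {
                return Array.prototype.map.call(
                    document.querySelectorAll('a[href]'),
                    function(a) { return a.href; });
            });
            links.forEach(function(link) {
                // naive same-domain check; enqueue only unseen links
                if (link.indexOf('http://www.domain.com') === 0 && !seen[link]) {
                    seen[link] = true;
                    queue.push(link);
                }
            });
        }
        page.close();
        crawlNext(); // process the next queued URL
    });
}

crawlNext();

That's the general shape, but I'd rather not reinvent the queue and dedupe logic if a library already handles it.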
Any ideas or help?
Thanks in advance!
You might be interested in checking out Pjscrape (disclaimer: this is my project), an open-source scraping library built on top of PhantomJS. It has built-in support for spidering pages and scraping information from each page as it goes. You can spider an entire site, visiting every anchor link, with a short script like this:
pjs.addSuite({
    url: 'http://www.example.com/your_start_page.html',
    moreUrls: function() {
        // get all URLs from anchor links,
        // restricted to the current domain by default
        return _pjs.getAnchorUrls('a');
    },
    scraper: function() {
        // scrapers can use jQuery
        return $('h1').first().text();
    }
});
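To try it, save the above as a config file (the filename is up to you, e.g. spider.js) and, assuming a standard Pjscrape download, run it through PhantomJS along the lines of: phantomjs pjscrape.js spider.js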
By default this will skip pages already spidered and only follow links on the current domain, though these can both be changed in your settings.
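If you need finer control than the defaults, you can also shape the URL list yourself in moreUrls. A sketch (the example.com URL, the /blog/ path rule, and the title scraper are all placeholders):

pjs.addSuite({
    url: 'http://www.example.com/',
    moreUrls: function() {
        // start from the built-in helper, then apply a custom filter
        return _pjs.getAnchorUrls('a').filter(function(url) {
            // placeholder rule: only follow links under /blog/
            return url.indexOf('/blog/') !== -1;
        });
    },
    scraper: function() {
        // placeholder scraper: grab the page title
        return $('title').text();
    }
});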