I am trying to use PhantomJS to spider an entire domain. I want to start at the root, e.g. www.domain.com, pull all of its links (a.href), and then maintain a queue: fetch each new link, and add newly discovered links to the queue if they haven't already been crawled or queued.
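Here is a rough sketch of the queue-based loop I have in mind, using PhantomJS's standard webpage module (untested; www.domain.com stands in for the real site):

var webpage = require('webpage');
var startUrl = 'http://www.domain.com/';
var queue = [startUrl];
var seen = {};
seen[startUrl] = true;

function crawlNext() {
    if (queue.length === 0) {
        phantom.exit();
        return;
    }
    var url = queue.shift();
    var page = webpage.create();
    page.open(url, function(status) {
        if (status === 'success') {
            // collect every absolute anchor href on the page
            var links = page.evaluate(function() {
                return Array.prototype.map.call(
                    document.querySelectorAll('a[href]'),
                    function(a) { return a.href; });
            });
            links.forEach(function(link) {
                // naive same-domain check; enqueue only unseen links
                if (link.indexOf('http://www.domain.com') === 0 && !seen[link]) {
                    seen[link] = true;
                    queue.push(link);
                }
            });
        }
        page.close();
        crawlNext(); // process the next queued URL
    });
}

crawlNext();

That's the general shape, but I'd rather not reinvent the queue and dedupe logic if a library already handles it.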
Any ideas or help?
Thanks in advance!
You might be interested in checking out Pjscrape (disclaimer: this is my project), an open-source scraping library built on top of PhantomJS. It has built-in support for spidering pages and scraping information from each page as it goes. You can spider an entire site, visiting every anchor link, with a short script like this:
pjs.addSuite({
    url: 'http://www.example.com/your_start_page.html',
    moreUrls: function() {
        // get all URLs from anchor links,
        // restricted to the current domain by default
        return _pjs.getAnchorUrls('a');
    },
    scraper: function() {
        // scrapers can use jQuery
        return $('h1').first().text();
    }
});
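To try it, save the above as a config file (the filename is up to you, e.g. spider.js) and, assuming a standard Pjscrape download, run it through PhantomJS along the lines of: phantomjs pjscrape.js spider.js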
By default this will skip pages already spidered and only follow links on the current domain, though these can both be changed in your settings.
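If you need finer control than the defaults, you can also shape the URL list yourself in moreUrls. A sketch (the example.com URL, the /blog/ path rule, and the title scraper are all placeholders):

pjs.addSuite({
    url: 'http://www.example.com/',
    moreUrls: function() {
        // start from the built-in helper, then apply a custom filter
        return _pjs.getAnchorUrls('a').filter(function(url) {
            // placeholder rule: only follow links under /blog/
            return url.indexOf('/blog/') !== -1;
        });
    },
    scraper: function() {
        // placeholder scraper: grab the page title
        return $('title').text();
    }
});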