Scrape a webpage and navigate by clicking buttons

user2129794 picture user2129794 · Aug 10, 2013 · Viewed 30.4k times · Source

I want to perform following actions at the server side:

1) Scrape a webpage
2) Simulate a click on that page and then navigate to the new page.
3) Scrape the new page
4) Simulate some button clicks on the new page
5) Sending the data back to the client via json or something

I am thinking of using it with Node.js.

But am confused as to which module should i use
a) Zombie
b) Node.io
c) Phantomjs
d) JSDOM
e) Anything else

I have installed node,io but am not able to run it via command prompt.

PS: I am working in windows 2008 server

Answer

danielepolencic picture danielepolencic · Aug 11, 2013

Zombie.js and Node.io run on JSDOM, hence your options are either going with JSDOM (or any equivalent wrapper), a headless browser (PhantomJS, SlimerJS) or Cheerio.

  • JSDOM is fairly slow because it has to recreate DOM and CSSOM in Node.js.
  • PhantomJS/SlimerJS are proper headless browsers, thus performances are ok and those are also very reliable.
  • Cheerio is a lightweight alternative to JSDOM. It doesn't recreate the entire page in Node.js (it just downloads and parses the DOM - no javascript is executed). Therefore you can't really click on buttons/links, but it's very fast to scrape webpages.

Given your requirements, I'd probably go with something like a headless browser. In particular, I'd choose CasperJS because it has a nice and expressive API, it's fast and reliable (it doesn't need to reinvent the wheel on how to parse and render the dom or css like JSDOM does) and it's very easy to interact with elements such as buttons and links.

Your workflow in CasperJS should look more or less like this:

casper.start();

casper
  .then(function(){
    console.log("Start:");
  })
  .thenOpen("https://www.domain.com/page1")
  .then(function(){
    // scrape something
    this.echo(this.getHTML('h1#foobar'));
  })
  .thenClick("#button1")
  .then(function(){
    // scrape something else
    this.echo(this.getHTML('h2#foobar'));
  })
  .thenClick("#button2")
  thenOpen("http://myserver.com", {
    method: "post",
    data: {
        my: 'data',
    }
  }, function() {
      this.echo("data sent back to the server")
  });

casper.run();