Problems with web site scraping using zombie.js

Niklas B. · Sep 7, 2011

I need to do some web scraping. After playing around with different web testing frameworks, most of which were either too slow (Selenium) or too buggy for my needs (env.js), I decided that zombie.js looks most promising, as it uses a solid set of libraries for HTML parsing and DOM manipulation. However, it seems to me that it doesn't even support basic event-based JavaScript code like that in the following web page:

<html>
  <head>
    <title>test</title>
    <script type="text/javascript">

      console.log("test script executing...");
      console.log("registering callback for event DOMContentLoaded on " + document);

      document.addEventListener('DOMContentLoaded', function(){
        console.log("DOMContentLoaded triggered");
      }, false);

      function loaded() {
        console.log("onload triggered");
      }

    </script>
  </head>

  <body onload="loaded();">
    <h1>Test</h1>
  </body>
</html>

I then decided to trigger those events manually like this:

var zombie = require("zombie");

// Load the page, then manually fire the events that were not triggered.
zombie.visit("http://localhost:4567/", { debug: true }, function (err, browser, status) {
  if (err) throw err;

  var doc = browser.document;
  console.log("firing DOMContentLoaded on " + doc);
  browser.fire("DOMContentLoaded", doc, function (err, browser, status) {
    if (err) throw err;

    var body = browser.querySelector("body");
    console.log("firing load on " + body);
    browser.fire("load", body, function (err, browser, status) {
      if (err) throw err;

      // Dump the resulting markup once both handlers have run.
      console.log(browser.html());
    });
  });
});

This works for this particular test page. My problem is a more general one, though: I want to be able to scrape more complex, AJAX-based sites like a friends list on Facebook (something like http://www.facebook.com/profile.php?id=100000028174850&sk=friends&v=friends). It is no problem to log into the site using zombie, but some content, like those lists, seems to be loaded entirely dynamically using AJAX, and I don't know how to trigger the event handlers that initiate the loading.

There are several questions I have regarding this problem:

  • Has somebody already implemented a similarly complex scraper without using a browser remote-controlling solution like Selenium?
  • Is there some reference on the loading process of a complex JavaScript-based page?
  • Can somebody provide advice on how to debug a real browser to see what I might need to execute to trigger the Facebook event handlers?
  • Any other ideas about this topic?

Again, please do not point me to solutions that involve controlling a real browser, like Selenium, as I know about those. Suggestions for a real in-memory renderer like WebKit that is accessible from Ruby are welcome, however, preferably one that can set cookies and load raw HTML instead of triggering real HTTP requests.

Answer

jches · Sep 7, 2011

For purposes of data extraction, running a "headless browser" and triggering JavaScript events manually is not going to be the easiest thing to do. While not impossible, there are simpler ways to do it.

Most sites, even AJAX-heavy ones, can be scraped without executing a single line of their JavaScript code. In fact, it's usually easier than trying to figure out a site's JavaScript code, which is often obfuscated, minified, and difficult to debug. If you have a solid understanding of HTTP you will understand why: (almost) all interactions with the server are encoded as HTTP requests, so whether they are initiated by JavaScript, by the user clicking a link, or by custom code in a bot program, there's no difference to the server. (I say almost because when Flash or applets get involved there's no telling what data is flying where; they can be application-specific. But anything done in JavaScript will go over HTTP.)

That being said, it is possible to mimic a user on any website using custom software. First you have to be able to see the raw HTTP requests being sent to the server. You can use a proxy server to record requests made by a real browser to the target website. There are many, many tools you can use for this: Charles and Fiddler are handy, most dedicated screen-scraper tools have a basic proxy built in, and the Firebug extension for Firefox and Chrome's developer tools can both show AJAX requests... you get the idea.
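As a rough illustration of sifting through a recording: if your tool can export the capture as a HAR file (an assumption; check what your proxy or browser offers), a few lines of Node can list the requests that look like AJAX calls. The helper names below are made up for this sketch, and the filter is only a heuristic, not a rule every site follows.

```javascript
// Sketch: pick likely AJAX calls out of a HAR capture.
// Assumes `har` follows the HAR 1.2 layout: har.log.entries[i].request
// has `method`, `url`, and `headers` as an array of {name, value} pairs.

// Case-insensitive lookup in a HAR-style header list.
function headerValue(headers, name) {
  var wanted = name.toLowerCase();
  for (var i = 0; i < headers.length; i++) {
    if (headers[i].name.toLowerCase() === wanted) {
      return headers[i].value;
    }
  }
  return undefined;
}

// Heuristic (an assumption): many libraries mark XHR calls with
// "X-Requested-With: XMLHttpRequest", and AJAX responses are often JSON.
function looksLikeAjax(entry) {
  var marker = headerValue(entry.request.headers, 'X-Requested-With');
  var mime = (entry.response && entry.response.content &&
              entry.response.content.mimeType) || '';
  return marker === 'XMLHttpRequest' || mime.indexOf('json') !== -1;
}

// Reduce the capture to the method and URL of each candidate request.
function ajaxRequests(har) {
  return har.log.entries.filter(looksLikeAjax).map(function (entry) {
    return { method: entry.request.method, url: entry.request.url };
  });
}
```

Calling ajaxRequests(JSON.parse(fs.readFileSync(path, 'utf8'))) on an exported capture gives a short list of candidates to inspect by hand; requests the heuristic misses still have to be found by reading the full recording.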

Once you can see the HTTP requests that are made as a result of a particular action on the website, it is easy to write a program to mimic these requests; just send the same requests to the server and it will treat your program just like a browser in which a particular action has been performed.
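To make that concrete, here is a minimal sketch of replaying one recorded request with nothing but Node's built-in http and https modules. The shape of the `recorded` object (method, url, headers, body) and the function names are assumptions made for this example; copy the actual values from whatever your proxy recorded.

```javascript
var http = require('http');
var https = require('https');

// Turn a recorded request into options for http.request / https.request.
// `recorded` is a hypothetical shape: { method, url, headers, body }.
function buildRequestOptions(recorded) {
  var target = new URL(recorded.url);
  var headers = {};
  Object.keys(recorded.headers || {}).forEach(function (name) {
    // Drop the recorded Content-Length; it is recomputed below.
    if (name.toLowerCase() !== 'content-length') {
      headers[name] = recorded.headers[name];
    }
  });
  if (recorded.body) {
    headers['Content-Length'] = Buffer.byteLength(recorded.body);
  }
  return {
    protocol: target.protocol,
    hostname: target.hostname,
    port: target.port || (target.protocol === 'https:' ? 443 : 80),
    path: target.pathname + target.search,
    method: recorded.method || 'GET',
    headers: headers
  };
}

// Send the request and hand the status code and body text to the callback.
function replay(recorded, callback) {
  var options = buildRequestOptions(recorded);
  var transport = options.protocol === 'https:' ? https : http;
  var req = transport.request(options, function (res) {
    var chunks = [];
    res.on('data', function (chunk) { chunks.push(chunk); });
    res.on('end', function () {
      callback(null, res.statusCode, Buffer.concat(chunks).toString('utf8'));
    });
  });
  req.on('error', callback);
  if (recorded.body) req.write(recorded.body);
  req.end();
}
```

For example, replay({ method: 'GET', url: 'http://localhost:4567/', headers: { 'X-Requested-With': 'XMLHttpRequest' } }, function (err, status, body) { ... }) fetches the test page from the question. Session cookies, CSRF tokens, and similar values usually have to be taken from an earlier response rather than hard-coded, which this sketch does not handle.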

Different languages have different libraries for this, with differing capabilities. For Ruby, I have seen a lot of people using Mechanize.

If data extraction is your only goal, then you'll almost always be able to get what you need by mimicking HTTP requests this way. No JavaScript required.

Note - Since you mentioned Facebook, I should mention that scraping Facebook specifically can be exceptionally difficult (although not impossible), because Facebook has measures in place to detect automated access (they use more than just captchas); they will disable an account if they see suspicious activity coming from it. It is, after all, against their terms of service (section 3.2).