Performant parsing of HTML pages with Node.js and XPath

polkovnikov.ph · Sep 9, 2014

I'm doing some web scraping with Node.js. I'd like to use XPath because I can generate the queries semi-automatically with several kinds of GUI tools. The problem is that I cannot find an efficient way to do this.

  1. jsdom is extremely slow. It takes a minute or so to parse a 500 KiB file, with full CPU load and a heavy memory footprint.
  2. Popular libraries for HTML parsing (e.g. cheerio) neither support XPath nor expose a W3C-compliant DOM.
  3. Efficient HTML parsing is, obviously, implemented in WebKit, so using phantom or casper would be an option, but those have to be run in a special way, not just with node <script>. I cannot accept the risk implied by this change. For example, it's much harder to figure out how to run node-inspector with phantom.
  4. Spooky is an option, but it's buggy enough that it didn't run at all on my machine.

What's the right way to parse an HTML page with XPath then?

Answer

pda · Sep 22, 2014

You can do so in several steps.

  1. Parse the HTML with parse5. The downside is that the result is not a DOM, though it is fast enough and W3C-compliant.
  2. Serialize it to XHTML with xmlserializer, which accepts the DOM-like structures of parse5 as input.
  3. Parse that XHTML again with xmldom. Now you finally have a DOM.
  4. The xpath library builds upon xmldom, allowing you to run XPath queries. Be aware that XHTML has its own namespace, so unprefixed queries like //a won't work; element names need a prefix bound to that namespace.

Put together, the four steps look something like this.

const fs = require('mz/fs');  // promise-based wrapper around fs
const xpath = require('xpath');
const parse5 = require('parse5');
const xmlser = require('xmlserializer');
const dom = require('xmldom').DOMParser;

(async () => {
    const html = await fs.readFile('./test.htm');
    // 1. Parse the HTML into a parse5 tree (not a DOM yet).
    const document = parse5.parse(html.toString());
    // 2. Serialize that tree to XHTML.
    const xhtml = xmlser.serializeToString(document);
    // 3. Parse the XHTML into a DOM.
    const doc = new dom().parseFromString(xhtml);
    // 4. Bind the prefix "x" to the XHTML namespace and run the query.
    const select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
    const nodes = select("//x:a/@href", doc);
    console.log(nodes);
})();
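
The query above yields attribute nodes rather than strings, and scraped `href` values are often relative. As a rough follow-up sketch, the helper below reads each node's `value` and resolves it with Node's built-in `URL` class. The name `toAbsoluteUrls` and the base URL are hypothetical, and the plain objects in the usage lines merely stand in for the attribute nodes that `select` returns.

```javascript
// Hypothetical helper: turn the attribute nodes returned by an XPath query
// such as //x:a/@href into absolute URL strings.
function toAbsoluteUrls(attrNodes, baseUrl) {
    const urls = [];
    for (const attr of attrNodes) {
        try {
            // DOM attribute nodes expose their text through `value`.
            urls.push(new URL(attr.value, baseUrl).href);
        } catch (err) {
            // Skip values that cannot be parsed as URLs.
        }
    }
    return urls;
}

// Stand-ins for real attribute nodes; only `value` is read.
const fakeNodes = [{ value: '/about' }, { value: 'http://example.org/x' }];
console.log(toAbsoluteUrls(fakeNodes, 'http://example.com/index.htm'));
```

In the pipeline above this would be called as `toAbsoluteUrls(nodes, pageUrl)`, where `pageUrl` is whatever address the HTML was fetched from.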