How to manage a 'pool' of PhantomJS instances

node.js web-scraping phantomjs jsdom

Trindaz · Apr 1, 2012 · Viewed 23.6k times · Source

I'm planning a webservice for my own internal use that takes one argument, a URL, and returns HTML representing the resolved DOM from that URL. By resolved I mean that the webservice will first get the page at that URL, then use PhantomJS to 'render' the page, and then return the resulting source after all DHTML, AJAX calls, etc. have executed. However, launching PhantomJS on a per-request basis (which I'm doing now) is way too sluggish. I would rather have a pool of PhantomJS instances with one always available to serve the latest call to my webservice.

Has any work been done on this kind of thing before? I'd rather base this webservice on the work of others than write a pool manager / http proxy server for myself from scratch.

More Context: I've listed the 2 similar projects that I've seen so far below and why I've avoided each one, resulting in this question about managing a pool of PhantomJS instances instead.

jsdom - from what I've seen it has great functionality for executing scripts on a page, but it doesn't attempt to replicate browser behaviour, so if I were to use it as a general purpose "DOM resolver" there'd end up being a lot of extra coding to handle all kinds of edge cases, event calling, etc. The first example I saw was having to manually call the onload() function of the body tag for a test app I set up using node. It seemed like the beginning of a deep rabbit hole.

Selenium - It just has so many more moving parts, so setting up a pool to manage long-lived browser instances will just be more complicated than using PhantomJS. I don't need any of its macro recording / scripting benefits. I just want a webservice that is as performant at getting a webpage and resolving its DOM as if I were browsing to that URL with a browser (or even faster if I can make it ignore images etc.)

Answer

I set up a PhantomJs Cloud Service, and it pretty much does what you are asking. It took me about 5 weeks of work to implement.

The biggest problem you'll run into is the known issue of memory leaks in PhantomJs. The way I worked around this is to cycle my instances every 50 calls, i.e. shut an instance down and start a fresh one once it has served 50 requests.
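The recycling idea can be sketched in a few lines of Node.js. This is only an illustration, not the code behind the service described here: `createRecyclingSlot` and the `spawn` factory are hypothetical names, and `spawn` is assumed to return something with a `kill()` method (for example, a child process handle for one PhantomJS instance). The sketch also assumes each slot serves one request at a time, so an instance is never killed mid-request.

```javascript
'use strict';

// Wraps one long-lived instance and replaces it after `maxCalls` uses.
// `spawn` is a hypothetical factory supplied by the caller; it must return
// an object with a `kill()` method.
function createRecyclingSlot(spawn, maxCalls) {
  let instance = spawn();
  let calls = 0;

  return {
    // Returns the instance to use for the next request, recycling first
    // if the current one has already served `maxCalls` requests.
    next() {
      if (calls >= maxCalls) {
        instance.kill();
        instance = spawn();
        calls = 0;
      }
      calls += 1;
      return instance;
    },
    // Shuts the current instance down, e.g. when the service exits.
    dispose() {
      instance.kill();
    },
  };
}

// Demonstration with a fake instance instead of a real PhantomJS process.
let spawned = 0;
const killed = [];
const fakeSpawn = () => {
  const id = ++spawned;
  return { id, kill: () => killed.push(id) };
};

const slot = createRecyclingSlot(fakeSpawn, 50);
const ids = [];
for (let i = 0; i < 120; i += 1) {
  ids.push(slot.next().id);
}
console.log('instances spawned:', spawned, 'instances killed:', killed);
```

In a real pool, `fakeSpawn` would be replaced by whatever starts a PhantomJS process, and the limit of 50 is just the figure quoted above; it would likely need tuning against observed memory growth.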

The second biggest problem you'll run into is per-page processing is very cpu and memory intensive, so you'll only be able to run 4 or so instances per CPU.
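One way to respect that limit is to size the pool from the CPU count and queue any requests beyond it. The following is a minimal sketch under that assumption; `poolSizeFor` and `createLimiter` are made-up names, and the factor of 4 is simply the rough estimate given above, not a measured value.

```javascript
'use strict';
const os = require('os');

// Rough pool size: `perCpu` instances per CPU, never fewer than one.
function poolSizeFor(cpuCount, perCpu) {
  return Math.max(1, cpuCount * perCpu);
}

// Runs at most `limit` jobs at once; the rest wait in a FIFO queue.
// Each job receives a `done` callback that it must call when finished.
function createLimiter(limit) {
  let active = 0;
  const waiting = [];

  function runNext() {
    while (active < limit && waiting.length > 0) {
      const job = waiting.shift();
      active += 1;
      let released = false;
      job(function done() {
        if (released) return; // ignore duplicate calls to done()
        released = true;
        active -= 1;
        runNext();
      });
    }
  }

  return {
    submit(job) {
      waiting.push(job);
      runNext();
    },
    stats() {
      return { active: active, queued: waiting.length };
    },
  };
}

// Demonstration: five jobs, at most two running at a time.
const suggestedSize = poolSizeFor(os.cpus().length, 4);
const limiter = createLimiter(2);
const pendingDone = [];
for (let i = 0; i < 5; i += 1) {
  limiter.submit((done) => pendingDone.push(done));
}
const before = limiter.stats();
pendingDone[0](); // first job finishes, so a queued job may start
const after = limiter.stats();
console.log('suggested pool size:', suggestedSize, before, after);
```

In practice each job would hand a URL to a free PhantomJS instance and call `done` once the rendered HTML has been returned.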

The third biggest problem you'll run into is that PhantomJs is pretty wacky with page-finish events and redirects. You'll be informed that your page is finished rendering before it actually is. There are a number of ways to deal with this, but nothing 'standard' unfortunately.
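One common workaround, offered here as a sketch rather than as the approach this service uses, is to ignore the page-finished signal and instead wait until no network requests are pending and none have started or finished for a short quiet period. The helper below is a hypothetical `createQuietTracker`, written in ES5 style so it could be adapted to a PhantomJS script, and it takes timestamps as arguments so the logic is easy to test.

```javascript
'use strict';

// Tracks in-flight requests and reports when the page has been "quiet"
// (nothing pending, no activity) for at least `quietMs` milliseconds.
function createQuietTracker(quietMs, startedAt) {
  var pending = 0;
  var lastActivity = startedAt;

  return {
    requestStarted: function (now) {
      pending += 1;
      lastActivity = now;
    },
    requestFinished: function (now) {
      if (pending > 0) {
        pending -= 1;
      }
      lastActivity = now;
    },
    isQuiet: function (now) {
      return pending === 0 && now - lastActivity >= quietMs;
    }
  };
}

// Assumed wiring inside a PhantomJS script (not executed here):
//   page.onResourceRequested = function () {
//     tracker.requestStarted(Date.now());
//   };
//   page.onResourceReceived = function (response) {
//     if (response.stage === 'end') { tracker.requestFinished(Date.now()); }
//   };
// The script would then poll tracker.isQuiet(Date.now()) on a timer.

// Demonstration with explicit timestamps (milliseconds).
var tracker = createQuietTracker(500, 0);
tracker.requestStarted(0);
var whilePending = tracker.isQuiet(100);    // a request is still in flight
tracker.requestFinished(200);
var tooSoon = tracker.isQuiet(400);         // only 200 ms since last activity
tracker.requestStarted(450);                // a late AJAX call arrives
tracker.requestFinished(600);
var stillTooSoon = tracker.isQuiet(1000);   // only 400 ms since last activity
var quietAtLast = tracker.isQuiet(1100);    // 500 ms of silence
console.log(whilePending, tooSoon, stillTooSoon, quietAtLast);
```

A request that fails or never completes would leave the pending count above zero, so a caller would probably also want an overall timeout that gives up and returns whatever has rendered so far.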

The fourth biggest problem you'll have to deal with is interop between nodejs and phantomjs. Thankfully, there are a lot of npm packages to choose from that deal with this issue.

So I know I'm biased (as I wrote the solution I'm going to suggest) but I suggest you check out PhantomJsCloud.com which is free for light usage.

Jan 2015 update: Another (5th?) big problem I ran into is how to send the request/response from the manager/load-balancer. Originally I was using PhantomJS's built-in HTTP server, but I kept running into its limitations, especially regarding maximum response size. I ended up writing the request/response to the local file system as the lines of communication.

* Total time spent on implementation of the service, including dealing with issues, is perhaps 20 man-weeks, or perhaps 1000 hours of work.
* FYI, I am doing a complete rewrite for the next version (in progress).
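The file-system exchange mentioned in the update can be illustrated from the Node.js side as follows. The answer does not describe the actual file layout, so everything here is an assumption: the `req-1.json` / `res-1.json` naming, the JSON payload shape, and the helper names `writeMessage` and `readMessage` are all hypothetical. Writing to a temporary name and then renaming is one way to keep the reader from ever seeing a half-written file.

```javascript
'use strict';
const fs = require('fs');
const os = require('os');
const path = require('path');

// Writes a JSON message under a temporary name, then renames it into
// place so that a reader never observes a partially written file.
function writeMessage(dir, name, payload) {
  const finalPath = path.join(dir, name);
  const tmpPath = finalPath + '.tmp';
  fs.writeFileSync(tmpPath, JSON.stringify(payload));
  fs.renameSync(tmpPath, finalPath);
  return finalPath;
}

// Reads and deletes a message; returns null if it has not arrived yet.
function readMessage(dir, name) {
  const filePath = path.join(dir, name);
  if (!fs.existsSync(filePath)) {
    return null;
  }
  const payload = JSON.parse(fs.readFileSync(filePath, 'utf8'));
  fs.unlinkSync(filePath);
  return payload;
}

// Demonstration: both sides simulated in one process.
const exchangeDir = fs.mkdtempSync(path.join(os.tmpdir(), 'phantom-pool-'));

// Manager side: post a request.
writeMessage(exchangeDir, 'req-1.json', { id: 1, url: 'http://example.com/' });

// Worker side: pick up the request and post a (fake) response.
const request = readMessage(exchangeDir, 'req-1.json');
writeMessage(exchangeDir, 'res-1.json', { id: request.id, html: '<html></html>' });

// Manager side: collect the response; a second read finds nothing.
const response = readMessage(exchangeDir, 'res-1.json');
const secondRead = readMessage(exchangeDir, 'res-1.json');

fs.rmdirSync(exchangeDir); // the directory is empty again at this point
console.log(response, secondRead);
```

In a real setup the worker side would run inside PhantomJS, which has its own file API rather than Node's `fs`, and the manager would poll for the response file or watch the directory instead of reading it immediately.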
