Render JavaScript and HTML in (any) Java Program (Access rendered DOM Tree)?

tim · Jan 29, 2010 · Viewed 12.1k times

What are the best Java libraries to fully download any webpage, execute its embedded JavaScript, and then access the rendered page (that is, the DOM tree) programmatically and get that DOM tree as HTML source?

(Something similar to what Firebug does: it renders the page and gives me access to the fully rendered DOM tree, as the page looks in the browser. In contrast, if I click "show source" I only get the JavaScript source code. This is not what I want. I need access to the rendered page.)

(By rendering I mean only building the DOM tree, not visual rendering.)

This does not have to be a single library; it's OK to have several libraries that accomplish this together (one downloads, one renders, and so on). However, due to the dynamic nature of JavaScript, the JavaScript library will most likely also need some kind of downloader to fully render any asynchronous JS.

Background:
In the "good old days", HttpClient (the Apache library) was all you needed to build your own very simple crawler. (A lot of crawlers, such as Nutch or Heritrix, are still built around this core principle, mainly focusing on standard HTML parsing, so I can't learn from them.) My problem is that I need to crawl some websites that rely heavily on JavaScript and that I can't parse with HttpClient, as I definitely need to execute the JavaScript first.

Answer

Erica · Feb 1, 2012

This is a bit outside the box, but if you are planning to run your code on a server where you have complete control over your environment, it might work...

Install Firefox (or XulRunner, if you want to keep things lightweight) on your machine.

Using the Firefox plugin system, write a small plugin which loads a given URL, waits a few seconds, then copies the page's DOM into a String.

From this plugin, use the Java LiveConnect API (see http://jdk6.java.net/plugin2/liveconnect/ and https://developer.mozilla.org/en/LiveConnect ) to push that string across to a public static function in some embedded Java code, which can either do the required processing itself or farm it out to some more complicated code.
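To make the last step a bit more concrete, here is a rough sketch of what the Java side could look like. Everything in it is an assumption for illustration: the class name `DomReceiver`, the method names `receiveDom` and `takeNext`, and the idea of buffering pages in a queue are not part of LiveConnect or any plugin API. It only shows the general shape of a public static entry point that accepts the serialized DOM as a String and hands it off to other code.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical receiving end for the browser plugin described above.
 * All names here are illustrative; none of them come from a real API.
 * In a real deployment the class would probably be public in its own file.
 */
final class DomReceiver {

    /** One serialized DOM snapshot together with the URL it was taken from. */
    public static final class RenderedPage {
        public final String url;
        public final String html;

        RenderedPage(String url, String html) {
            this.url = url;
            this.html = html;
        }
    }

    // Hand-off buffer between the thread the browser calls in on
    // and whatever crawler threads do the actual processing.
    private static final BlockingQueue<RenderedPage> PAGES = new LinkedBlockingQueue<>();

    private DomReceiver() {
        // static utility class, not meant to be instantiated
    }

    /**
     * Entry point the plugin would call with the serialized DOM.
     * It only enqueues the page so the caller gets control back quickly.
     */
    public static void receiveDom(String url, String html) {
        if (url == null || html == null) {
            throw new IllegalArgumentException("url and html must not be null");
        }
        PAGES.add(new RenderedPage(url, html));
    }

    /** Waits up to the given timeout for the next page; returns null if none arrives. */
    public static RenderedPage takeNext(long timeout, TimeUnit unit) {
        try {
            return PAGES.poll(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return null;
        }
    }

    public static void main(String[] args) {
        // Simulate the plugin pushing one rendered page across.
        receiveDom("http://example.com/", "<html><body><p>rendered</p></body></html>");
        RenderedPage page = takeNext(1, TimeUnit.SECONDS);
        System.out.println(page.url + " -> " + page.html.length() + " characters");
    }
}
```

Keeping `receiveDom` this thin is a deliberate choice in the sketch: the browser only has to hand over a String, and the parsing or crawling logic can run on its own threads by calling `takeNext`.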

Benefits: You are using a browser that most application developers target, so the observed behavior should be comparable. You can also upgrade the browser along the normal upgrade path, so your library won't become out-of-date as HTML standards change.

Disadvantages: You will need to have permission to start a non-headless application on your server. You'll also have the complexity of inter-process communication to worry about.

I have used the plugin API to call Java before, and it's quite achievable. If you'd like some sample code, you should take a look at the XQuery plugin - it loads XQuery code from the DOM, passes it across to the Java Saxon library for processing, then pushes the result back into the browser. There are some details about it here:

https://developer.mozilla.org/en/XQuery