Can I get the original page source (vs current DOM) with phantomjs/casperjs?

supercoco · Jun 5, 2014 · Viewed 12.6k times

I am trying to get the original source for a particular web page.

The page executes some scripts that modify the DOM as soon as it loads. I would like to get the source before any script or user changes any object in the document.

With Chrome or Firefox (and probably most browsers) I can either look at the DOM (debug utility F12) or look at the original source (right-click, view source). The latter is what I want to accomplish.

Is it possible to do this with phantomjs/casperjs?

Before getting to the page I have to log in. This is working fine with casperjs. If I browse to the page and render the results I know I am on the right page.

casper.thenOpen('http://'+customUrl, function(response) {
    this.page.render('example.png'); // *** Renders correct page (current DOM) ***
    console.log(this.page.content); // *** Gets current DOM ***
    casper.download('view-source:'+customUrl, 'b.html', 'GET'); // *** Blank page ***
    console.log(this.getHTML()); // *** Gets current DOM ***
    this.debugPage(); // *** Gets current DOM ***
    utils.dump(response); // *** No BODY ***
    casper.download('http://'+customUrl, 'a.html', 'GET');  // *** Not logged in ?! ***
});

I've tried this.download(url, 'a.html'), but it doesn't seem to share the same context: it returns the HTML as if I were not logged in, even when I run with a cookies file (casperjs test.casper.js --cookies-file=cookies.txt).

I believe I should keep analyzing this option.


I have also tried casper.open('view-source:url') instead of casper.open('http://url') but it seems it doesn't recognize the url since I just get a blank page.

I have looked at the raw HTTP Response I get from the server with a utility I have and the body of this message (which is HTML) is what I need but when the page loads in the browser the DOM has already been modified.

I tried:

casper.thenOpen('http://'+url, function(response) {
    ...
});

But the response object contains only the headers and some other information, not the body.


I also tried with the event onResourceRequested.

The idea is to abort the download of any resource needed by a specific web page (the referer).

onResourceRequested: function(casperObj, requestData, networkRequest) {
    // Abort any request whose Referer header is the target page.
    for (var i = 0; i < requestData.headers.length; i++) {
        var obj = requestData.headers[i];
        if (obj.name === "Referer" && obj.value === 'http://' + customUrl) {
            networkRequest.abort();
            break;
        }
    }
}

Unfortunately, the script that initially modifies the DOM seems to be inline in the main HTML page (or this code is not doing what I would like it to do).


Any ideas?

Here is the full code:

phantom.casperTest = true;
phantom.cookiesEnabled = true;

var utils = require('utils');
var casper = require('casper').create({
    clientScripts:  [],
    pageSettings: {
        loadImages:  false,
        loadPlugins: false,
        javascriptEnabled: true,
        webSecurityEnabled: false
    },
    logLevel: "error",
    verbose: true
});

casper.userAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X)');

casper.start('http://www.xxxxxxx.xxx/login');

casper.waitForSelector('input#login',
    function() {
        this.evaluate(function(customLogin, customPassword) {
            document.getElementById("login").value = customLogin;
            document.getElementById("password").value = customPassword;
            document.getElementById("button").click();
        }, {
            "customLogin": customLogin,
            "customPassword": customPassword
        });
    },
    function() {
        console.log("Can't login.");
    },
    15000
);

casper.waitForSelector('div#home',
    function() {
        console.log('Login successful.');
    },
    function() {
        console.log('Login failed.');
    },
    15000
);

casper.thenOpen('http://'+customUrl, function(response) {
    this.page.render('example.png'); // *** Renders correct page (current DOM) ***
    console.log(this.page.content); // *** Gets current DOM ***
    casper.download('view-source:'+customUrl, 'b.html', 'GET'); // *** Blank page ***
    console.log(this.getHTML()); // *** Gets current DOM ***
    this.debugPage(); // *** Gets current DOM ***
    utils.dump(response); // *** No BODY ***
    casper.download('http://'+customUrl, 'a.html', 'GET');  // *** Not logged in ?! ***
});

Answer

Fanch · Jun 6, 2014

Hmm, did you try using some events? For example:

casper.on('load.started', function(resource) {
    casper.echo(casper.getPageContent());
});

I don't think it will work, but try it anyway.

The problem is that you can't do it in a normal CasperJS step, because by then the scripts on your page have already executed. It could work if we could bind to the DOM-ready event, or if Casper had a specific event for that. The catch: the page must be loaded before Casper can send any JS to the DOM environment, so binding onready isn't possible (I don't see how). I think with Phantom we can only scrape data after the load event, that is, once the page is rendered.

So if it's not possible to hack it with the events and maybe some delay, your only solution is to block the scripts which modify your DOM.

There is still the PhantomJS option to disable JavaScript. In Casper:

casper.options.pageSettings.javascriptEnabled = false;

The problem is that you need JS enabled to get the data back, so it can't work... :p Yeah, useless comment! :)

Otherwise you have to use events to block the resource/script that modifies the DOM.

Or you could use the resource.received event to scrape the data you want before the specific resources that modify the DOM appear.

In fact I don't think it's possible: if you create a step that gets some data back from the page just before specific resources appear, then by the time your step is executed those resources will have loaded. It would be necessary to freeze the following resources while your step is scraping the data.

I don't know how to do that, but these events could help you:

casper.on('resource.requested', function(request) {
    console.log(" request " + request.url);
});

casper.on('resource.received', function(resource) {
    console.log(resource.url);
});

casper.on('resource.error',function (request) {
    this.echo('[res : id and url + error description] <-- ' + request.id + ' ' + request.url + ' ' + request.errorString);
});
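
For reference, here is a rough sketch of what a resource.received handler gets to work with. It assumes the PhantomJS response metadata fields url, stage, status and contentType; summarizeResponse and attachLogger are hypothetical helper names, not part of CasperJS. The object carries headers and metadata but no response body, which is why this event alone probably can't hand back the original source.

```javascript
// Hypothetical helper: format the metadata reported for a response.
// The response object has no body field, only headers and metadata.
function summarizeResponse(resource) {
    return [resource.stage, resource.status, resource.contentType, resource.url].join(' ');
}

// Hypothetical helper: log only completed responses ('end' stage) for one URL.
// `casper` is assumed to be an instance from require('casper').create().
// The comparison is an exact string match; the reported URL may be
// normalized (for example with a trailing slash), so adjust as needed.
function attachLogger(casper, targetUrl) {
    casper.on('resource.received', function (resource) {
        if (resource.stage === 'end' && resource.url === targetUrl) {
            console.log(summarizeResponse(resource));
        }
    });
}

// Usage inside a CasperJS script (customUrl is the variable from the question):
// attachLogger(casper, 'http://' + customUrl);
```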

See also How do you Disable css in CasperJS?. The solution that would work: identify the scripts and block them. But if you need them, well, I don't know; it's a good question. Maybe we could defer the execution of a specific script, but I don't think Casper and Phantom easily permit that. The only useful option is abort(); give us this option: timeout("time -> ms")!
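
A minimal sketch of the "identify the scripts and block them" idea, assuming a CasperJS version (1.1 or later) where the resource.requested handler receives a second argument that exposes abort(). The names shouldBlock and blockScripts and the example path are hypothetical. It only affects external scripts: a script that is inline in the main HTML page, as in the question, never triggers a separate request and can't be aborted this way.

```javascript
// Hypothetical helper: true when the requested URL contains one of the patterns.
function shouldBlock(url, patterns) {
    return patterns.some(function (pattern) {
        return url.indexOf(pattern) !== -1;
    });
}

// Hypothetical helper: abort every request whose URL matches a pattern.
// Assumes the handler is called with (requestData, networkRequest).
function blockScripts(casper, patterns) {
    casper.on('resource.requested', function (requestData, networkRequest) {
        if (shouldBlock(requestData.url, patterns)) {
            networkRequest.abort();
        }
    });
}

// Usage inside a CasperJS script, before opening the page
// ('/js/modify-dom.js' is a made-up path):
// blockScripts(casper, ['/js/modify-dom.js']);
```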

onResourceRequested

Here is a similar question: Injecting script before other