Getting DOM node text with Puppeteer and headless Chrome

Question 1

Getting DOM node text with Puppeteer and headless Chrome

node.js google-chrome-headless puppeteer

Don Kirkby · Sep 13, 2017 · Viewed 20.6k times · Source

Answer

Answer

I found three solutions to this problem, depending on how complicated your extraction is. The simplest option is a related function that I hadn't noticed: page.$eval(). It basically does what I was trying to do: combines page.$() and page.evaluate(). Here's an example that works:

const puppeteer = require('puppeteer');

puppeteer.launch().then(function(browser) {
    browser.newPage().then(function(page) {
        page.goto('https://stackoverflow.com/users/4794').then(function() {
            page.$eval('h2.user-card-name', function(heading) {
                return heading.innerText;
            }).then(function(result) {
                console.info(result);
                browser.close();
            });
        });
    });
});

That gives me the expected result:

$ node get_user.js 
Don Kirkby top 2% overall

I wanted to extract something more complicated, but I finally realized that the evaluation function is running in the context of the page. That means you can use any tools that are loaded in the page, and then just send strings and numbers back and forth. In this example, I use jQuery in a string to extract what I want:

const puppeteer = require('puppeteer');

puppeteer.launch().then(function(browser) {
    browser.newPage().then(function(page) {
        page.goto('https://stackoverflow.com/users/4794').then(function() {
            page.evaluate("$('h2.user-card-name').text()").then(function(result) {
                console.info(result);
                browser.close();
            });
        });
    });
});

That gives me a result with the whitespace intact:

$ node get_user.js 

                            Don Kirkby

                                top 2% overall

In my real script, I want to extract the text of several nodes, so I need a function instead of a simple string:

const puppeteer = require('puppeteer');

puppeteer.launch().then(function(browser) {
    browser.newPage().then(function(page) {
        page.goto('https://stackoverflow.com/users/4794').then(function() {
            page.evaluate(function() {
                return $('h2.user-card-name').text();
            }).then(function(result) {
                console.info(result);
                browser.close();
            });
        });
    });
});

That gives the exact same result. Now I need to add error handling, and maybe reduce the indentation levels.

Question 2

I'm trying to use headless Chrome and Puppeteer to run our Javascript tests, but I can't extract the results from the page. Based on this answer, it looks like I should use page.evaluate(). That section even has an example that looks like what I need.

const bodyHandle = await page.$('body');
const html = await page.evaluate(body => body.innerHTML, bodyHandle);
await bodyHandle.dispose();

As a full example, I tried to convert that to a script that will extract my name from my user profile on Stack Overflow. Our project is using Node 6, so I converted the await expressions to use .then().

const puppeteer = require('puppeteer');

puppeteer.launch().then(function(browser) {
    browser.newPage().then(function(page) {
        page.goto('https://stackoverflow.com/users/4794').then(function() {
            page.$('h2.user-card-name').then(function(heading_handle) {
                page.evaluate(function(heading) {
                    return heading.innerText;
                }, heading_handle).then(function(result) {
                    console.info(result);
                    browser.close();
                }, function(error) {
                    console.error(error);
                    browser.close();
                });
            });
        });
    });
});

When I run that, I get this error:

$ node get_user.js 
TypeError: Converting circular structure to JSON
    at Object.stringify (native)
    at args.map.x (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/helper.js:30:43)
    at Array.map (native)
    at Function.evaluationString (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/helper.js:30:29)
    at Frame.<anonymous> (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:376:31)
    at next (native)
    at step (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:355:24)
    at Promise (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:373:12)
    at fn (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:351:10)
    at Frame._rawEvaluate (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:375:3)

The problem seems to be with serializing the input parameter to page.evaluate(). I can pass in strings and numbers, but not element handles. Is the example wrong, or is it a problem with Node 6? How can I extract the text of a DOM node?

Getting DOM node text with Puppeteer and headless Chrome

Answer

Related questions