Using CefSharp.Offscreen to retrieve a web page that requires Javascript to render

Robert Harvey picture Robert Harvey · Feb 18, 2016 · Viewed 15.3k times · Source

I have what is hopefully a simple task, but it's going to take someone that's versed in CefSharp to solve it.

I have an url that I want to retrieve the HTML from. The problem is this particular url doesn't actually distribute the page on a GET. Instead, it pushes a mound of Javascript to the browser, which then executes and produces the actual rendered page. This means that the usual approaches involving HttpWebRequest and HttpWebResponse aren't going to work.

I've looked at a number of different "headless" options, and the one that I think best meets my needs for a number of reasons is CefSharp.Offscreen. But I'm at a loss as to how this thing works. I see that there are several events that can be subscribed to, and some configuration options, but I don't need anything like an embedded browser.

All I really need is a way to do something like this (pseudocode):

string html = CefSharp.Get(url);

I don't have a problem subscribing to events, if that's what's needed to wait for the Javascript to execute and produce the rendered page.

Answer

samuel guedon picture samuel guedon · Aug 14, 2018

I know I am doing some archaeology reviving a 2yo post, but a detailed answered may be of use for someone else.

So yes, Cefsharp.Offscreen is fit to the task.

Here under is a class which will handle all the browser activity.

using System;
using System.IO;
using System.Threading;
using CefSharp;
using CefSharp.OffScreen;

namespace [whatever]
{
    public class Browser
    {

        /// <summary>
        /// The browser page
        /// </summary>
        public ChromiumWebBrowser Page { get; private set; }
        /// <summary>
        /// The request context
        /// </summary>
        public RequestContext RequestContext { get; private set; }

        // chromium does not manage timeouts, so we'll implement one
        private ManualResetEvent manualResetEvent = new ManualResetEvent(false);

        public Browser()
        {
            var settings = new CefSettings()
            {
                //By default CefSharp will use an in-memory cache, you need to     specify a Cache Folder to persist data
                CachePath =     Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData), "CefSharp\\Cache"),
            };

            //Autoshutdown when closing
            CefSharpSettings.ShutdownOnExit = true;

            //Perform dependency check to make sure all relevant resources are in our     output directory.
            Cef.Initialize(settings, performDependencyCheck: true, browserProcessHandler: null);

            RequestContext = new RequestContext();
            Page = new ChromiumWebBrowser("", null, RequestContext);
            PageInitialize();
        }

        /// <summary>
        /// Open the given url
        /// </summary>
        /// <param name="url">the url</param>
        /// <returns></returns>
        public void OpenUrl(string url)
        {
            try
            {
                Page.LoadingStateChanged += PageLoadingStateChanged;
                if (Page.IsBrowserInitialized)
                {
                    Page.Load(url);

                    //create a 60 sec timeout 
                    bool isSignalled = manualResetEvent.WaitOne(TimeSpan.FromSeconds(60));
                    manualResetEvent.Reset();

                    //As the request may actually get an answer, we'll force stop when the timeout is passed
                    if (!isSignalled)
                    {
                        Page.Stop();
                    }
                }
            }
            catch (ObjectDisposedException)
            {
                //happens on the manualResetEvent.Reset(); when a cancelation token has disposed the context
            }
            Page.LoadingStateChanged -= PageLoadingStateChanged;
        }

        /// <summary>
        /// Manage the IsLoading parameter
        /// </summary>
        /// <param name="sender"></param>
        /// <param name="e"></param>
        private void PageLoadingStateChanged(object sender, LoadingStateChangedEventArgs e)
        {
            // Check to see if loading is complete - this event is called twice, one when loading starts
            // second time when it's finished
            if (!e.IsLoading)
            {
                manualResetEvent.Set();
            }
        }

        /// <summary>
        /// Wait until page initialization
        /// </summary>
        private void PageInitialize()
        {
            SpinWait.SpinUntil(() => Page.IsBrowserInitialized);
        }
    }
}

Now in my app I just need to do the following:

public MainWindow()
{
    InitializeComponent();
    _browser = new Browser();
}

private async void GetGoogleSource()
{
    _browser.OpenUrl("http://icanhazip.com/");
    string source = await _browser.Page.GetSourceAsync();
}

And here is the string I get

"<html><head></head><body><pre style=\"word-wrap: break-word; white-space: pre-wrap;\">NotGonnaGiveYouMyIP:)\n</pre></body></html>"