Running Scripts in HtmlAgilityPack

Aabela · Jul 9, 2012 · Viewed 22.2k times

I'm trying to scrape a particular webpage which works as follows.

First the page loads, then it runs some JavaScript to fetch the data it needs to populate the page. I'm interested in that data.

If I get the page with HtmlAgilityPack, the script doesn't run, so I get what is essentially a mostly blank page.

Is there a way to force it to run a script, so I can get the data?

Answer

Jamie Treworgy · Jul 9, 2012

You are getting exactly what the server returns, the same thing a web browser receives. A web browser, of course, then runs the scripts. Html Agility Pack is an HTML parser only: it has no way to interpret the JavaScript or bind it to its internal representation of the document. If you want to run the script, you need a web browser.

The perfect answer to your problem would be a complete "headless" web browser, that is, something that combines an HTML parser, a JavaScript interpreter, and a model that simulates the browser DOM, all working together. Basically, that's a web browser without the rendering part. At this time there isn't such a thing that works entirely within the .NET environment.
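To make the "parser only" point concrete, here is a minimal sketch. It uses Python's standard-library `html.parser` purely as a stand-in for an HTML parser such as Html Agility Pack; it is not the Html Agility Pack API. The page content, the `data` element id, and the `DivText` class are made up for illustration. The parser sees the empty container and the script's source text, but nothing ever executes the script.

```python
from html.parser import HTMLParser

# Hypothetical page: the server sends an empty container, and the script
# would fill it in only if something (a browser) actually executed it.
PAGE = """<html><body>
<div id="data"></div>
<script>
document.getElementById("data").textContent = "42 results";
</script>
</body></html>"""


class DivText(HTMLParser):
    """Collects the text inside <div id="data"> and the raw script source."""

    def __init__(self):
        super().__init__()
        self._current = None      # element we are currently inside, if any
        self.div_text = ""        # text found inside the data container
        self.script_source = ""   # script body, seen only as plain text

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "data":
            self._current = "div"
        elif tag == "script":
            self._current = "script"

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "div":
            self.div_text += data
        elif self._current == "script":
            self.script_source += data


parser = DivText()
parser.feed(PAGE)
parser.close()

# The container is still empty: parsing does not run the script.
print(repr(parser.div_text))
# The script is available only as uninterpreted text.
print("42 results" in parser.script_source)
```

The data the question is after exists only as a string inside the script source; turning it into document content requires a JavaScript engine and a DOM, which is exactly what a parser alone does not provide.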

Your best bet is to use a WebBrowser control and actually load and run the page in Internet Explorer under programmatic control. This won't be fast or pretty, but it will do what you need to do.

Also see my answer to a similar question, "Load a DOM and Execute javascript, server side, with .Net", which discusses the technology available in .NET to do this. Most of the pieces exist right now, but unfortunately they either aren't quite there yet or haven't been integrated in the right way.