How to extract data from website using AngleSharp & LINQ?

Question 1

c# linq web-scraping data-extraction anglesharp

inquisitive_one · Sep 6, 2015 · Viewed 8.2k times · Source

Answer

I am late to the party, but I will try to bring some sanity here.

Querying static webpages

For this we require the following set of tools / functionality:

HTTP requester (to obtain resources, e.g., HTML documents, via HTTP), potentially with an SSL/TLS layer on top (either accepting all certificates or working against the certificate store / known CAs)
HTML parser
A queryable object model representation of the parsed HTML document
Maybe additionally some cookie state and the ability to follow links / post forms

AngleSharp gives us all these options (minus a connection to the certificate store / known CAs; so in order to use HTTPS we must do some additional configuration, e.g., to accept all certificates).

We start by creating an AngleSharp configuration that defines which capabilities are available to the browsing engine. This engine is exposed in the form of a "browsing context", which can be regarded as a headless tab. In this tab we can open a new document (from a local source, a constructed source, or a remote source).

var config = Configuration.Default.WithDefaultLoader();
var context = BrowsingContext.New(config);
var document = await context.OpenAsync("http://example.com");

Once we have the document we can use CSS query selectors to obtain certain elements. These elements can be used to gather the information we look for.
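To make the "constructed source" case and the selector step concrete, here is a minimal sketch. It assumes a recent AngleSharp release in which the OpenAsync overload taking a virtual response lives in the AngleSharp namespace. The class name, the method name and the sample markup are hypothetical and only modelled on the snippet from the question.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Threading.Tasks;
using AngleSharp;

public static class PriceExtractionSketch
{
    // Hypothetical sample markup, modelled on the span shown in the question.
    public const string SampleHtml =
        "<ul>" +
        "<li><span class=\"c-price\">650.00                            </span></li>" +
        "<li><span class=\"c-price\">1280.50 </span></li>" +
        "</ul>";

    public static async Task<decimal[]> ExtractPricesAsync(string html)
    {
        // No loader is required here: the document is built from a string,
        // not fetched over the network.
        var context = BrowsingContext.New(Configuration.Default);
        var document = await context.OpenAsync(req => req.Content(html));

        // Select the price spans, strip the padding whitespace, parse the numbers.
        return document.QuerySelectorAll("span.c-price")
            .Select(span => span.TextContent.Trim())
            .Select(text => decimal.Parse(text, CultureInfo.InvariantCulture))
            .ToArray();
    }

    public static async Task Main()
    {
        foreach (var price in await ExtractPricesAsync(SampleHtml))
        {
            Console.WriteLine(price.ToString(CultureInfo.InvariantCulture));
        }
    }
}
```

Working against an in-memory string like this is also a convenient way to check a selector before pointing the same code at a remote page.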

AngleSharp embraces LINQ (or IEnumerable in general); however, it makes sense to push as much of the filtering as possible into the selector query itself.

So instead of

var pricesListItemsLinq = document.All
    .Where(m => m.LocalName == "span" && m.ClassList.Equals("c-price"));

we write

var pricesListItemsLinq = document.QuerySelectorAll("span.c-price");

This is also much more robust. ClassList is a complex object giving access to a list of classes, so comparing it to a string with Equals does not do what was intended; you probably meant either ClassList.Contains or ClassName.Equals (ClassName being the string representation). Note: these two are not equivalent. The former looks for a class within the list of classes, while the latter requires a match against the whole class serialization, which imposes an extra condition on the match: it needs to be the only class.
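The difference between the variants can be seen in a small sketch. It assumes a recent AngleSharp release; the class name, the method name and the three-span sample markup are hypothetical. The second span carries an extra class, which is exactly the case where the whole-string comparison falls short.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using AngleSharp;

public static class ClassMatchingSketch
{
    // Hypothetical markup: the second span has an additional class.
    public const string SampleHtml =
        "<span class=\"c-price\">650.00</span>" +
        "<span class=\"c-price sale\">499.00</span>" +
        "<span class=\"price\">1.00</span>";

    public static async Task<(int ByClassList, int ByClassName, int BySelector)> CountMatchesAsync(string html)
    {
        var context = BrowsingContext.New(Configuration.Default);
        var document = await context.OpenAsync(req => req.Content(html));

        // Token-based check: matches any span that has c-price among its classes.
        var byClassList = document.All
            .Count(m => m.LocalName == "span" && m.ClassList.Contains("c-price"));

        // Whole-string check: matches only if c-price is the entire class attribute.
        var byClassName = document.All
            .Count(m => m.LocalName == "span" && m.ClassName == "c-price");

        // The CSS selector behaves like the token-based check.
        var bySelector = document.QuerySelectorAll("span.c-price").Length;

        return (byClassList, byClassName, bySelector);
    }

    public static async Task Main()
    {
        var counts = await CountMatchesAsync(SampleHtml);
        Console.WriteLine($"ClassList.Contains: {counts.ByClassList}");
        Console.WriteLine($"ClassName equality: {counts.ByClassName}");
        Console.WriteLine($"QuerySelectorAll:   {counts.BySelector}");
    }
}
```

With this markup, the token-based check and the selector both find two spans, while the ClassName comparison finds only the span whose class attribute is exactly c-price.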

Dealing with dynamic pages

This is far more complicated. The basics are the same as previously, but the engine needs to deliver a lot more than just the previously mentioned requirements. Additionally, we need

A JavaScript engine
A valid CSSOM
A fake (or even fully computed) rendering tree
Many more of the DOM interfaces found in real browsers (e.g., navigator, full history, web workers, ...) - the list is limitless here

While there is a project that delivers an experimental (and limited) C#-only JS engine to AngleSharp, the latter two requirements (the rendering tree and the additional DOM interfaces) cannot be fully met right now. Furthermore, the CSSOM may not be complete enough for some web applications. Keep in mind that these pages are potentially designed for real browsers. They make certain assumptions. They may even require user input (e.g., Google Captcha).

Long story short:

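var config = Configuration.Default
    .WithDefaultLoader()  // allow the context to request remote documents
    .WithCss()            // CSS support, needed for a CSSOM
    .WithJavaScript();    // the experimental JS engine mentioned above; maybe even more
var context = BrowsingContext.New(config);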

The Task awaited when opening a new document corresponds to the load event in the DOM. It therefore does not complete as soon as the document has been downloaded and parsed, but only once all scripts have been loaded (and potentially run), including any resources that needed to be downloaded.

Hope this helps a bit!

Question 2

I'm trying to extract the prices from the website mentioned below. I'm using AngleSharp for the extraction. On the website, the prices are listed like this (as an example):

<span class="c-price">650.00                            </span>

I'm using the following code for the extraction.

using AngleSharp.Parser.Html;
using System;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading;

//Make the request
var uri = "https://meadjohnson.world.tmall.com/search.htm?search=y&orderType=defaultSort&scene=taobao_shop";
var cancellationToken = new CancellationTokenSource();
var httpClient = new HttpClient();
var request = await httpClient.GetAsync(uri);
cancellationToken.Token.ThrowIfCancellationRequested();

//Get the response stream
var response = await request.Content.ReadAsStreamAsync();
cancellationToken.Token.ThrowIfCancellationRequested();

//Parse the stream
var parser = new HtmlParser();
var document = parser.Parse(response);

//Do something with LINQ
var pricesListItemsLinq = document.All
     .Where(m => m.LocalName == "span" && m.ClassList.Equals("c-price"));
Console.WriteLine(pricesListItemsLinq.Count());

However, I'm not getting any items, even though they are there on the website. What am I doing wrong? If AngleSharp isn't the recommended method, what should I use? And what code should I use?
