I'm trying to extract the prices from the website mentioned below. I'm using AngleSharp for the extraction. On the website, the prices are listed like this (as an example):
<span class="c-price">650.00 </span>
I'm using the following code for the extraction.
using AngleSharp.Parser.Html;
using System.Net;
using System.Net.Http;
using System.Linq;
using System.Threading;
//Make the request
var uri = "https://meadjohnson.world.tmall.com/search.htm?search=y&orderType=defaultSort&scene=taobao_shop";
var cancellationToken = new CancellationTokenSource();
var httpClient = new HttpClient();
var request = await httpClient.GetAsync(uri);
cancellationToken.Token.ThrowIfCancellationRequested();
//Get the response stream
var response = await request.Content.ReadAsStreamAsync();
cancellationToken.Token.ThrowIfCancellationRequested();
//Parse the stream
var parser = new HtmlParser();
var document = parser.Parse(response);
//Do something with LINQ
var pricesListItemsLinq = document.All
.Where(m => m.LocalName == "span" && m.ClassList.Equals("c-price"));
Console.WriteLine(pricesListItemsLinq.Count());
However, I'm not getting any items, but they are there on the website. What am I doing wrong? If AngleSharp isn't the recommended method, what should I use? And what code should I use?
I am late to the party, but I will try to bring some sanity here.
For this we require the following set of tools / functionality:
AngleSharp gives us all these options (minus a connection to the certificate store / known CAs; so in order to use HTTPS we must do some additional configuration, e.g., to accept all certificates).
We start by creating an AngleSharp configuration that defines which capabilities are available to the browsing engine. This engine is exposed in the form of a "browsing context", which can be regarded as a headless tab. In this tab we can open a new document (from a local source, a constructed source, or a remote source).
var config = Configuration.Default.WithDefaultLoader();
var context = BrowsingContext.New(config);
var document = await context.OpenAsync("http://example.com");
Once we have the document we can use CSS query selectors to obtain certain elements. These elements can be used to gather the information we look for.
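As a minimal sketch of these two steps (open a document, then query it), the snippet below loads a document from a constructed source and reads the prices out of it. The markup is a hypothetical stand-in for the real page, and parsing the text as an invariant-culture decimal is an assumption about the price format, not something the website guarantees.

```csharp
using System;
using System.Globalization;
using System.Linq;
using AngleSharp;

// Hypothetical markup standing in for the real page.
const string html = @"<ul>
  <li><span class=""c-price"">650.00 </span></li>
  <li><span class=""c-price"">1280.50 </span></li>
</ul>";

// No loader is needed here because nothing is fetched over the network.
var context = BrowsingContext.New(Configuration.Default);

// Open a document from a constructed source instead of a remote URL.
var document = await context.OpenAsync(req => req.Content(html));

// Select the elements with a CSS query, then convert their text to numbers.
// Assumption: prices use '.' as the decimal separator.
var prices = document.QuerySelectorAll("span.c-price")
    .Select(m => decimal.Parse(m.TextContent.Trim(), CultureInfo.InvariantCulture))
    .ToList();

foreach (var price in prices)
    Console.WriteLine(price.ToString(CultureInfo.InvariantCulture));
```

Swapping the constructed source for a URL only changes the OpenAsync call (and requires WithDefaultLoader in the configuration); the query part stays the same.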
AngleSharp embraces LINQ (or IEnumerable in general); however, it makes sense to leave as much of the work as possible to the query selectors.
So instead of
var pricesListItemsLinq = document.All
.Where(m => m.LocalName == "span" && m.ClassList.Equals("c-price"));
We write
var pricesListItemsLinq = document.QuerySelectorAll("span.c-price");
This is also much more robust. The ClassList is a complex object giving access to a list of classes, so comparing it to a string with Equals never matches; you either meant ClassList.Contains or ClassName.Equals (the latter operating on the string representation). Note: The two versions are not equivalent, because the former looks for a class within the list of classes, while the latter looks for a match of the whole class serialization (thus posing an extra boundary condition on the match: it needs to be the only class).
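To make the difference concrete, here is a small sketch that runs the variants against a constructed document. The markup (including the extra "sale" class) is made up for illustration; only the first span mirrors the example from the question.

```csharp
using System;
using System.Linq;
using AngleSharp;

// Hypothetical markup: one span with only c-price, one with an extra class,
// and one unrelated span.
const string html = @"<span class=""c-price"">650.00 </span>
<span class=""c-price sale"">499.00 </span>
<span class=""label"">n/a</span>";

var context = BrowsingContext.New(Configuration.Default);
var document = await context.OpenAsync(req => req.Content(html));
var spans = document.All.Where(m => m.LocalName == "span").ToList();

// The token list object is compared with a string, so nothing matches.
var byEquals = spans.Count(m => m.ClassList.Equals("c-price"));

// Matches every span that has c-price among its classes.
var byContains = spans.Count(m => m.ClassList.Contains("c-price"));

// Matches only spans whose whole class attribute is exactly "c-price".
var byClassName = spans.Count(m => m.ClassName == "c-price");

// The CSS selector behaves like the Contains variant.
var bySelector = document.QuerySelectorAll("span.c-price").Length;

Console.WriteLine($"{byEquals} {byContains} {byClassName} {bySelector}");
```

With this markup the Contains and selector variants find both price spans, the ClassName comparison finds only the first one, and the Equals comparison finds none.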
Scraping pages whose content is produced by scripts is far more complicated. The basics are the same as previously, but the engine needs to deliver a lot more than just the previously mentioned requirements. Additionally, we need
While there is a project that delivers an experimental (and limited) C#-only JS engine to AngleSharp, the latter two requirements cannot be completely fulfilled right now. Furthermore, the CSSOM may not be complete enough for some web applications. Keep in mind that these pages are potentially designed for real browsers. They make certain assumptions. They may even require user input (e.g., Google Captcha).
Long story short:
var config = Configuration.Default
.WithDefaultLoader()
.WithCss()
.WithJavaScript(); // maybe even more
var context = BrowsingContext.New(config);
The Task behind the await when opening a new document is equivalent to a load event in the DOM. Thus it will not complete as soon as the document has been downloaded and parsed, but only once all scripts have been loaded (and potentially run), including any resources that needed to be downloaded.
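Putting the pieces together, a hedged sketch of the dynamic variant could look as follows. It reuses the configuration from above and the URL from the question. It assumes that the CSS and JavaScript extension packages matching your AngleSharp version are installed (newer releases may name the extension methods differently), and whether any prices are actually found depends on how far the experimental script engine gets with this particular page.

```csharp
using System;
using System.Linq;
using AngleSharp;

// Assumption: WithCss and WithJavaScript come from the extension packages
// that match the installed AngleSharp version.
var config = Configuration.Default
    .WithDefaultLoader()
    .WithCss()
    .WithJavaScript();
var context = BrowsingContext.New(config);

// URL taken from the question.
var uri = "https://meadjohnson.world.tmall.com/search.htm?search=y&orderType=defaultSort&scene=taobao_shop";

// Completes at the equivalent of the DOM load event, i.e. after scripts
// have been loaded (and potentially run), not right after parsing.
var document = await context.OpenAsync(uri);

// Query only after the await, so script-generated elements had a chance to appear.
var prices = document.QuerySelectorAll("span.c-price")
    .Select(m => m.TextContent.Trim())
    .ToList();

Console.WriteLine(prices.Count);
```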
Hope this helps a bit!