C# - Best Approach to Parsing Webpage?

c# html xml html-content-extraction

MattSayar · Nov 18, 2008 · Viewed 19.7k times

I've saved an entire webpage's HTML to a string, and now I want to grab the "href" values from the links, preferably in a way that lets me save each one to its own string later. What's the best way to do this?

I've tried saving the string as an .xml doc and parsing it using an XPathDocument navigator, but (surprise surprise) it doesn't navigate a not-really-an-xml-document too well.

Are regular expressions the best way to achieve what I'm trying to accomplish?

Answer

I can recommend the HTML Agility Pack. I've used it in a few cases where I needed to parse HTML and it works great. It is built to cope with HTML that isn't well-formed XML, which is exactly where XPathDocument falls over. Once you load your HTML into it, you can use XPath expressions to query the document and get your anchor tags (as well as just about anything else in there).

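HtmlDocument yourDoc = new HtmlDocument(); // HtmlDocument lives in the HtmlAgilityPack namespace
yourDoc.LoadHtml(html); // html is the string that holds the saved page
HtmlNodeCollection links = yourDoc.DocumentNode.SelectNodes("//a[@href]"); // null when nothing matches
int someCount = links == null ? 0 : links.Count;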
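
To pull out the "href" values themselves, as the question asks, something along these lines may work. It is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced in the project. LinkExtractor and ExtractHrefs are illustrative names of my own, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Hypothetical helper: returns the href value of every anchor in the HTML string.
    public static List<string> ExtractHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hrefs = new List<string>();

        // Select only anchors that carry an href attribute.
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return hrefs;
        }

        foreach (HtmlNode anchor in anchors)
        {
            // The second argument is the fallback if the attribute is missing.
            hrefs.Add(anchor.GetAttributeValue("href", string.Empty));
        }

        return hrefs;
    }
}
```

Each entry in the returned list is an ordinary string, so you can copy the values into separate variables or filter them later. Relative links come back as written in the page, so resolving them against the page's URL is a separate step.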
