My Windows Forms application hosts a WebBrowser control that displays a page full of links. I'm trying to find all the anchor elements in the loaded HtmlDocument and read their href attributes so I can provide a multi-file download interface in C#. Below is a simplified version of the function where I find and process the anchor elements:
public void ListAnchors(string baseUrl, HtmlDocument doc) // doc is retrieved from webBrowser.Document
{
    HtmlElementCollection anchors = doc.GetElementsByTagName("a");
    foreach (HtmlElement el in anchors)
    {
        string href = el.GetAttribute("href");
        Debug.WriteLine("el.Parent.InnerHtml = " + el.Parent.InnerHtml);
        Debug.WriteLine("el.GetAttribute(\"href\") = " + href);
    }
}
The anchor tags are all surrounded by <PRE> tags. The hostname from which I'm loading the HTML is a local machine on the network (lts930411). The source HTML for one entry looks like this:
<PRE><A href="/A/a150923a.lts">a150923a.lts</A></PRE>
The output of the above C# code for one anchor element is this:
el.Parent.InnerHtml = <A href="/A/a150923a.lts">a150923a.lts</A>
el.GetAttribute("href") = http://lts930411/A/a150923a.lts
Why is el.GetAttribute("href") adding the scheme and hostname prefix (http://lts930411) rather than returning the literal value of the href attribute from the source HTML? Is this behavior I can count on? Is this "feature" documented somewhere? (I was prepending the base URL myself, but that gave me addresses like http://lts930411http://lts930411/A/a150923a.lts. I'd be okay with just expecting the full URL if I could find documentation promising this will always happen.)
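As stated in the IHTMLAnchorElement.href documentation, relative URLs are resolved against the location of the document containing the a element. That is why GetAttribute("href") returns the absolute URL rather than the literal attribute text.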
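The same resolution rule can be reproduced with System.Uri, which may help if you were prepending the base URL by hand. This is only a sketch: the HrefResolutionDemo class and Resolve helper are hypothetical names, and the document URL http://lts930411/ is assumed from the hostname in the question. It shows that combining a base Uri with an href gives the same absolute URL whether the href is relative or already absolute, so nothing gets doubled.

```csharp
using System;

static class HrefResolutionDemo
{
    // Hypothetical helper: resolves href against the URL of the containing document.
    // If href is already absolute, the Uri(Uri, string) constructor returns it unchanged.
    public static string Resolve(string documentUrl, string href)
    {
        var baseUri = new Uri(documentUrl);
        return new Uri(baseUri, href).AbsoluteUri;
    }

    static void Main()
    {
        // Relative href, as written in the source HTML
        Console.WriteLine(Resolve("http://lts930411/", "/A/a150923a.lts"));
        // Absolute href, as returned by GetAttribute("href")
        Console.WriteLine(Resolve("http://lts930411/", "http://lts930411/A/a150923a.lts"));
    }
}
```

Both calls print http://lts930411/A/a150923a.lts, so passing either form through a helper like this avoids results such as http://lts930411http://lts930411/A/a150923a.lts.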
As an option to get the untouched href attribute values, you can use this code:
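// Requires: using System.Linq; using System.Text.RegularExpressions;
// [^\"]* stops at the closing quote, so later quoted attributes are not captured.
var expression = "href=\"([^\"]*)\"";
var list = document.GetElementsByTagName("a")
    .Cast<HtmlElement>()
    .Select(x => Regex.Match(x.OuterHtml, expression, RegexOptions.IgnoreCase))
    .Where(m => m.Success)
    .Select(m => m.Groups[1].Value)
    .ToList();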
The above code returns the untouched href attribute value of every a tag in the document that has one.
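To try the extraction idea without a WebBrowser control, here is a minimal console sketch that applies a regular expression to plain strings. The RawHrefDemo class, the ExtractHref helper, and the second sample string (with a title attribute) are hypothetical and not part of the original page. The pattern uses a character class rather than .* so that the match stops at the first closing quote. It assumes double-quoted attribute values and would also match attribute names that merely end in href.

```csharp
using System;
using System.Text.RegularExpressions;

static class RawHrefDemo
{
    // Hypothetical helper: returns the literal href value from an anchor's markup,
    // or null when no double-quoted href attribute is found.
    public static string ExtractHref(string outerHtml)
    {
        var match = Regex.Match(outerHtml, "href\\s*=\\s*\"([^\"]*)\"", RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value : null;
    }

    static void Main()
    {
        // Markup as shown in the question
        Console.WriteLine(ExtractHref("<A href=\"/A/a150923a.lts\">a150923a.lts</A>"));
        // Markup with a second quoted attribute (assumed sample)
        Console.WriteLine(ExtractHref("<A href=\"/A/a150923a.lts\" title=\"log\">a150923a.lts</A>"));
    }
}
```

Both calls print /A/a150923a.lts. For markup that is less regular than this, an HTML parser is likely a safer choice than a regular expression.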