I am taking a stab at html agility pack and having trouble finding the right way to go about this.
For example:
var findclasses = _doc.DocumentNode.Descendants("div").Where(d => d.Attributes.Contains("class"));
However, obviously you can add classes to a lot more then divs so I tried this..
var allLinksWithDivAndClass = _doc.DocumentNode.SelectNodes("//*[@class=\"float\"]");
But that doesn't handle the cases where you add multiple classes and "float" is just one of them like this..
class="className float anotherclassName"
Is there a way to handle all of this? I basically want to select all nodes that have a class = and contains float.
**Answer has been documented on my blog with a full explanation at: Html Agility Pack Get All Elements by Class
(Updated 2018-03-17)
The problem, as you've spotted, is that String.Contains
does not perform a word-boundary check, so Contains("float")
will return true
for both "foo float bar" (correct) and "unfloating" (which is incorrect).
The solution is to ensure that "float" (or whatever your desired class-name is) appears alongside a word-boundary at both ends. A word-boundary is either the start (or end) of a string (or line), whitespace, certain punctuation, etc. In most regular-expressions this is \b
. So the regex you want is simply: \bfloat\b
.
A downside to using a Regex
instance is that they can be slow to run if you don't use the .Compiled
option - and they can be slow to compile. So you should cache the regex instance. This is more difficult if the class-name you're looking for changes at runtime.
Alternatively you can search a string for words by word-boundaries without using a regex by implementing the regex as a C# string-processing function, being careful not to cause any new string or other object allocation (e.g. not using String.Split
).
Suppose you just want to look for elements with a single, design-time specified class-name:
class Program {
private static readonly Regex _classNameRegex = new Regex( @"\bfloat\b", RegexOptions.Compiled );
private static IEnumerable<HtmlNode> GetFloatElements(HtmlDocument doc) {
return doc
.Descendants()
.Where( n => n.NodeType == NodeType.Element )
.Where( e => e.Name == "div" && _classNameRegex.IsMatch( e.GetAttributeValue("class", "") ) );
}
}
If you need to choose a single class-name at runtime then you can build a regex:
private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String className) {
Regex regex = new Regex( "\\b" + Regex.Escape( className ) + "\\b", RegexOptions.Compiled );
return doc
.Descendants()
.Where( n => n.NodeType == NodeType.Element )
.Where( e => e.Name == "div" && regex.IsMatch( e.GetAttributeValue("class", "") ) );
}
If you have multiple class-names and you want to match all of them, you could create an array of Regex
objects and ensure they're all matching, or combine them into a single Regex
using lookarounds, but this results in horrendously complicated expressions - so using a Regex[]
is probably better:
using System.Linq;
private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String[] classNames) {
Regex[] exprs = new Regex[ classNames.Length ];
for( Int32 i = 0; i < exprs.Length; i++ ) {
exprs[i] = new Regex( "\\b" + Regex.Escape( classNames[i] ) + "\\b", RegexOptions.Compiled );
}
return doc
.Descendants()
.Where( n => n.NodeType == NodeType.Element )
.Where( e =>
e.Name == "div" &&
exprs.All( r =>
r.IsMatch( e.GetAttributeValue("class", "") )
)
);
}
The advantage of using a custom C# method to do string matching instead of a regex is hypothetically faster performance and reduced memory usage (though Regex
may be faster in some circumstances - always profile your code first, kids!)
This method below: CheapClassListContains
provides a fast word-boundary-checking string matching function that can be used the same way as regex.IsMatch
:
private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String className) {
return doc
.Descendants()
.Where( n => n.NodeType == NodeType.Element )
.Where( e =>
e.Name == "div" &&
CheapClassListContains(
e.GetAttributeValue("class", ""),
className,
StringComparison.Ordinal
)
);
}
/// <summary>Performs optionally-whitespace-padded string search without new string allocations.</summary>
/// <remarks>A regex might also work, but constructing a new regex every time this method is called would be expensive.</remarks>
private static Boolean CheapClassListContains(String haystack, String needle, StringComparison comparison)
{
if( String.Equals( haystack, needle, comparison ) ) return true;
Int32 idx = 0;
while( idx + needle.Length <= haystack.Length )
{
idx = haystack.IndexOf( needle, idx, comparison );
if( idx == -1 ) return false;
Int32 end = idx + needle.Length;
// Needle must be enclosed in whitespace or be at the start/end of string
Boolean validStart = idx == 0 || Char.IsWhiteSpace( haystack[idx - 1] );
Boolean validEnd = end == haystack.Length || Char.IsWhiteSpace( haystack[end] );
if( validStart && validEnd ) return true;
idx++;
}
return false;
}
HtmlAgilityPack is somewhat stagnated doesn't support .querySelector
and .querySelectorAll
, but there are third-party libraries that extend HtmlAgilityPack with it: namely Fizzler and CssSelectors. Both Fizzler and CssSelectors implement QuerySelectorAll
, so you can use it like so:
private static IEnumerable<HtmlNode> GetDivElementsWithFloatClass(HtmlDocument doc) {
return doc.QuerySelectorAll( "div.float" );
}
With runtime-defined classes:
private static IEnumerable<HtmlNode> GetDivElementsWithClasses(HtmlDocument doc, IEnumerable<String> classNames) {
String selector = "div." + String.Join( ".", classNames );
return doc.QuerySelectorAll( selector );
}