I have an HTML string like this:
<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>
I want to strip all HTML tags so that the resulting string becomes:
foo bar baz
Based on another post here on SO, I came up with this function (which uses the Html Agility Pack):
Public Shared Function stripTags(ByVal html As String) As String
    Dim plain As String = String.Empty
    Dim htmldoc As New HtmlAgilityPack.HtmlDocument
    htmldoc.LoadHtml(html)
    Dim invalidNodes As HtmlAgilityPack.HtmlNodeCollection = htmldoc.DocumentNode.SelectNodes("//html|//body|//p|//a")
    If Not htmldoc Is Nothing Then
        For Each node In invalidNodes
            node.ParentNode.RemoveChild(node, True)
        Next
    End If
    Return htmldoc.DocumentNode.WriteContentTo
End Function
Unfortunately, this does not return what I expect. Instead it gives:
bazbarfoo
Where am I going wrong, and is this the best approach?
Regards and happy coding!
UPDATE: Based on the answer below, I came up with this function, which might be useful to others:
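Public Shared Function stripTags(ByVal html As String) As String
    Dim htmldoc As New HtmlAgilityPack.HtmlDocument
    ' Insert line breaks after paragraphs and in place of <br/> before the tags are dropped.
    ' New String(...) expects a Char, so the blank line is built by concatenation instead.
    Dim blankLine As String = Environment.NewLine & Environment.NewLine
    htmldoc.LoadHtml(html.Replace("</p>", "</p>" & blankLine).Replace("<br/>", Environment.NewLine))
    Return htmldoc.DocumentNode.InnerText
End Function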
Why not just return htmldoc.DocumentNode.InnerText
instead of removing all the non-text nodes? It should give you what you want.
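A minimal sketch of that suggestion, assuming the Html Agility Pack is referenced in the project. The module name HtmlText and the entity-decoding step are my own additions for illustration, not part of the question; drop the DeEntitize call if you want the raw InnerText.

```vbnet
Imports HtmlAgilityPack

' Hypothetical helper module; the name is illustrative only.
Public Module HtmlText

    ' Returns the text content of an HTML string with all tags dropped.
    Public Function StripTags(ByVal html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' InnerText concatenates the text nodes in document order,
        ' so no nodes need to be removed by hand.
        Dim text As String = doc.DocumentNode.InnerText

        ' Optional (my assumption that this is wanted): decode entities
        ' such as &amp; that may still be present in the text.
        Return HtmlEntity.DeEntitize(text)
    End Function

End Module
```

Calling HtmlText.StripTags on the sample string from the question should then give foo bar baz, with the words in their original order.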