Stripping all HTML tags with Html Agility Pack

Muleskinner · Jun 29, 2010 · Viewed 16.6k times

I have an HTML string like this:

<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>

I wish to strip all HTML tags so that the resulting string becomes:

foo bar baz

Based on another post here on SO, I came up with this function (which uses the Html Agility Pack):

  Public Shared Function stripTags(ByVal html As String) As String
    Dim plain As String = String.Empty
    Dim htmldoc As New HtmlAgilityPack.HtmlDocument

    htmldoc.LoadHtml(html)
    Dim invalidNodes As HtmlAgilityPack.HtmlNodeCollection = htmldoc.DocumentNode.SelectNodes("//html|//body|//p|//a")

    If Not htmldoc Is Nothing Then
      For Each node In invalidNodes
        node.ParentNode.RemoveChild(node, True)
      Next
    End If

    Return htmldoc.DocumentNode.WriteContentTo
  End Function

Unfortunately, this does not return what I expect. Instead it gives:

bazbarfoo

Where am I going wrong, and is this the best approach?

Regards and happy coding!

UPDATE: Based on the answer below, I came up with this function, which might be useful to others:

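  Public Shared Function stripTags(ByVal html As String) As String
    Dim htmldoc As New HtmlAgilityPack.HtmlDocument
    ' Two line breaks after each paragraph. New String(Environment.NewLine, 2) would
    ' repeat only the first character of the newline sequence, so concatenate instead.
    Dim paragraphBreak As String = Environment.NewLine & Environment.NewLine
    ' Insert the line breaks before parsing, since InnerText drops the tags themselves
    htmldoc.LoadHtml(html.Replace("</p>", "</p>" & paragraphBreak).Replace("<br/>", Environment.NewLine))
    Return htmldoc.DocumentNode.InnerText
  End Function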

Answer

tvanfosson · Jun 29, 2010

Why not just return htmldoc.DocumentNode.InnerText instead of removing all the non-text nodes? It should give you what you want.
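
A minimal sketch of that suggestion, written against the sample input from the question. The module name HtmlText and the function name StripTagsViaInnerText are made up for illustration. The call to HtmlEntity.DeEntitize is an addition that is not part of the answer: it assumes you also want entities such as &amp; decoded, since InnerText returns them as written in the markup.

```vbnet
Imports HtmlAgilityPack

Public Module HtmlText
    ' Hypothetical helper: returns the text content of an HTML string.
    Public Function StripTagsViaInnerText(ByVal html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' InnerText concatenates the text of all descendant nodes in document order,
        ' so no nodes have to be removed by hand.
        Dim text As String = doc.DocumentNode.InnerText

        ' Assumption beyond the answer: decode entities (for example &amp; to &).
        Return HtmlEntity.DeEntitize(text)
    End Function
End Module
```

For the input in the question, this should give "foo bar baz". Depending on the Html Agility Pack version, InnerText may also include the contents of script, style, or comment nodes, so pages that contain those may need extra filtering.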