HtmlAgilityPack is not removing all HTML tags. How can I solve this efficiently?

Obsivus · Jun 1, 2013 · Viewed 10.8k times

I am using the following method to strip all HTML from a string:

public static string StripHtmlTags(string html)
{
    if (String.IsNullOrEmpty(html)) return "";
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    return doc.DocumentNode.InnerText;
}

But it seems to ignore the following tag: […]

So the returned string is basically:

> A hungry thief who stole a rack of pork ribs from a grocery store has
> been sentenced to spend 50 years in prison. Willie Smith Ward felt the
> full force of the law after being convicted of the crime in Waco,
> Texas, on Wednesday. The 43-year-old may feel slightly aggrieved over
> the severity of the […]

How can I make sure that these kinds of tags get stripped?

Any kind of help is appreciated, thanks.

Answer

Damith · Jun 1, 2013

Try HttpUtility.HtmlDecode (from the System.Web namespace) on the extracted text:

public static string StripHtmlTags(string html)
{
    if (String.IsNullOrEmpty(html)) return "";
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    return HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);
}

HtmlDecode will convert […] to […]
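
To see the decoding step in isolation, here is a minimal sketch. It assumes the leftover text is an HTML character entity rather than a real tag, which is what InnerText would leave behind. The sample string is made up for illustration, and System.Net.WebUtility.HtmlDecode is used here only so the snippet runs without a System.Web reference; HttpUtility.HtmlDecode should behave the same way for this input.

```csharp
using System;
using System.Net;

public static class EntityDecodeDemo
{
    public static void Main()
    {
        // Hypothetical sample: text as InnerText might return it,
        // with the tags gone but character entities still encoded.
        string innerText = "Fish &amp; chips [&hellip;]";

        // Decode named entities into the characters they stand for.
        // The decoded value is "Fish & chips […]".
        string decoded = WebUtility.HtmlDecode(innerText);

        Console.WriteLine(decoded);
    }
}
```

If the string comes from HtmlAgilityPack anyway, its own HtmlEntity.DeEntitize(string) helper is another option that avoids the extra namespace.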