How can I strip HTML tags from a string in ASP.NET?

daniel · Apr 24, 2009 · Viewed 135.8k times

Using ASP.NET, how can I strip the HTML tags from a given string reliably (i.e. not using regex)? I am looking for something like PHP's strip_tags.

Example:

<ul><li>Hello</li></ul>

Output:

"Hello"

I am trying not to reinvent the wheel, but I have not found anything that meets my needs so far.

Answer

Tomalak · Apr 24, 2009

If it is just stripping all HTML tags from a string, this works reliably with regex as well. Replace:

<[^>]*(>|$)

with the empty string, globally. Don't forget to normalize the string afterwards, replacing:

[\s\r\n]+

with a single space, and trimming the result. Optionally replace any HTML character entities back to the actual characters.
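The three steps above could be sketched in C# roughly as follows. This is only a minimal sketch: the class name HtmlStripper and the method name StripTags are hypothetical, and it assumes System.Net.WebUtility is available (.NET 4.0 or later; on older ASP.NET, HttpUtility.HtmlDecode plays the same role). It uses \s+ in place of [\s\r\n]+, since \s already matches \r and \n.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating the regex approach described above.
public static class HtmlStripper
{
    // A tag, or an unterminated tag that runs to the end of the input.
    private static readonly Regex Tags = new Regex(@"<[^>]*(>|$)");

    // Any run of whitespace, including line breaks.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // 1. Remove every tag.
        string text = Tags.Replace(html, string.Empty);

        // 2. Collapse whitespace runs to a single space and trim.
        text = Whitespace.Replace(text, " ").Trim();

        // 3. Optional: turn character entities back into characters.
        return WebUtility.HtmlDecode(text);
    }
}
```

With the example from the question, HtmlStripper.StripTags("<ul><li>Hello</li></ul>") returns "Hello". Note that step 3 turns &lt; back into <, so if the result is written into a page again, it should be encoded first (for example with WebUtility.HtmlEncode).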

Note:

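  1. There is a limitation: HTML and XML allow > inside attribute values. When the regex encounters such a value, it ends the match at that > and leaves the remainder of the tag in the output.
  2. The solution is technically safe, as in: the result will never contain anything that could be used for cross-site scripting or to break a page layout, because the pattern consumes every < in the input. This holds only as long as character entities are not decoded afterwards. The output is just not very clean.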
  3. As with all things HTML and regex: use a proper parser if you must get it right under all circumstances.
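To illustrate the limitation in note 1, here is a small quote-aware scanner in C#. It is not the proper parser that note 3 recommends, only a hand-written sketch that keeps track of quoted attribute values so that a > inside them does not end the tag. The class name QuoteAwareStripper is hypothetical. It makes no attempt to handle comments (an apostrophe inside <!-- ... --> would be treated as an opening quote), and the text content of script and style elements is kept.

```csharp
using System.Text;

// Hypothetical helper: strips tags while respecting quoted attribute values.
public static class QuoteAwareStripper
{
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        var output = new StringBuilder(html.Length);
        bool inTag = false;  // true between '<' and the tag's closing '>'
        char quote = '\0';   // the open quote character inside a tag, or '\0'

        foreach (char c in html)
        {
            if (!inTag)
            {
                if (c == '<')
                {
                    inTag = true;        // start of a tag: stop copying
                }
                else
                {
                    output.Append(c);    // ordinary text: keep it
                }
            }
            else if (quote != '\0')
            {
                if (c == quote)
                {
                    quote = '\0';        // matching quote closes the value
                }
            }
            else if (c == '"' || c == '\'')
            {
                quote = c;               // opening quote: ignore '>' until it closes
            }
            else if (c == '>')
            {
                inTag = false;           // end of the tag
            }
        }

        return output.ToString();
    }
}
```

For the input <a title="a > b">link</a>, this returns "link", whereas the regex approach ends the first match at the > inside the attribute and leaves b">link behind after trimming. Whitespace normalization and entity decoding would still have to be applied separately.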