I've got the common situation where I have user input that uses a subset of HTML (entered via tinyMCE). I need some server-side protection against XSS attacks and am looking for a well-tested tool that people are actually using for this. On the PHP side I see lots of libraries like HTMLPurifier that do the job, but I can't seem to find anything comparable in .NET.
I'm basically looking for a library that filters input down to a whitelist of tags and of attributes on those tags, and that does the right thing with "difficult" attributes like a:href and img:src.
I've seen Jeff Atwood's post at http://refactormycode.com/codes/333-sanitize-html, but I don't know how up to date it is. Does it have any bearing at all on what the site is currently using? And in any case I'm not sure I'm comfortable with that strategy of trying to pick out valid input with regular expressions.
This blog post lays out what seems to be a much more compelling strategy:
The method is to parse the HTML into a DOM, validate it against a whitelist, then rebuild valid HTML from it. If the parser can handle malformed HTML sensibly, great. If not, no big deal -- I can demand well-formed HTML, since the users should be going through the tinyMCE editor. Either way I'm re-emitting only HTML I know is safe and well-formed.
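For concreteness, here's a rough sketch of that parse/validate/rebuild idea using the HTML Agility Pack (a .NET parser that copes with malformed markup). The class name and the whitelist contents below are my own illustrative assumptions, not a vetted sanitizer:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class WhitelistSanitizer
{
    // Illustrative whitelist only -- tune it to your own TinyMCE configuration.
    private static readonly HashSet<string> AllowedTags = new HashSet<string>(
        new[] { "p", "br", "strong", "em", "ul", "ol", "li", "a", "img" },
        StringComparer.OrdinalIgnoreCase);

    private static readonly Dictionary<string, HashSet<string>> AllowedAttributes =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase)
        {
            { "a",   new HashSet<string>(new[] { "href" }, StringComparer.OrdinalIgnoreCase) },
            { "img", new HashSet<string>(new[] { "src", "alt" }, StringComparer.OrdinalIgnoreCase) },
        };

    public static string Sanitize(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Snapshot the node list first, because we mutate the tree as we go.
        foreach (var node in doc.DocumentNode.Descendants().ToList())
        {
            if (node.NodeType == HtmlNodeType.Comment)
            {
                node.Remove();
                continue;
            }

            if (node.NodeType != HtmlNodeType.Element)
                continue;

            if (!AllowedTags.Contains(node.Name))
            {
                // Drop the whole subtree; a gentler option is to keep the children.
                node.Remove();
                continue;
            }

            // Strip every attribute that isn't whitelisted for this tag.
            var allowed = AllowedAttributes.TryGetValue(node.Name, out var set)
                ? set
                : new HashSet<string>(StringComparer.OrdinalIgnoreCase);

            foreach (var attr in node.Attributes.ToList())
            {
                if (!allowed.Contains(attr.Name))
                    node.Attributes.Remove(attr);
            }
        }

        // Rebuild HTML from the pruned DOM.
        return doc.DocumentNode.OuterHtml;
    }
}
```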
The problem is that's just a description, without a link to any library that actually implements that algorithm.
Does such a library exist? If not, what would be a good .NET HTML parsing engine? And what regular expressions should be used to perform extra validation of attributes like a:href and img:src? Am I missing something else important here?
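On the a:href / img:src point, one possible sketch (swapping the regex idea for System.Uri parsing, with a scheme whitelist that is purely my assumption) looks like this; the awkward edge cases it hints at are exactly why a maintained library is preferable:

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlValidator
{
    // Assumed scheme whitelist -- adjust for your needs (e.g. add "mailto").
    private static readonly string[] AllowedSchemes = { "http", "https" };

    public static bool IsSafeUrl(string value)
    {
        if (string.IsNullOrWhiteSpace(value))
            return false;

        // Browsers ignore embedded tabs/newlines in URLs ("java\tscript:"),
        // so strip them before inspecting the scheme.
        var cleaned = Regex.Replace(value.Trim(), "[\t\r\n]", "");

        if (!Uri.TryCreate(cleaned, UriKind.RelativeOrAbsolute, out var uri))
            return false;

        // Relative URLs ("/images/logo.png") have no scheme; allow them.
        if (!uri.IsAbsoluteUri)
            return true;

        foreach (var scheme in AllowedSchemes)
        {
            if (uri.Scheme.Equals(scheme, StringComparison.OrdinalIgnoreCase))
                return true;
        }

        return false; // rejects javascript:, data:, vbscript:, etc.
    }
}
```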
I don't want to re-implement a buggy wheel here. Surely there are some commonly used libraries out there. Any ideas?
We are using the HtmlSanitizer .NET library, which is open-source, actively maintained, and built specifically for whitelist-based sanitization of untrusted HTML fragments. It's also available on NuGet.
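A minimal usage sketch, assuming a recent version of the package (the namespace is Ganss.Xss in newer releases, Ganss.XSS in older ones; the whitelist values are just examples):

```csharp
// dotnet add package HtmlSanitizer
using Ganss.Xss; // older versions use Ganss.XSS

var sanitizer = new HtmlSanitizer();

// Optionally tighten the default whitelist to roughly what TinyMCE emits.
sanitizer.AllowedTags.Clear();
foreach (var tag in new[] { "p", "br", "strong", "em", "ul", "ol", "li", "a", "img" })
    sanitizer.AllowedTags.Add(tag);

sanitizer.AllowedAttributes.Clear();
sanitizer.AllowedAttributes.Add("href");
sanitizer.AllowedAttributes.Add("src");
sanitizer.AllowedAttributes.Add("alt");

sanitizer.AllowedSchemes.Add("mailto"); // http/https are allowed by default

var untrusted = "<p onclick=\"alert(1)\">Hi <a href=\"javascript:alert(2)\">there</a></p>";
var safe = sanitizer.Sanitize(untrusted);
// Event handlers and javascript: URLs are stripped from the output.
```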