Secure XSS cleaning function (updated regularly)

zuallauz · Jun 17, 2011 · Viewed 14.1k times

I've been hunting around the net now for a few days trying to figure this out but getting conflicting answers.

Is there a library, class or function for PHP that securely sanitizes/encodes a string against XSS? It needs to be updated regularly to counter new attacks.

I have a few use cases:

Use case 1) I have a plain text field, say for a First Name or Last Name

  • User enters text into field and submits the form
  • Before this is saved to the database I want to a) trim any whitespace off the front and end of the string, and b) strip all HTML tags from the input. It's a name text field, they shouldn't have any HTML in it.
  • Then I will save this to the database with PDO prepared statements.

I'm thinking I could just do trim() and strip_tags(), then use a Sanitize Filter or RegEx with a whitelist of characters. Do they really need characters like ! and ? or < > in their name? Not really.
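The steps in use case 1 could be sketched roughly as follows. This is only an illustration: the helper name cleanName and the whitelist (Unicode letters, spaces, hyphens and apostrophes) are assumptions, not something the post specifies.

```php
<?php
// Hypothetical helper: normalizes a name field before it is stored.
// The allowed character set is an assumption made for illustration.
function cleanName(string $raw): string
{
    // a) trim whitespace, b) strip anything that looks like an HTML tag
    $name = trim(strip_tags($raw));
    // Whitelist: Unicode letters, spaces, hyphens and apostrophes
    $name = preg_replace("/[^\\p{L} '\\-]/u", '', $name);
    // preg_replace returns null on invalid UTF-8; treat that as empty
    return trim((string) $name);
}

$name = cleanName("  <b>Anne-Marie</b> O'Neil!? ");
// $name is now: Anne-Marie O'Neil
```

A value cleaned this way would still be stored through a prepared statement and encoded again when it is output to HTML.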

Use case 2) When outputting the contents from a previously saved database record (or from a previously submitted form) to the View/HTML I want to thoroughly clean it for XSS. NB: It may or may not have gone through the filtering step in use case 1 as it could be a different type of input, so assume no sanitizing has been done.

Initially I thought HTMLPurifier would do the job, but it seems it is not what I need, judging by the reply I got when I posed the question to their support:

Here's the litmus test: if a user submits <b>foo</b> should it show up as the literal text <b>foo</b> or as foo in bold? If the former, you don't need HTML Purifier.

So I'd rather it showed up as the literal text <b>foo</b>, because I don't want any HTML rendered for a simple text field or any JavaScript executing.

So I've been hunting around for a function that will do it all for me. I stumbled across the xss_clean method used by Kohana 3.0, which I'm guessing works, but it's only useful if you want to keep the HTML. It's deprecated as of Kohana 3.1, as they've replaced it with HTMLPurifier. So I'm guessing you're supposed to use HTML::chars() instead, which only does this:

public static function chars($value, $double_encode = TRUE)
{
    return htmlspecialchars( (string) $value, ENT_QUOTES, Kohana::$charset, $double_encode);
}

Now apparently you're supposed to use htmlentities instead as mentioned in quite a few places in Stack Overflow because it's more secure than htmlspecialchars.

  • So how do I use htmlentities properly?
  • Is that all I need?
  • How does it protect against hex, decimal and base64 encoded values being sent from the attacks listed here?

Now I see that the 3rd parameter for the htmlentities method is the charset to be used in conversion. Now my site/db is in UTF-8, but perhaps the form submitted data was not UTF-8 encoded, maybe they submitted ASCII or HEX so maybe I need to convert it to UTF-8 first? That would mean some code like:

$encoding = mb_detect_encoding($input);
$input = mb_convert_encoding($input, 'UTF-8', $encoding);
$input = htmlentities($input, ENT_QUOTES, 'UTF-8');

Yes or no? Then I'm still not sure how to protect against the hex, decimal and base64 possible XSS inputs...

If there's some library or open source PHP framework that can do XSS protection properly I'd be interested to see how they do it in code.

Any help much appreciated, sorry for the long post!

Answer

phihag · Jun 17, 2011

To answer the bold question: Yes, there is. It's called htmlspecialchars.

It needs to be updated regularly to counter new attacks.

The right way to prevent XSS attacks is not to counter specific attacks or to filter/sanitize data, but to encode properly, everywhere.

htmlspecialchars (or htmlentities), combined with a sensible choice of character encoding (i.e. UTF-8) that is specified explicitly, is sufficient to prevent XSS in data that is output as HTML text or inside quoted attribute values. Fortunately, calling htmlspecialchars without an explicit encoding (PHP versions before 5.4 then assume ISO-8859-1) happens to work out for UTF-8, too, because the characters it replaces are all ASCII and UTF-8 encodes those as the same single bytes. If you want to make that explicit, create a helper function:

// Don't forget to specify UTF-8 as the document's encoding
function htmlEncode($s) {
    return htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
}
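As a usage sketch for use case 2 from the question, the helper is applied at output time, at the point where the value is placed into HTML. The variable names and the sample value are made up for illustration.

```php
<?php
// Same helper as above, repeated so this snippet runs on its own.
function htmlEncode($s) {
    return htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
}

// Hypothetical value loaded from the database or a submitted form.
$stored = '<b>foo</b> & "bar"';

// Encode at output time, right where the value is placed into HTML.
$html = '<p>' . htmlEncode($stored) . '</p>';
// $html is now: <p>&lt;b&gt;foo&lt;/b&gt; &amp; &quot;bar&quot;</p>
```

The browser then displays the literal text <b>foo</b> & "bar" instead of rendering any markup.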

Oh, and to address the form worries: Don't try to detect encodings; it's bound to fail. Instead, serve the form in UTF-8. Every browser will then send user input in UTF-8.
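If you still want a safety net for clients that send something else, one option is to validate rather than detect: check that the submitted bytes are well-formed UTF-8 and reject them otherwise. A minimal sketch, assuming the mbstring extension is loaded; the helper name requireUtf8 is hypothetical.

```php
<?php
// Hypothetical helper: accept the value only if it is well-formed UTF-8.
function requireUtf8(string $value): ?string
{
    // mb_check_encoding validates only; it does not guess or convert.
    return mb_check_encoding($value, 'UTF-8') ? $value : null;
}

$ok  = requireUtf8("J\xC3\xB6rg"); // valid UTF-8 bytes for "Jörg"
$bad = requireUtf8("J\xF6rg");     // ISO-8859-1 bytes, not valid UTF-8
// $ok is the unchanged string, $bad is null
```

Rejecting malformed input also avoids a surprise at output time: htmlspecialchars returns an empty string for invalid UTF-8 unless ENT_SUBSTITUTE or ENT_IGNORE is passed.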

Addressing specific concerns:

(...) you're supposed to use htmlentities because htmlspecialchars is vulnerable to UTF-7 XSS exploit.

The UTF-7 XSS exploit can only be applied if the browser thinks a document is encoded in UTF-7. Specifying the document encoding as UTF-8 (in the HTTP header/a meta tag right after <head>) prevents this.
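Declaring the encoding in both places might look like the sketch below. The function name renderPage and the page layout are assumptions for illustration, and the short meta charset form assumes an HTML5 document.

```php
<?php
// Hypothetical page renderer that declares UTF-8 in both places.
function renderPage(string $title, string $bodyText): string
{
    // HTTP header; skipped if output has already started.
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=UTF-8');
    }
    $safeTitle = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
    $safeBody  = htmlspecialchars($bodyText, ENT_QUOTES, 'UTF-8');

    // The meta tag comes right after <head>, before any user data.
    return "<!DOCTYPE html>\n<html>\n<head>\n"
         . "<meta charset=\"UTF-8\">\n"
         . "<title>{$safeTitle}</title>\n</head>\n"
         . "<body><p>{$safeBody}</p></body>\n</html>\n";
}

// A UTF-7 style payload is emitted as plain text in a UTF-8 page.
$page = renderPage('Guestbook', '+ADw-script+AD4-alert(1)+ADw-/script+AD4-');
```

Because the browser is told the document is UTF-8, it has no reason to reinterpret the +ADw- sequences as markup.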

Also if I don't detect the encoding, what's to stop an attacker downloading the html file, then altering it to UTF-7 or some other encoding, then submitting the POST request back to my server from the altered html page?

This attack scenario is unnecessarily complex. The attacker could just craft a UTF-7 string, no need to download anything.

If you accept the attacker's POST (i.e. you're accepting anonymous public user input), your server will just interpret the UTF-7 string as a weird UTF-8 one. That is not a problem; the attacker's post will just show up garbled. The attacker could achieve the same effect (sending strange text) by submitting "grfnlk" a hundred times.

If my method only works for UTF-8 then the XSS attack will get through, no?

No, it won't. Encodings are not magic. An encoding is just a way to interpret a binary string. For example, the string "ö" is encoded as (hexadecimal) 2B 41 50 59 in UTF-7 (and C3 B6 in UTF-8). Decoding 2B 41 50 59 as UTF-8 yields "+APY" - harmless, seemingly random characters.
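The byte values above can be checked with a few lines of PHP. The UTF-7 script payload is an illustrative example, not taken from the question.

```php
<?php
// "ö" as raw bytes in the two encodings mentioned above.
$utf8Bytes = "\xC3\xB6";          // UTF-8
$utf7Bytes = hex2bin('2B415059'); // UTF-7, which is just the ASCII text "+APY"

// A UTF-7 encoded "<script>" payload is plain ASCII without any HTML
// metacharacters, so htmlspecialchars leaves it untouched.
$payload = '+ADw-script+AD4-alert(1)+ADw-/script+AD4-';
$encoded = htmlspecialchars($payload, ENT_QUOTES, 'UTF-8');
// $encoded === $payload; a browser reading the page as UTF-8 shows it
// as literal text.
```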

Also how does htmlentities protect against HEX or other XSS attacks?

Hexadecimal data will be output as just that. An attacker sending "3C" will post the message "3C". "3C" can only become < if you actively interpret hexadecimal input, for example by mapping it to Unicode code points and then outputting those. That just means that if you're accepting data in anything other than plain UTF-8 (for example base32-encoded UTF-8), you first have to unpack your encoding and then apply htmlspecialchars before including the result in HTML code.
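A short sketch of the three cases from the question: hex text, numeric character references, and a transport encoding that the application unpacks itself. Base64 stands in for the transport encoding here as an assumed example.

```php
<?php
// Hex text stays hex text: nothing here decodes it.
$hexOut = htmlspecialchars('3C', ENT_QUOTES, 'UTF-8');
// "3C"

// Numeric character references sent by an attacker are neutralized
// because the leading "&" is itself encoded.
$entityOut = htmlspecialchars('&#x3C;script&#x3E;', ENT_QUOTES, 'UTF-8');
// "&amp;#x3C;script&amp;#x3E;"

// If the application unpacks a transport encoding itself, decode first,
// then encode for HTML.
$transport = base64_encode('<script>alert(1)</script>');
$decoded   = base64_decode($transport, true);
$safe      = htmlspecialchars($decoded, ENT_QUOTES, 'UTF-8');
// "&lt;script&gt;alert(1)&lt;/script&gt;"
```

In each case the encoding step is the last thing that happens before the value reaches the HTML output.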