I am implementing an XSS filter for my web application and also using the ESAPI encoder to sanitise the input.
The patterns I am using are as given below,
// Script fragments
Pattern.compile("<script>(.*?)</script>", Pattern.CASE_INSENSITIVE),
// src='...'
Pattern.compile("src[\r\n]*=[\r\n]*\\\'(.*?)\\\'", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL),
Pattern.compile("src[\r\n]*=[\r\n]*\\\"(.*?)\\\"", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL),
// lonely script tags
Pattern.compile("</script>", Pattern.CASE_INSENSITIVE),
Pattern.compile("<script(.*?)>", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL),
// eval(...)
Pattern.compile("eval\\((.*?)\\)", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL),
// expression(...)
Pattern.compile("expression\\((.*?)\\)", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL),
// javascript:...
Pattern.compile("javascript:", Pattern.CASE_INSENSITIVE),
// vbscript:...
Pattern.compile("vbscript:", Pattern.CASE_INSENSITIVE),
// onload(...)=...
Pattern.compile("onload(.*?)=", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL)
But, still a few script are not getting filtered specially the one which are appended to a parameter like
url?sourceId=abx;alert('hello');
How do I handle these?
This isn't the right approach. It's mathematically impossible to write a regex capable of correctly punting XSS. (Regex is "regular" but HTML and Javascript are both context-free grammars.)
You can however guarantee that when you switch contexts, (hand off a piece of data that is going to be interpreted) that the data is correctly escaped for that context switch. So, when sending data to a browser, escape it for HTML if its being handled as HTML or as Javascript if its being handled by javascript.
If you DO need to allow HTML/javascript into your application, then you'll want a web-application firewall or a framework like HDIV.