When do I need to escape an HTML string?

M Sach · Feb 8, 2013 · Viewed 35.5k times

In my legacy project I can see escapeHtml being called before a string is sent to the browser.

StringEscapeUtils.escapeHtml(stringBody);

I know from the API doc what escapeHtml does. Here is the example given:

For example: 
"bread" & "butter"
becomes: 
&quot;bread&quot; &amp; &quot;butter&quot;.

My understanding is that when we send the string after escaping the HTML, it is the browser's responsibility to convert the entities back to the original characters. Is that right?

But I do not understand why and when escaping is required. What happens if we send the string body without escaping the HTML? What is the cost if we don't call escapeHtml before sending it to the browser?

Answer

Ted Hopp · Feb 8, 2013

I can think of several possibilities to explain why sometimes a string is not escaped:

  • Perhaps the original programmer was confident that at certain places the string had no special characters. (In my opinion this would be bad programming practice; it costs very little to escape a string as protection against future changes.)
  • The string was already escaped at that point in the code. You definitely don't want to escape a string twice; the user will end up seeing the escape sequence instead of the intended text.
  • The string was the actual HTML itself. You don't want to escape the HTML; you want the browser to process it!
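To make the double-escaping hazard concrete, here is a minimal sketch. The escapeHtml method below is a hypothetical hand-rolled stand-in (it only handles &, <, > and "), not the Commons Lang implementation, which also handles other characters; it is just enough to show what happens when a string is escaped twice.

```java
public class DoubleEscapeDemo {

    // Hypothetical minimal escaper, assumed here as a stand-in for
    // StringEscapeUtils.escapeHtml. It covers only & < > and ".
    public static String escapeHtml(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "\"bread\" & \"butter\"";

        // Escaped once: the browser decodes the entities and shows the original text.
        String once = escapeHtml(raw);
        System.out.println(once);

        // Escaped twice: every & from the first pass becomes &amp;, so the
        // browser decodes only one layer and the user sees the entity text.
        String twice = escapeHtml(once);
        System.out.println(twice);
    }
}
```

With the twice-escaped string, the browser would render the literal text &quot;bread&quot; &amp; &quot;butter&quot; rather than the quoted words.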

EDIT - The reason for escaping is that special characters like & and < can end up causing the browser to display something other than what you intended. A bare & is technically an error in the HTML. Most browsers try to deal intelligently with such errors and will display them correctly in most cases. (This will almost certainly happen with your example text if the string is text in a <div>, for instance.) However, because it is bad markup, some browsers will not work well; assistive technologies (e.g., text-to-speech) may fail; and there may be other problems.

There are several cases that will fail despite the best efforts of the browser to recover from bad markup. If your sample string were an attribute value, escaping the quote marks would be absolutely required. There's no way that a browser is going to correctly handle something like:

<img alt=""bread" & "butter"" ... >

The general rule is that any character that is not markup but might be mistaken for markup needs to be escaped.
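As a rough illustration of the attribute case, the sketch below escapes the sample string before placing it in a double-quoted alt attribute. The class and method names are made up for this example, and the escaper deliberately handles only the two characters that matter inside a double-quoted attribute value.

```java
public class AttributeEscapeDemo {

    // Escape a value destined for a double-quoted attribute.
    // The ampersand must be replaced first; otherwise the & inside
    // the freshly inserted &quot; would itself be escaped.
    public static String escapeAttribute(String value) {
        return value.replace("&", "&amp;").replace("\"", "&quot;");
    }

    // Build an alt="..." attribute from arbitrary text (hypothetical helper).
    public static String altAttribute(String text) {
        return "alt=\"" + escapeAttribute(text) + "\"";
    }

    public static void main(String[] args) {
        String text = "\"bread\" & \"butter\"";
        // The quotes inside the value no longer terminate the attribute.
        System.out.println("<img " + altAttribute(text) + ">");
    }
}
```

Here the only unescaped quote marks left in the output are the two that delimit the attribute, so the browser can tell where the value starts and ends.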

Note that there are several contexts in which text can appear within an HTML document, and they have separate requirements for escaping. The following should be escaped:

  • All characters that have no representation in the character set of the document (unlikely if you are using UTF-8, but that's not always the case)
  • Within attribute values, quote marks (' or ", whichever one matches the delimiters used for the attribute value itself) and the ampersand (&), but not <
  • Within text nodes, only & and <
  • Within href values, characters that need escaping in a URL (and sometimes these need to be doubly escaped so they are still escaped after the browser unescapes them once)
  • Within a CDATA block, generally nothing (at the HTML level).
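The text-node and href cases in the list above could be sketched roughly as follows. All names here (ContextEscapeDemo, the /search path, the q and lang parameters) are invented for illustration. Note that URLEncoder performs form-style encoding (a space becomes +), which is an assumption that suits query-string values but not every part of a URL.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class ContextEscapeDemo {

    // Text nodes: only & and < need escaping; quotes are left alone.
    public static String escapeText(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;");
    }

    // Double-quoted attribute values: & and the matching quote mark.
    public static String escapeAttribute(String s) {
        return s.replace("&", "&amp;").replace("\"", "&quot;");
    }

    // href values need two layers: URL-encode the data first, then
    // HTML-escape the finished URL for the attribute it sits in.
    public static String searchLink(String query) {
        String encoded;
        try {
            encoded = URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not occur.
            throw new IllegalStateException(e);
        }
        String url = "/search?q=" + encoded + "&lang=en"; // hypothetical URL
        return "<a href=\"" + escapeAttribute(url) + "\">";
    }

    public static void main(String[] args) {
        System.out.println(escapeText("1 < 2 & \"ok\""));
        System.out.println(searchLink("bread & butter"));
    }
}
```

The & inside the query value is protected by URL encoding (%26), while the & that separates the parameters is protected by HTML escaping (&amp;). After the browser unescapes the attribute once, the URL it requests still carries the encoded value.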

Finally, aside from the hazard of double-escaping, the cost of escaping all text is minimal: a tiny bit of extra processing and a few extra bytes on the network.