Which are the HTML, and XML, special characters?

xml http special-characters htmlspecialchars entityreference

Ian Boyd · Aug 30, 2011 · Viewed 33.3k times · Source

What are the special reserved character entities in HTML and in XML?

The information that I have says:

HTML:

& (replace with &amp;)
< (replace with &lt;)
> (replace with &gt;)
" (replace with &quot;)
' (replace with &apos;)

XML:

< (replace with &lt;)
> (replace with &gt;)
& (replace with &amp;)
' (replace with &apos;)
" (replace with &quot;)

But I cannot find documentation on either of these.

The W3C does mention, in Extensible Markup Language (XML) 1.0 (Fifth Edition), certain predefined entity references. But it says only that these entities are predefined (in the same way that &copy; is predefined), not that they must be escaped:

4.6 Predefined Entities

[Definition: Entity and character references may both be used to escape the left angle bracket, ampersand, and other delimiters. A set of general entities (amp, lt, gt, apos, quot) is specified for this purpose. Numeric character references may also be used; they are expanded immediately when recognized and must be treated as character data, so the numeric character references "&#60;" and "&#38;" may be used to escape < and & when they occur in character data.]

What characters must be escaped into entity references in HTML? What characters must be escaped into entity references in XML?

Update:

From Extensible Markup Language (XML) 1.0 (Fifth Edition):

2.4 Character Data and Markup

The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively.

The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.

To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".

I read this as saying that, in XML, the following must be escaped:

< (&lt;) must be escaped
& (&amp;) must be escaped

The following may be escaped, but must be when it appears as ]]>:

> (&gt;) must be escaped if it appears as ]]>

And that ' and " don't have to be escaped at all, unless you want to have quotes inside quoted attributes.
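As a minimal sketch of that reading of section 2.4, the rules can be written as two small Python functions, one for element content and one for attribute values. The function names and the `quote` parameter are hypothetical, chosen here for illustration; they are not part of any specification or library.

```python
def escape_xml_text(s: str) -> str:
    """Escape character data for use as XML element content."""
    # The ampersand goes first, so the entities added below are not re-escaped.
    s = s.replace("&", "&amp;")
    s = s.replace("<", "&lt;")
    # A lone ">" may stay literal; only the sequence "]]>" must be escaped.
    s = s.replace("]]>", "]]&gt;")
    return s


def escape_xml_attr(s: str, quote: str = '"') -> str:
    """Escape a value that will be wrapped in the given quote character."""
    if quote not in ('"', "'"):
        raise ValueError("quote must be a single or double quote")
    s = s.replace("&", "&amp;").replace("<", "&lt;")
    # Only the quote character that delimits the attribute needs escaping.
    if quote == '"':
        s = s.replace('"', "&quot;")
    else:
        s = s.replace("'", "&apos;")
    return s


print(escape_xml_text("if a < b && c then ]]> done"))
# if a &lt; b &amp;&amp; c then ]]&gt; done
print(escape_xml_attr('He said "hi" & left'))
# He said &quot;hi&quot; &amp; left
print(escape_xml_attr("it's", quote="'"))
# it&apos;s
```

Many libraries simply escape > everywhere, which is also allowed; the sketch only shows the minimum that this reading requires.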

From HTML 4.01 Specification, HTML Document Representation:

5.3.2 Character entity references

Authors wishing to put the "<" character in text should use "&lt;" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter).

Similarly, authors should use "&gt;" (ASCII decimal 62) in text instead of ">" to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when it appears in quoted attribute values.

Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.

Some authors use the character entity reference "&quot;" to encode instances of the double quote mark (") since that character may be used to delimit attribute values.

HTML is much more wishy-washy on the rules, but it sounds like I should:

replace < with &lt;
replace > with &gt;
replace & with &amp;
replace " with &quot;

And if " can be replaced with an entity reference (&quot;), I should also replace ' with &apos;.
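For comparison, Python's standard-library `html.escape` takes this conservative "escape all five" approach by default. One detail worth noting: it emits the numeric reference `&#x27;` for the apostrophe rather than `&apos;`, which is not among the named entities that HTML 4.01 defines. The sample string below is made up for illustration.

```python
import html

raw = """<a href="x">Tom & Jerry's</a>"""

# Default (quote=True): escapes &, <, >, " and '.
escaped_all = html.escape(raw)
print(escaped_all)
# &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&#x27;s&lt;/a&gt;

# quote=False: escapes only &, < and >, leaving both quote characters alone.
escaped_text = html.escape(raw, quote=False)
print(escaped_text)
# &lt;a href="x"&gt;Tom &amp; Jerry's&lt;/a&gt;
```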

Update Two

From HTML5 - A vocabulary and associated APIs for HTML and XHTML:

8.3 Serializing HTML fragments

Escaping a string (for the purposes of the algorithm above) consists of running the following steps:

Replace any occurrence of the "&" character by the string "&amp;".

Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".

If the algorithm was invoked in the attribute mode, replace any occurrences of the """ character by the string "&quot;".

If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "&lt;", and any occurrences of the ">" character by the string "&gt;".
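The quoted steps translate almost line for line into code. The following is a rough Python sketch; the function name `escape_html5` and the `attribute_mode` flag are hypothetical names picked for this example, not an API defined by the specification.

```python
def escape_html5(s: str, attribute_mode: bool = False) -> str:
    """Sketch of the string-escaping steps from the HTML5 serialization algorithm."""
    # Step 1: ampersands first, so later entities are not double-escaped.
    s = s.replace("&", "&amp;")
    # Step 2: U+00A0 NO-BREAK SPACE becomes a named reference.
    s = s.replace("\u00a0", "&nbsp;")
    if attribute_mode:
        # Step 3: inside attribute values only the double quote is escaped.
        s = s.replace('"', "&quot;")
    else:
        # Step 4: in text content only the angle brackets are escaped.
        s = s.replace("<", "&lt;").replace(">", "&gt;")
    return s


sample = 'a<b & "c"\u00a0d'
print(escape_html5(sample))
# a&lt;b &amp; "c"&nbsp;d
print(escape_html5(sample, attribute_mode=True))
# a<b &amp; &quot;c&quot;&nbsp;d
```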

I read the quoted steps as saying, for HTML:

& by &amp; always
U+00A0 (no-break space) by &nbsp; always
" by &quot; if it's inside an attribute
< by &lt; if it's not in an attribute (i.e. attributes can contain <)
> by &gt; if it's not in an attribute (i.e. attributes can contain >)

Answer

First, you're comparing an HTML 4.01 specification with an HTML 5 one. HTML5 ties in more closely with XML than HTML 4.01 ever did (that's why we have XHTML), so this answer will stick to HTML 5 and XML.

Your quoted references are all consistent on the following points:

< should always be represented with &lt; when not indicating a processing instruction
> should always be represented with &gt; when not indicating a processing instruction
& should always be represented with &amp;
except when within <![CDATA[ ]]> (which only applies to XML)

I agree 100% with this. You never want the parser to mistake literals for instructions, so it's a solid idea to always encode these characters (the space is a separate case; see below). Good parsers know that anything contained within <![CDATA[ ]]> is not an instruction, so the encoding is not necessary there.
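A quick way to see the CDATA exception in action is to parse the same text three ways with Python's `xml.etree.ElementTree`. This is only an illustrative check with made-up sample strings, not a statement about every parser.

```python
import xml.etree.ElementTree as ET

# Inside a CDATA section, < and & are taken literally.
cdata = ET.fromstring("<t><![CDATA[a < b && c]]></t>")

# Outside CDATA, the same text has to be escaped.
escaped = ET.fromstring("<t>a &lt; b &amp;&amp; c</t>")

print(cdata.text == escaped.text)  # True
print(cdata.text)                  # a < b && c

# Leaving the characters unescaped outside CDATA is a well-formedness error.
try:
    ET.fromstring("<t>a < b && c</t>")
    unescaped_ok = True
except ET.ParseError:
    unescaped_ok = False
print(unescaped_ok)  # False
```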

In practice, I never encode ' or " unless

it appears within the value of an attribute (XML or HTML)
it appears within the text of XML tags. (<tag>"Yoinks!", he said.</tag>)

Both specifications also agree with this.
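To check how a parser treats quotes in each of those two positions, one option is to parse both forms and compare. The sketch below uses Python's `xml.etree.ElementTree`; the helper name `parses` is hypothetical. In this parser the literal and escaped forms of a quote in element text yield the same string, while a literal double quote inside a double-quoted attribute value is rejected.

```python
import xml.etree.ElementTree as ET


def parses(xml_string: str) -> bool:
    """Return True if the string is accepted as well-formed XML."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False


# Element text: literal and escaped quotes give the same result.
literal = ET.fromstring('<tag>"Yoinks!", he said.</tag>')
escaped = ET.fromstring("<tag>&quot;Yoinks!&quot;, he said.</tag>")
print(literal.text == escaped.text)  # True

# Attribute value: the delimiting quote character must be escaped.
attr = ET.fromstring('<tag title="&quot;Yoinks!&quot;"/>')
print(attr.get("title"))                    # "Yoinks!"
print(parses('<tag title=""Yoinks!""/>'))   # False
```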

So, the only point of contention is the no-break space (&nbsp;). The only mention of it in either specification is in the HTML5 serialization algorithm. When you are not serializing, you should always use the literal character. Unless you are writing your own parser, I don't see the need to do any kind of serialization, so this is beside the point.
