Is this a valid (well-formed) XML document?
<?xml version="1.0" encoding="UTF-8" ?>
<outer>
<inner>&copy;</inner>
</outer>
At issue is whether the HTML/XHTML "&copy;" entity encoding is valid in an XML document where there is no DTD or schema to define it. An alternative way of expressing the above would be to say this:
<?xml version="1.0" encoding="UTF-8" ?>
<outer>
<inner>©</inner>
</outer>
Which would seem to be valid XML with a UTF-8 encoding.
But is this valid:
<?xml version="1.0" encoding="UTF-8" ?>
<outer>
<inner><![CDATA[&copy;]]></inner>
</outer>
The author of the above intends to indicate to the XML parser that it should pass through the copyright symbol above as the string "&copy;" rather than as a proper Unicode character.
In that respect I find this quote a little confusing: 'New authors of XML documents often misunderstand the purpose of a CDATA section, mistakenly believing that its purpose is to "protect" data from being treated as ordinary character data during processing. [But] Character data is character data, regardless of whether it is expressed via a CDATA section or ordinary markup.' (From Wikipedia)
I am separately looking at a proposed XML format from a second author who has wrapped the content of every tag in CDATA sections, even when the tag can, for example, only contain digits.
Hope an XML guru can help clear up the confusion on the purpose of CDATA.
Thanks!
A CDATA section exists to let you include literal text that would otherwise be interpreted in a special way in an XML document, that is, something that looks like an entity reference or like XML tags. Anything in a CDATA section can also appear in valid XML without a CDATA section; you'll just need to use entity references to encode the various special characters, so that they are treated not as XML markup but as character data that is the content of an element.
So yes, the following is perfectly valid, as long as it is what you intend:
<?xml version="1.0" encoding="UTF-8" ?>
<outer>
<inner><![CDATA[&copy;]]></inner>
</outer>
Here, the value of the inner element is the literal text &copy;, which will not be interpreted by the XML parser as the entity reference for the copyright symbol. You can also do the following:
<?xml version="1.0" encoding="UTF-8" ?>
<outer>
<inner><![CDATA[<normally> this looks <like/> & xml </normally>]]></inner>
</outer>
where the value for the inner element is
<normally> this looks <like/> & xml </normally>
To do this without a CDATA section:
<?xml version="1.0" encoding="UTF-8" ?>
<outer>
<inner>&lt;normally&gt; this looks &lt;like/&gt; &amp; xml &lt;/normally&gt;</inner>
</outer>
which is much less human-readable, but equivalent as far as an XML parser is concerned. If you did this (assuming that the inner element is defined in a schema or DTD as containing a string and not XML) then your XML parser will complain:
<?xml version="1.0" encoding="UTF-8" ?>
<outer>
<inner><normally> this looks <like/> & xml </normally></inner>
</outer>
So you use CDATA or entity escaping to protect the special characters from the XML parser, so that the client of the XML data can get the value of inner, which happens to contain XML markup characters.
Note: To be clear, apart from the bare & (which must be written as &amp; for the document to be well formed), the above example is well-formed XML, but if the schema or DTD says that the element inner contains xsd:string or equivalent, then it is an invalid XML document.
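If you want to check the equivalence of the CDATA form and the entity-escaped form for yourself, here is a minimal sketch using Python's standard-library xml.etree.ElementTree parser. The choice of Python and the variable names are my own for illustration; any conforming XML parser should behave the same way.

```python
import xml.etree.ElementTree as ET

# The same character data expressed two ways: once inside a CDATA
# section, once with the markup characters escaped as entity references.
cdata_doc = """<outer>
<inner><![CDATA[<normally> this looks <like/> & xml </normally>]]></inner>
</outer>"""

escaped_doc = """<outer>
<inner>&lt;normally&gt; this looks &lt;like/&gt; &amp; xml &lt;/normally&gt;</inner>
</outer>"""

# After parsing, the application sees plain character data in both cases.
cdata_text = ET.fromstring(cdata_doc).find("inner").text
escaped_text = ET.fromstring(escaped_doc).find("inner").text

print(cdata_text == escaped_text)
```

Once the document is parsed, the client of the data cannot tell which of the two forms the author used, which is the point of the Wikipedia quote in the question.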
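And no, HTML or XHTML entities such as &copy; are not part of XML itself, which predefines only &lt;, &gt;, &amp;, &apos; and &quot;. Unless the entity is declared in a DTD, a reference to it is an error and your XML parser will reject the document. A numeric character reference such as &#169;, or the literal © character, works without any declaration.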
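As a rough illustration of the three cases from the question, the sketch below again assumes Python's standard-library xml.etree.ElementTree. The helper inner_text is a hypothetical name of my own, not part of any library.

```python
import xml.etree.ElementTree as ET

def inner_text(doc):
    """Return the text of <inner>, or None if the parser rejects the document."""
    try:
        return ET.fromstring(doc).find("inner").text
    except ET.ParseError:
        return None

# An HTML entity with no DTD to declare it: the parser reports an error.
undefined_entity = inner_text("<outer><inner>&copy;</inner></outer>")

# A numeric character reference needs no declaration and yields U+00A9.
numeric_ref = inner_text("<outer><inner>&#169;</inner></outer>")

# Inside CDATA the same six characters are passed through as literal text.
cdata_literal = inner_text("<outer><inner><![CDATA[&copy;]]></inner></outer>")

print(undefined_entity, numeric_ref, cdata_literal)
```

So the CDATA version from the question parses fine, but what the application receives is the six-character string &copy;, not the copyright symbol.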