Does a valid XML file require an XML declaration?

eros picture eros · Aug 10, 2011 · Viewed 156.5k times · Source

I am parsing an XML file using Sax Parser of Xerces.
Is the XML declaration <?xml version="1.0" encoding="UTF-8"?> required?

Answer

Hoylen picture Hoylen · Aug 10, 2011

In XML 1.0, the XML Declaration is optional. See section 2.8 of the XML 1.0 Recommendation, where it says it "should" be used -- which means it is recommended, but not mandatory. In XML 1.1, however, the declaration is mandatory. See section 2.8 of the XML 1.1 Recommendation, where it says "MUST" be used. It even goes on to state that if the declaration is absent, that automatically implies the document is an XML 1.0 document.

Note that in an XML Declaration the encoding and standalone are both optional. Only the version is mandatory. Also, these are not attributes, so if they are present they must be in that order: version, followed by any encoding, followed by any standalone.

<?xml version="1.0"?>
<?xml version="1.0" encoding="UTF-8"?>
<?xml version="1.0" standalone="yes"?>
<?xml version="1.0" encoding="UTF-16" standalone="yes"?>

If you don't specify the encoding in this way, XML parsers try to guess what encoding is being used. The XML 1.0 Recommendation describes one possible way character encoding can be autodetected. In practice, this is not much of a problem if the input is encoded as UTF-8, UTF-16 or US-ASCII. Autodetection doesn't work when it encounters 8-bit encodings that use characters outside the US-ASCII range (e.g. ISO 8859-1) -- avoid creating these if you can.

The standalone indicates whether the XML document can be correctly processed without the DTD or not. People rarely use it. These days, it is a bad to design an XML format that is missing information without its DTD.

Update:

A "prolog error/invalid utf-8 encoding" error indicates that the actual data the parser found inside the file did not match the encoding that the XML declaration says it is. Or in some cases the data inside the file did not match the autodetected encoding.

Since your file contains a byte-order-mark (BOM) it should be in UTF-16 encoding. I suspect that your declaration says <?xml version="1.0" encoding="UTF-8"?> which is obviously incorrect when the file has been changed into UTF-16 by NotePad. The simple solution is to remove the encoding and simply say <?xml version="1.0"?>. You could also edit it to say encoding="UTF-16" but that would be wrong for the original file (which wasn't in UTF-16) or if the file somehow gets changed back to UTF-8 or some other encoding.

Don't bother trying to remove the BOM -- that's not the cause of the problem. Using NotePad or WordPad to edit XML is the real problem!