XmlReader breaks on UTF-8 BOM

Matt Mills picture Matt Mills · Jun 23, 2010 · Viewed 12.4k times · Source

I have the following XML Parsing code in my application:

    public static XElement Parse(string xml, string xsdFilename)
    {
        var readerSettings = new XmlReaderSettings
        {
            ValidationType = ValidationType.Schema,
            Schemas = new XmlSchemaSet()
        };
        readerSettings.Schemas.Add(null, xsdFilename);
        readerSettings.ValidationFlags |= XmlSchemaValidationFlags.ProcessInlineSchema;
        readerSettings.ValidationFlags |= XmlSchemaValidationFlags.ProcessSchemaLocation;
        readerSettings.ValidationFlags |= XmlSchemaValidationFlags.ReportValidationWarnings;
        readerSettings.ValidationEventHandler +=
            (o, e) => { throw new Exception("The provided XML does not validate against the request's schema."); };

        var readerContext = new XmlParserContext(null, null, null, XmlSpace.Default, Encoding.UTF8);

        return XElement.Load(XmlReader.Create(new StringReader(xml), readerSettings, readerContext));
    }

I am using it to parse strings sent to my WCF service into XML documents, for custom deserialization.

It works fine when I read in files and send them over the wire (the request); I've verified that the BOM is not sent across. In my request handler I'm serializing a response object and sending it back as a string. The serialization process adds a UTF-8 BOM to the front of the string, which causes the same code to break when parsing the response.

System.Xml.XmlException : Data at the root level is invalid. Line 1, position 1.

In the research I've done over the last hour or so, it appears that XmlReader should honor the BOM. If I manually remove the BOM from the front of the string, the response xml parses fine.

Am I missing something obvious, or at least something insidious?

EDIT: Here is the serialization code I'm using to return the response:

private static string SerializeResponse(Response response)
{
    var output = new MemoryStream();
    var writer = XmlWriter.Create(output);
    new XmlSerializer(typeof(Response)).Serialize(writer, response);
    var bytes = output.ToArray();
    var responseXml = Encoding.UTF8.GetString(bytes);
    return responseXml;
}

If it's just a matter of the xml incorrectly containing the BOM, then I'll switch to

var responseXml = new UTF8Encoding(false).GetString(bytes);

but it was not clear at all from my research that the BOM was illegal in the actual XML string; see e.g. c# Detect xml encoding from Byte Array?

Answer

jonp picture jonp · Jun 23, 2010

In my request handler I'm serializing a response object and sending it back as a string. The serialization process adds a UTF-8 BOM to the front of the string, which causes the same code to break when parsing the response.

So you want to prevent the BOM from being added as part of your serialization process. Unfortunately, you don't provide what your serialization logic is.

What you should do is provide a UTF8Encoding instance created via the UTF8Encoding(bool) constructor to disable generation of the BOM, and pass this Encoding instance to whichever methods you're using which are generating your intermediate string.