What is the best way to load XML from a byte array or string with a document specification, as obtained from an OpenDocument ODT file?

Roland · Aug 21, 2014 · Viewed 14.7k times

(NB: the original question title was: What is the best way to load XML from a string with a document specification?)

I need to get the XML content from an ODT (OpenDocument, LibreOffice) file into an XmlDocument object. The ODT is a zip archive, and I managed to get the content.xml part as a byte array. Converting it to a string seems simple, but I was surprised to find that XmlDocument.LoadXml(string) does not accept my string, which starts with an XML declaration (document specification) line, like:

    <?xml version="1.0" encoding="UTF-8"?>
    <Offices id="0" enabled="false">
      <office />
    </Offices>

The exception is: Data at the root level is invalid. Line 1, position 1.

I wonder if there is a library call to read such a string?

For now I use this improvised function, but having to work at the character level when handling XML documents feels unnecessarily complex:

    /// <summary>
    /// Convert an Xml document in a string, including document specification line(s),
    /// to an XmlDocument object
    /// </summary>
    /// <param name="XmlString"></param>
    /// <returns></returns>
    public static XmlDocument LoadXmlString(string XmlString)
    {
        XmlDocument XmlDoc = new XmlDocument();
        XmlDoc.LoadXml(XmlString.Substring(XmlString.LastIndexOf("?>") + 2));
        return XmlDoc;
    }

Is there a better way?

NB: I am aware of this earlier question, but it addresses the problem of parsing a string, and its solution is to convert the string to a byte array. In my case I should not be parsing a string at all: instead of converting the byte array to a string to begin with, I should skip that step and parse the byte array directly after unzipping the ODT.

Answer

Roland · Aug 21, 2014

With the new, more precise question title, the answer can be very simple:

Just load the unzipped byte array into an XmlDocument, without converting it to a string first.

Simple, and no risk of encoding issues.

The background is that the content.xml part of an ODT file is not a string but an XML document stored as bytes. LibreOffice zipped the XML into the ODT archive without first converting it to a string. The unzipping function does not know what is in the zipped data and simply turns compressed bytes into uncompressed bytes. XmlDocument.Load() does not need a string representation either: it reads the raw bytes and learns from the XML declaration (the document specification line) at the start of the data which encoding to use when parsing the byte array.
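As an illustration of skipping the string step entirely, here is a minimal sketch that hands the unzipped stream straight to XmlDocument.Load(). It assumes the archive is opened with System.IO.Compression.ZipArchive (.NET 4.5 or later), which is my choice and not necessarily the unzip library used above; the class and method names are hypothetical.

```csharp
using System.IO;
using System.IO.Compression; // ZipArchive (assumption: .NET 4.5+ with System.IO.Compression referenced)
using System.Xml;

public static class OdtContentSketch
{
    // Hypothetical helper: loads the content.xml part of an ODT file into an XmlDocument.
    public static XmlDocument LoadContentXml(string odtPath)
    {
        using (FileStream file = File.OpenRead(odtPath))
        using (ZipArchive zip = new ZipArchive(file, ZipArchiveMode.Read))
        {
            // GetEntry returns null when the archive has no such entry.
            ZipArchiveEntry entry = zip.GetEntry("content.xml");
            if (entry == null)
                throw new FileNotFoundException("content.xml not found in archive", odtPath);

            XmlDocument doc = new XmlDocument();
            using (Stream xml = entry.Open())
            {
                // Load(Stream) works on the raw bytes, so the parser can pick
                // the encoding from the data itself; no string is ever created.
                doc.Load(xml);
            }
            return doc;
        }
    }
}
```

A call such as `OdtContentSketch.LoadContentXml("document.odt")` (the file name is a placeholder) would then return the parsed document, with no intermediate byte array or string.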


My original answer:

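As I learned from the (deleted) post of Donal, the explanation given there was that .NET strings are encoded as UTF-16 while the XML declaration specifies UTF-8. Note that LoadXml() normally accepts a string whose declaration says UTF-8, so a likelier culprit for this exact error is a byte order mark at the start of the byte array: it survives decoding as a leading U+FEFF character, which the parser rejects at line 1, position 1. Either way, since I actually started from a byte array, I should NOT try to make a string with:

    string s = Encoding.UTF8.GetString(Bytes);

because LoadXml() rejected the resulting string.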

Instead I need Donal's solution code, simplified to:

    public XmlDocument GetEntryXmlDoc(byte[] Bytes)
    {
        XmlDocument xmlDoc = new XmlDocument();
        using (MemoryStream ms = new MemoryStream(Bytes))
        {
            xmlDoc.Load(ms);
        }
        return xmlDoc;
    }
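If a string is needed anyway (for logging, say), one thing worth checking when LoadXml() fails at line 1, position 1 is a leading byte order mark: Encoding.UTF8.GetString() keeps it as a U+FEFF character, whereas StreamReader skips it while decoding. The sketch below assumes the data is UTF-8; the class and method names are hypothetical, and whether a BOM was really the cause in this case is an assumption on my part.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class BomSketch
{
    // Hypothetical helper: decodes the bytes to a string first, then parses the string.
    // Assumes UTF-8 content; StreamReader drops a leading UTF-8 BOM if one is present.
    public static XmlDocument LoadViaString(byte[] bytes)
    {
        string text;
        using (var reader = new StreamReader(new MemoryStream(bytes), Encoding.UTF8))
        {
            text = reader.ReadToEnd();
        }

        var doc = new XmlDocument();
        // The string may still begin with the XML declaration; only a stray U+FEFF is a problem.
        doc.LoadXml(text);
        return doc;
    }
}
```

Loading the stream directly, as in GetEntryXmlDoc above, remains the simpler route because it leaves encoding detection to the XML parser.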

I would have liked to refer to the earlier post mentioned by others, but I could not easily find the answer to my problem there. That is my own fault, partly out of impatience, because I had just found the answer here.