How do I load an org.w3c.dom.Document from XML in a string?

Frank Krueger picture Frank Krueger · Aug 28, 2008 · Viewed 129k times · Source

I have a complete XML document in a string and would like a Document object. Google turns up all sorts of garbage. What is the simplest solution? (In Java 1.5)

Solution Thanks to Matt McMinn, I have settled on this implementation. It has the right level of input flexibility and exception granularity for me. (It's good to know if the error came from malformed XML - SAXException - or just bad IO - IOException.)

public static org.w3c.dom.Document loadXMLFrom(String xml)
    throws org.xml.sax.SAXException, java.io.IOException {
    return loadXMLFrom(new java.io.ByteArrayInputStream(xml.getBytes()));
}

public static org.w3c.dom.Document loadXMLFrom(java.io.InputStream is) 
    throws org.xml.sax.SAXException, java.io.IOException {
    javax.xml.parsers.DocumentBuilderFactory factory =
        javax.xml.parsers.DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    javax.xml.parsers.DocumentBuilder builder = null;
    try {
        builder = factory.newDocumentBuilder();
    }
    catch (javax.xml.parsers.ParserConfigurationException ex) {
    }  
    org.w3c.dom.Document doc = builder.parse(is);
    is.close();
    return doc;
}

Answer

erickson picture erickson · Aug 29, 2008

Whoa there!

There's a potentially serious problem with this code, because it ignores the character encoding specified in the String (which is UTF-8 by default). When you call String.getBytes() the platform default encoding is used to encode Unicode characters to bytes. So, the parser may think it's getting UTF-8 data when in fact it's getting EBCDIC or something… not pretty!

Instead, use the parse method that takes an InputSource, which can be constructed with a Reader, like this:

import java.io.StringReader;
import org.xml.sax.InputSource;
…
        return builder.parse(new InputSource(new StringReader(xml)));

It may not seem like a big deal, but ignorance of character encoding issues leads to insidious code rot akin to y2k.