How to let the SAX parser determine the encoding from the XML declaration?

Allan · Aug 14, 2010 · Viewed 29.1k times

I'm trying to parse XML files from different sources (over which I have little control). Most of them are encoded in UTF-8 and don't cause any problems using the following snippet:

SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
InputSource is = new InputSource(getInputStream());
parser.parse(is, handler);

Since SAX defaults to UTF-8, this is fine. However, some of the documents declare:

<?xml version="1.0" encoding="ISO-8859-1"?>

Even though ISO-8859-1 is declared, SAX still defaults to UTF-8. It uses the correct encoding only if I add:

is.setEncoding("ISO-8859-1");

How can I let SAX automatically detect the correct encoding from the XML declaration without setting it explicitly? I need this because I don't know beforehand what the encoding of the file will be.

Thanks in advance, Allan

Answer

Jarekczek · Sep 4, 2012

Pass an InputStream to the InputSource constructor when you want SAX to autodetect the encoding.
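A minimal sketch of the autodetection case, assuming the JDK's built-in parser. The class name AutodetectEncodingDemo, the helper parseText and the sample document are made up for illustration; only the SAX and JDK calls are real API. The document is encoded as ISO-8859-1 and says so in its declaration, and the parser is given nothing but the raw bytes:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class AutodetectEncodingDemo {

    /** Parses the raw bytes and returns all character data found in the document. */
    static String parseText(byte[] xmlBytes) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try (InputStream in = new ByteArrayInputStream(xmlBytes)) {
            // Byte stream and no setEncoding() call:
            // the parser reads the XML declaration and picks the encoding itself.
            parser.parse(new InputSource(in), handler);
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document; \u00F6 is the letter o with diaeresis.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>J\u00F6rn</name>";
        byte[] latin1Bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(parseText(latin1Bytes));
    }
}
```

If the same InputSource were built from a Reader instead, the declaration would no longer decide the encoding, because the bytes would already have been decoded before the parser saw them.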

If you want to force a specific encoding, use a Reader created with that encoding, or call the setEncoding method on the InputSource.
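Both ways of forcing an encoding could look roughly like this. This is only a sketch: ExplicitEncodingDemo, parseWithReader and parseWithSetEncoding are illustrative names, and the sample document deliberately has no XML declaration, so there is nothing for the parser to detect.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ExplicitEncodingDemo {

    /** Runs the SAX parser on the given source and returns all character data. */
    static String parseText(InputSource source) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return text.toString();
    }

    /** Option 1: decode the bytes yourself and hand the parser a Reader. */
    static String parseWithReader(byte[] xmlBytes) throws Exception {
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(xmlBytes), StandardCharsets.ISO_8859_1);
        return parseText(new InputSource(reader));
    }

    /** Option 2: keep the byte stream but name the encoding on the InputSource. */
    static String parseWithSetEncoding(byte[] xmlBytes) throws Exception {
        InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
        source.setEncoding("ISO-8859-1");
        return parseText(source);
    }

    public static void main(String[] args) throws Exception {
        // No XML declaration here, so the encoding has to come from the caller.
        byte[] latin1Bytes = "<name>J\u00F6rn</name>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(parseWithReader(latin1Bytes));
        System.out.println(parseWithSetEncoding(latin1Bytes));
    }
}
```

Either option only makes sense when you really know the encoding from somewhere else, for example from an HTTP Content-Type header.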

Why? Because encoding autodetection algorithms need the raw bytes, not data that has already been converted to characters.
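To see the difference, here is a small sketch that parses the same ISO-8859-1 bytes twice: once as a raw stream, and once through a Reader that was (wrongly) created as UTF-8. The class and method names are invented for this example, and the exact garbling you get depends on the decoder; with the JDK's InputStreamReader I would expect the undecodable byte to come out as the replacement character U+FFFD.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ReaderOverridesDeclarationDemo {

    /** Runs the SAX parser on the given source and returns all character data. */
    static String parseText(InputSource source) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return text.toString();
    }

    /** Raw bytes: the parser can inspect the declaration and choose ISO-8859-1. */
    static String parseFromStream(byte[] xmlBytes) throws Exception {
        return parseText(new InputSource(new ByteArrayInputStream(xmlBytes)));
    }

    /** Characters: the bytes were already decoded as UTF-8 before the parser ran. */
    static String parseFromUtf8Reader(byte[] xmlBytes) throws Exception {
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(xmlBytes), StandardCharsets.UTF_8);
        return parseText(new InputSource(reader));
    }

    public static void main(String[] args) throws Exception {
        byte[] latin1Bytes = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<name>J\u00F6rn</name>").getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("stream: " + parseFromStream(latin1Bytes));
        System.out.println("reader: " + parseFromUtf8Reader(latin1Bytes));
    }
}
```

In the second call the declaration still says ISO-8859-1, but it arrives too late to help: the damage was done when the Reader decoded the bytes.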

The question in the subject is: how to let the SAX parser determine the encoding from the XML declaration? I found Allan's answer to the question misleading, so I provided this alternative, based on Jörn Horstmann's comment and my later experience.