How to make SAXParser ignore escape codes

Scott · Jan 7, 2012 · Viewed 7.4k times

I am writing a Java program to read an XML file, actually an iTunes library, which is in XML plist format. I have managed to get round most obstacles that this format throws up, except when it encounters text containing an ampersand (&). The XML file represents this ampersand as &amp; and I can only manage to read the text following the & in any particular section of text.

Is there a way to disable detection of escape codes? I am using SAXParser.

Answer

Stephen C · Jan 8, 2012

There is something fishy about what you are trying to do.

If the file format you are trying to parse contains bare ampersand (&) characters, then it is not well-formed XML. In well-formed XML, a literal ampersand is represented as an entity reference (e.g. &amp;).

  • If it is really supposed to be real XML, then there is a bug in whatever wrote / generated the file.

  • If it is not supposed to be real XML (i.e. those ampersands are not a mistake), then you probably shouldn't be trying to parse it using an XML parser.


Ah, I see. The XML is actually correctly encoded, but you didn't get the SO markup right.

It would appear that your real problem is that your characters(...) callback is being called separately for the text before the &, for the (decoded) &, and finally for the text after the &. You simply have to deal with this by joining the text chunks back together.

The javadoc for ContentHandler.characters() says this:

"The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks ...".
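
As a rough illustration of joining the chunks, here is a minimal sketch of a handler that appends every characters(...) call to a StringBuilder and only reads the result in endElement(...). The class name StringElementCollector, the parse helper, and the choice to collect only <string> elements are assumptions made for this example, not something taken from your code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the full text of every <string> element.
class StringElementCollector extends DefaultHandler {
    private final List<String> values = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // Begin each element with an empty buffer.
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may call this several times for one element
        // (for example around a decoded &amp;), so append rather than overwrite.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Only now is the element's text known to be complete.
        if ("string".equals(qName)) {
            values.add(text.toString());
        }
        text.setLength(0);
    }

    // Hypothetical convenience method: parses an XML string and returns the collected values.
    static List<String> parse(String xml) {
        StringElementCollector handler = new StringElementCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return handler.values;
    }
}
```

With this in place, StringElementCollector.parse("<dict><key>Name</key><string>Rock &amp; Roll</string></dict>") returns a list containing the single value Rock & Roll, regardless of how many chunks the parser used to deliver the text.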