I'm using Java's DocumentBuilder.parse(InputStream) to parse an XML document. Occasionally, I get malformed XML documents that have extra junk after the final >, which causes a SAXException: Content is not allowed in trailing section. (In the cases I've seen, the junk is simply one or more null bytes.)
I don't care what's after the final >. Is there an easy way to parse an entire XML document in Java and have it ignore any trailing junk?
Note that by "ignore" I don't simply mean to catch and ignore the exception: I mean to ignore the trailing junk, throw no exception, and return the Document object, since the XML up to and including the final > is valid.
Since your sender is presenting you with invalid XML, it needs to be corrected before it hits the parser if you want to avoid this exception. If you can't correct the sender, you'll need a preprocessing step of some sort.
If the situation is simply that you've got extra null bytes after the closing tag, as indicated by one of your responses to another answer, you may be able to handle it easily by wrapping your input stream in a FilterInputStream that you implement to skip null bytes.
If the problem is more complex than just null characters, you'll of course need a more complex filter, which might be difficult.
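A minimal sketch of such a filter follows. The class name NullSkippingInputStream and the helper parse are made up for illustration. It assumes the document is in an encoding where a zero byte never occurs inside valid content (for example UTF-8 or ISO-8859-1); it would corrupt UTF-16 input, where null bytes are part of ordinary characters.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper: drops every 0x00 byte from the wrapped stream.
// Assumption: the XML uses an encoding such as UTF-8 in which a null
// byte is never part of legitimate content.
class NullSkippingInputStream extends FilterInputStream {

    NullSkippingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b;
        do {
            b = super.read();
        } while (b == 0);   // keep reading past null bytes
        return b;           // a real byte, or -1 at end of stream
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n <= 0) {
                return n;   // end of stream
            }
            // Compact the chunk in place, keeping only non-null bytes.
            int kept = 0;
            for (int i = 0; i < n; i++) {
                byte b = buf[off + i];
                if (b != 0) {
                    buf[off + kept++] = b;
                }
            }
            if (kept > 0) {
                return kept;
            }
            // The whole chunk was null bytes: read again rather than return 0.
        }
    }

    // Illustrative usage: parse a stream with the filter in front of it.
    static Document parse(InputStream raw) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new NullSkippingInputStream(raw));
    }
}
```

Because the parser never sees the null bytes, DocumentBuilder.parse returns the Document normally and no exception handling is involved. Note that this sketch removes null bytes everywhere in the stream, not only after the closing tag, and it does not adjust skip() or available().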
If you're using a ContentHandler, you can add a callback to it so that it informs the calling code when the end tag of the root element has been handled. The calling code's exception handler can then simply ignore the exception if the end has already been signalled. At that point anything that had to be done by the parser has likely been done anyway! But this solution doesn't seem to apply to your situation.
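For completeness, here is a rough sketch of that idea using a SAX parser. The names LenientSaxParse and RootTrackingHandler are invented for illustration, and the handler tracks element depth itself rather than using a separate callback. It assumes the parser reports the root's end tag before it reaches the trailing junk. It does not build a Document, which is why it probably doesn't fit a DocumentBuilder-based setup.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: ignore a parse error only if it occurs after the
// root element has already been closed.
class LenientSaxParse {

    // Tracks nesting depth so it knows when the root's end tag has been seen.
    static class RootTrackingHandler extends DefaultHandler {
        private int depth = 0;
        private boolean rootClosed = false;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attrs) {
            depth++;
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (--depth == 0) {
                rootClosed = true;   // the end of the root element was handled
            }
        }

        boolean isRootClosed() {
            return rootClosed;
        }
    }

    static void parse(InputStream in, RootTrackingHandler handler)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try {
            parser.parse(in, handler);
        } catch (SAXParseException e) {
            if (!handler.isRootClosed()) {
                throw e;   // a genuine error inside the document
            }
            // Otherwise the error came from content after the root: ignore it.
        }
    }
}
```

A real handler would do its actual work in startElement, endElement, and characters; the depth counter is only there to tell trailing junk apart from errors inside the document.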