Validating XML doc results in "Invalid byte 1 of 1-byte UTF-8 sequence."

Pops · Dec 4, 2012 · Viewed 7k times

I'm validating some XML files against Schematron stylesheets by using Probatron4j, which uses Saxon internally. Most of the time, this works fine, but occasionally, processing crashes with the error

org.xml.sax.SAXParseException: Invalid byte 1 of 1-byte UTF-8 sequence.

My research has shown that this message typically indicates (in no particular order)

  • blatantly invalid data (e.g. attempting to read a ZIP file as if it were an XML file);
  • the presence of byte order marks;
  • the presence of byte sequences that are not legal in UTF-8; or
  • a document that is lying when it claims to be UTF-8-encoded.

None of these applies to the document I'm processing. I've inspected the input in byte array form during program execution, and it doesn't contain a BOM or any non-ASCII characters.
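For anyone who wants to repeat that inspection, a scan along these lines is one way to do it. This is a minimal sketch, not the code from my program; the class and method names (AsciiScan, firstNonAscii) are made up for illustration. Note that it has to run over the whole input, not just the region where the parser reports the failure.

```java
public class AsciiScan {
    /** Returns the index of the first byte outside 7-bit ASCII, or -1 if there is none. */
    public static int firstNonAscii(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            // Java bytes are signed, so any value >= 0x80 shows up as negative.
            if (data[i] < 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] clean = {60, 80, 62};               // "<P>"
        byte[] dirty = {60, 80, (byte) 0xE9, 62};  // 0xE9 is 'é' in ISO-8859-1
        System.out.println(firstNonAscii(clean));
        System.out.println(firstNonAscii(dirty));
    }
}
```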

Processing gets about a fifth of the way through my 30 KB doc before crashing on an unremarkable English sentence (by "unremarkable," I mean that, apart from the CR/LF line breaks, all bytes are between 32 (space) and 122 (lowercase z); in other words, standard keyboard characters). The bytes of the supposedly offending element are at the end of this post.

Oddly, the failing document was generated by removing a few elements from a larger document that gets processed cleanly by the same code.

I know that the exception is being thrown in the parse(InputSource input) method of an object that implements the org.xml.sax.XMLReader interface. According to the Javadoc, SAXException indicates

Any SAX exception, possibly wrapping another exception.

Examining the exception in a debugger shows that there is no wrapped exception.
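The same check can be done without a debugger. The sketch below uses the JDK's default SAX parser rather than the reader Probatron4j or Saxon sets up, and the names ParseProbe and errorLine are hypothetical. It prints the position the parser reports along with whatever getException() returns. For encoding errors the reported position may not be where the bad byte really is, so treat it as a hint only.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ParseProbe {
    /** Returns the line number of the first fatal parse error, or -1 if parsing succeeds. */
    public static int errorLine(byte[] xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            DefaultHandler handler = new DefaultHandler();
            reader.setContentHandler(handler);
            // DefaultHandler.fatalError rethrows, so the exception reaches the catch below.
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new ByteArrayInputStream(xml)));
            return -1;
        } catch (SAXParseException e) {
            System.err.println("line " + e.getLineNumber()
                    + ", column " + e.getColumnNumber() + ": " + e.getMessage());
            // getException() returns the wrapped exception, or null if there is none.
            System.err.println("wrapped: " + e.getException());
            return e.getLineNumber();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // Anything else is unexpected in this sketch; surface it unchanged.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(errorLine("<a/>".getBytes(java.nio.charset.StandardCharsets.UTF_8)));
        System.out.println(errorLine("<a><b></a>".getBytes(java.nio.charset.StandardCharsets.UTF_8)));
    }
}
```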

What could be causing this error?

EDIT:

[60, 80, 97, 114, 97, 103, 114, 97, 112, 104, 62, 69, 120, 101, 99, 117, 116,
 105, 118, 101, 32, 83, 117, 109, 109, 97, 114, 121, 58, 32, 70, 114, 111, 109,
 32, 49, 55, 53, 52, 32, 116, 111, 32, 49, 55, 54, 51, 13, 10, 32, 32, 32, 32,
 32, 32, 32, 32, 32, 32, 32, 32, 69, 117, 114, 111, 112, 101, 32, 97, 110, 100,
 32, 116, 104, 101, 32, 65, 109, 101, 114, 105, 99, 97, 115, 32, 119, 101, 114,
 101, 32, 99, 97, 117, 103, 104, 116, 32, 117, 112, 32, 105, 110, 32, 97, 32, 99,
 111, 110, 102, 108, 105, 99, 116, 32, 98, 101, 116, 119, 101, 101, 110, 32, 69,
 110, 103, 108, 97, 110, 100, 44, 32, 117, 110, 100, 101, 114, 32, 75, 105, 110,
 103, 32, 71, 101, 111, 114, 103, 101, 32, 73, 73, 44, 32, 97, 110, 100, 32, 70,
 114, 97, 110, 99, 101, 44, 32, 117, 110, 100, 101, 114, 32, 75, 105, 110, 103,
 32, 76, 111, 117, 105, 115, 32, 88, 86, 46, 32, 73, 110, 32, 69, 117, 114, 111,
 112, 101, 13, 10, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 116, 104, 105,
 115, 32, 112, 101, 114, 105, 111, 100, 32, 119, 97, 115, 32, 107, 110, 111, 119,
 110, 32, 97, 115, 32, 116, 104, 101, 32, 83, 101, 118, 101, 110, 32, 89, 101,
 97, 114, 115, 39, 32, 87, 97, 114, 59, 32, 105, 110, 32, 78, 111, 114, 116, 104,
 32, 65, 109, 101, 114, 105, 99, 97, 32, 105, 116, 32, 99, 97, 109, 101, 32, 116,
 111, 32, 98, 101, 32, 99, 97, 108, 108, 101, 100, 32, 116, 104, 101, 32, 70,
 114, 101, 110, 99, 104, 32, 97, 110, 100, 32, 73, 110, 100, 105, 97, 110, 32,
 87, 97, 114, 46, 32, 73, 116, 32, 119, 97, 115, 32, 97, 32, 99, 111, 110, 102,
 108, 105, 99, 116, 32, 111, 118, 101, 114, 13, 10, 32, 32, 32, 32, 32, 32, 32,
 32, 32, 32, 32, 32, 116, 114, 97, 100, 101, 32, 97, 110, 100, 32, 108, 97, 110,
 100, 46, 60, 47, 80, 97, 114, 97, 103, 114, 97, 112, 104, 62]

The exception is thrown after the third appearance of 109.
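To make that position concrete, here is a small sketch that finds the third 109 in the first 34 values of the dump and decodes everything up to it. The class and method names are invented for this example.

```java
import java.nio.charset.StandardCharsets;

public class ByteDumpProbe {
    // The first 34 values of the dump above, copied verbatim.
    public static final byte[] PREFIX = {
        60, 80, 97, 114, 97, 103, 114, 97, 112, 104, 62, 69, 120, 101, 99, 117, 116,
        105, 118, 101, 32, 83, 117, 109, 109, 97, 114, 121, 58, 32, 70, 114, 111, 109
    };

    /** Returns the index of the n-th occurrence of value (1-based), or -1 if there are fewer. */
    public static int indexOfNth(byte[] data, byte value, int n) {
        int seen = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == value && ++seen == n) {
                return i;
            }
        }
        return -1;
    }

    /** Decodes data[0..index] (inclusive) as ASCII text. */
    public static String textThrough(byte[] data, int index) {
        return new String(data, 0, index + 1, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        int third = indexOfNth(PREFIX, (byte) 109, 3);
        System.out.println(third);                      // prints 33
        System.out.println(textThrough(PREFIX, third)); // prints <Paragraph>Executive Summary: From
    }
}
```

So the parser gives up right after the word "From", which is plain ASCII. That fits the idea that the real problem lies somewhere other than the reported spot.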

Answer

Pops · Dec 4, 2012

I've sort-of solved this. Java uses UTF-16 internally for its String objects, and the String class's no-argument getBytes() method produces bytes in the platform's default charset unless you explicitly specify that you want UTF-8 (or some other charset that it supports). If the default is not UTF-8, any non-ASCII character in the String comes out as bytes that a UTF-8 parser may reject.
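A minimal sketch of the difference, using a made-up sample string and a hypothetical helper named encode. ISO-8859-1 stands in here for a non-UTF-8 platform default; your default may differ. On JDK 18 and later the default charset is UTF-8 unless overridden, so the no-argument getBytes() behaves differently there than it did in 2012.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GetBytesDemo {
    /** Encodes text with an explicitly chosen charset instead of the platform default. */
    public static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café": one non-ASCII character

        // Depends on the JVM and platform; this is the call that bit me.
        System.out.println("default (" + Charset.defaultCharset() + "): "
                + Arrays.toString(text.getBytes()));

        // UTF-8 encodes 'é' as two bytes: 0xC3 0xA9.
        System.out.println("UTF-8:      "
                + Arrays.toString(encode(text, StandardCharsets.UTF_8)));

        // ISO-8859-1 encodes 'é' as the single byte 0xE9, which is not valid UTF-8 on its own.
        System.out.println("ISO-8859-1: "
                + Arrays.toString(encode(text, StandardCharsets.ISO_8859_1)));
    }
}
```

Passing the charset explicitly, as in text.getBytes(StandardCharsets.UTF_8), removes the dependence on the machine the code runs on.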

I'm not completely sure how or why this solves the problem, since the bytes near the spot where the exception was thrown — the ones at the end of the question — were all valid UTF-8 bytes on their own, but it does seem to have fixed things.

The only potential cause I can think of for this is that I missed an invalid byte earlier in the file that screwed things up but didn't cause an immediate crash. I'm reading the bytes from a ByteArrayInputStream, so it's possible that the program read a big chunk from the buffer all at once, which set the pos marker to a spot beyond where the hypothetical bad character was located.
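One way to test that theory, sketched here under the assumption that the whole document is available as a byte array, is to run the bytes through a strict UTF-8 decoder and ask where it stops. Utf8Check and firstMalformedOffset are names I made up. The decoder calls are from java.nio.charset.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    /** Returns the offset of the first malformed UTF-8 sequence, or -1 if the data decodes cleanly. */
    public static int firstMalformedOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // On an error the input buffer's position marks the start of the bad sequence.
                return in.position();
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without complaint
            }
            out.clear(); // overflow: discard the decoded chars and keep going
        }
    }

    public static void main(String[] args) {
        byte[] good = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] bad = "caf\u00e9<p/>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(firstMalformedOffset(good));
        System.out.println(firstMalformedOffset(bad));
    }
}
```

If this returns an offset well before the spot where the parser complained, that would support the buffering explanation.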