The file in question is not under my control. Most of its byte sequences are valid UTF-8; it is not ISO-8859-1 (or any other encoding). I want to do my best to extract as much information as possible.
The file contains a few illegal byte sequences; those should be replaced with the replacement character.
It's not an easy task; I think it requires some knowledge of the UTF-8 state machine.
Oracle has a wrapper which does what I need:
UTF8ValidationFilter javadoc
Is there something like that available (commercially or as free software)?
Thanks
-stephan
Solution:
// istream is the InputStream of the possibly malformed UTF-8 file.
final BufferedInputStream in = new BufferedInputStream(istream);
final CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
// Substitute the replacement character for invalid input instead of throwing an exception.
charsetDecoder.onMalformedInput(CodingErrorAction.REPLACE);
charsetDecoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
final Reader inputReader = new InputStreamReader(in, charsetDecoder);
java.nio.charset.CharsetDecoder does what you need. This class provides charset decoding with user-definable actions on different kinds of errors (see onMalformedInput() and onUnmappableCharacter()).
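As a minimal sketch of how those two error actions behave, the helper below reads a whole stream as UTF-8 and substitutes U+FFFD for anything that does not decode. The class name LenientUtf8 and the method readLenient are made up for illustration; only the java.io and java.nio.charset calls are standard JDK API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class LenientUtf8 {

    // Reads the entire stream as UTF-8. Malformed byte sequences are
    // replaced with the decoder's replacement string (U+FFFD by default)
    // instead of raising a MalformedInputException.
    public static String readLenient(InputStream in) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, decoder)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
        }
        return out.toString();
    }
}
```

For example, passing a ByteArrayInputStream over the bytes 'a', 0xFF, 'b' should yield a three-character string with U+FFFD in the middle, since 0xFF can never appear in valid UTF-8.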
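CharsetDecoder itself decodes from a ByteBuffer into a CharBuffer rather than writing to a stream, so for stream input you wrap it in an InputStreamReader. If you need bytes again, you can re-encode the decoded characters through an OutputStreamWriter into a java.io.PipedOutputStream, which you can pipe into an InputStream (a java.io.PipedInputStream), effectively creating a filtered InputStream.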
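The piping idea can be sketched roughly as follows, assuming it is acceptable to run the decode and re-encode step on a background thread (piped streams need the writer and reader on different threads). SanitizingStreams and sanitize are hypothetical names, and the error handling is deliberately simplistic.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class SanitizingStreams {

    // Returns an InputStream that yields valid UTF-8 bytes: every malformed
    // sequence in 'raw' is replaced by the encoding of U+FFFD (EF BF BD).
    public static InputStream sanitize(InputStream raw) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        PipedInputStream cleaned = new PipedInputStream();
        PipedOutputStream sink = new PipedOutputStream(cleaned);

        // The pump thread decodes leniently and re-encodes into the pipe.
        Thread pump = new Thread(() -> {
            try (Reader reader = new InputStreamReader(raw, decoder);
                 Writer writer = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
                char[] buf = new char[4096];
                int n;
                while ((n = reader.read(buf)) != -1) {
                    writer.write(buf, 0, n);
                }
            } catch (IOException e) {
                // Sketch only: the error is swallowed here. Closing the writer
                // above still closes the pipe, so the consumer sees end of stream.
            }
        }, "utf8-sanitizer");
        pump.setDaemon(true);
        pump.start();
        return cleaned;
    }
}
```

If the consumer only needs characters, the plain InputStreamReader approach avoids the extra thread and is probably the simpler choice; the pipe is only worthwhile when an API insists on an InputStream.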