I have some large base64 encoded data (stored in snappy files in the hadoop filesystem). This data was originally gzipped text data. I need to be able to read chunks of this encoded data, decode it, and then flush it to a GZIPOutputStream.
Any ideas on how I could do this instead of loading the whole base64 data into an array and calling Base64.decodeBase64(byte[]) ?
Am I right if I read the characters till the '\r\n' delimiter and decode it line by line? e.g. :
for (int i = 0; i < byteData.length; i++) {
if (byteData[i] == CARRIAGE_RETURN || byteData[i] == NEWLINE) {
if (i < byteData.length - 1 && byteData[i + 1] == NEWLINE)
i += 2;
else
i += 1;
byteBuffer.put(Base64.decodeBase64(record));
byteCounter = 0;
record = new byte[8192];
} else {
record[byteCounter++] = byteData[i];
}
}
Sadly, this approach doesn't give any human readable output. Ideally, I would like to stream read, decode, and stream out the data.
Right now, I'm trying to put in an inputstream and then copy to a gzipout
byteBuffer.get(bufferBytes);
InputStream inputStream = new ByteArrayInputStream(bufferBytes);
inputStream = new GZIPInputStream(inputStream);
IOUtils.copy(inputStream , gzipOutputStream);
And it gives me a java.io.IOException: Corrupt GZIP trailer
Let's go step by step:
You need a GZIPInputStream
to read zipped data (that and not a GZIPOutputStream
; the output stream is used to compress data). Having this stream you will be able to read the uncompressed, original binary data. This requires an InputStream
in the constructor.
You need an input stream capable of reading the Base64 encoded data. I suggest the handy Base64InputStream
from apache-commons-codec. With the constructor you can set the line length, the line separator and set doEncode=false
to decode data. This in turn requires another input stream - the raw, Base64 encoded data.
This stream depends on how you get your data; ideally the data should be available as InputStream
- problem solved. If not, you may have to use the ByteArrayInputStream
(if binary), StringBufferInputStream
(if string) etc.
Roughly this logic is:
InputStream fromHadoop = ...; // 3rd paragraph
Base64InputStream b64is = // 2nd paragraph
new Base64InputStream(fromHadoop, false, 80, "\n".getBytes("UTF-8"));
GZIPInputStream zis = new GZIPInputStream(b64is); // 1st paragraph
Please pay attention to the arguments of Base64InputStream
(line length and end-of-line byte array), you may need to tweak them.