In one of our projects we are using Kafka with Avro to transfer data across applications. Data is added to an Avro object and the object is binary-encoded before being written to Kafka. We use binary encoding because it is generally described as a more compact representation than other formats.
The data is usually a JSON string, and when it is saved to a file it uses up to 10 MB of disk. However, when the file is compressed (.zip), it takes only a few KB. We are concerned about storing such data in Kafka, so we are trying to compress it before writing to a Kafka topic.
When we measure the length of the binary-encoded message (i.e. the length of the byte array), it is proportional to the length of the data string. So I assume binary encoding is not reducing the size at all.
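The measurement described above can be sketched roughly as follows (a minimal example with a hypothetical one-field schema named Msg, not our actual schema):

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class EncodedSizeDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical single-field schema; a real schema would have many fields.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Msg\",\"fields\":"
          + "[{\"name\":\"payload\",\"type\":\"string\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("payload", "some long JSON string ...");

        // Binary-encode the record and measure the resulting byte array.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();

        // The size grows roughly linearly with the payload length, because
        // binary encoding stores the string bytes as-is plus a length prefix.
        System.out.println(out.toByteArray().length);
    }
}
```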
Could someone tell me whether binary encoding compresses data? If not, how can I apply compression?
Thanks!
Does binary encoding compress data?
Yes and no; it depends on your data.
According to Avro binary encoding: yes, in that it stores the schema only once per .avro file, regardless of how many records that file contains, which saves space by not repeating the JSON key names for every record. Avro serialization also compresses a little by storing int and long values with variable-length zig-zag coding (a win only for small values). Beyond that, Avro does not "compress" data.
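To illustrate the zig-zag trick (a simplified sketch, not Avro's actual implementation): small magnitudes, positive or negative, map to small unsigned values, so the subsequent varint stays short.

```java
public class ZigZagDemo {
    // Zig-zag encode a 32-bit int: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static int zigZagEncode(int n) {
        return (n << 1) ^ (n >> 31);
    }

    // Number of bytes a 7-bits-per-byte varint needs for the encoded value.
    static int varIntSize(int n) {
        int encoded = zigZagEncode(n);
        int bytes = 1;
        while ((encoded & ~0x7F) != 0) {
            encoded >>>= 7;
            bytes++;
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(zigZagEncode(-1)); // 1
        System.out.println(zigZagEncode(1));  // 2
        System.out.println(varIntSize(63));   // fits in 1 byte
        System.out.println(varIntSize(64));   // needs 2 bytes
    }
}
```

Large int or long values get no benefit; they can even take more bytes than a fixed-width encoding.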
No, in that in some extreme cases Avro-serialized data can be bigger than the raw data. For example, an .avro file containing a single Record with only one string field: the schema overhead can outweigh the saving from not storing the key name.
If not, how can I apply compression?
According to Avro codecs, Avro has a built-in compression codec and optional ones. Just add one line while writing object container files:
dataFileWriter.setCodec(CodecFactory.deflateCodec(6)); // using deflate
or
dataFileWriter.setCodec(CodecFactory.snappyCodec()); // using snappy codec
To use snappy you need to include the snappy-java library in your dependencies.
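Put together, a minimal container-file write with compression might look like this (a sketch with a hypothetical Msg schema and output path; note that setCodec must be called before create):

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class CompressedAvroWrite {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema; substitute your own.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Msg\",\"fields\":"
          + "[{\"name\":\"payload\",\"type\":\"string\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("payload", "{\"key\":\"a large JSON string ...\"}");

        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.setCodec(CodecFactory.deflateCodec(6)); // compress each block with deflate
        writer.create(schema, new File("messages.avro")); // hypothetical output path
        writer.append(record);
        writer.close();
    }
}
```

The codec compresses each block of records, so repetitive JSON payloads should shrink roughly as much as they do under .zip.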