Most compact way to encode a sequence of random variable length binary codes?

Pyrolistical · Jan 29, 2010 · Viewed 14.1k times

Let's say you have a List<List<Boolean>> and you want to encode that into binary form in the most compact way possible.

I don't care about read or write performance. I just want to use the minimal amount of space. Also, the example is in Java, but we are not limited to the Java system. The length of each "List" is unbounded. Therefore any solution that encodes the length of each list must in itself encode a variable length data type.

Related to this problem is encoding of variable length integers. You can think of each List<Boolean> as a variable length unsigned integer.

Please read the question carefully. We are not limited to the Java system.

EDIT

I don't understand why a lot of the answers talk about compression. I am not trying to do compression per se, just to write random sequences of bits out. The catch is that each sequence of bits has a different length, and order needs to be preserved.

You can think of this question in a different way. Let's say you have an arbitrary list of random unsigned integers (unbounded). How do you encode this list in a binary file?

Research

I did some reading and found that what I am really looking for is a universal code.

Result

I am going to use a variant of Elias omega coding described in the paper "A new recursive universal code of the positive integers".

I now understand that a shorter representation for the small integers is a trade-off against a longer one for the large integers. By simply choosing a universal code with a "large" representation of the very first integer, you save a lot of space in the long run when you need to encode arbitrarily large integers.
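
For reference, here is a minimal sketch of textbook Elias omega coding, not the variant from the paper mentioned above. The class name EliasOmega and the use of strings of '0'/'1' characters to stand in for bits are assumptions made for readability, and long is used only to keep the sketch short; truly unbounded values would need something like BigInteger.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: textbook Elias omega coding over strings of '0'/'1'.
class EliasOmega {
    // Encodes a positive integer. Groups are prepended until the remaining
    // group length drops to 1; a single 0 terminates the code.
    static String encode(long n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        StringBuilder code = new StringBuilder("0");
        while (n > 1) {
            String bits = Long.toBinaryString(n);
            code.insert(0, bits);
            n = bits.length() - 1; // next group stores this group's length minus one
        }
        return code.toString();
    }

    // Decodes a concatenation of codes. Assumes the stream is well formed.
    static List<Long> decodeAll(String stream) {
        List<Long> values = new ArrayList<>();
        int pos = 0;
        while (pos < stream.length()) {
            long n = 1;
            // A leading 1 starts a group of n + 1 bits; a leading 0 ends the code.
            while (stream.charAt(pos) == '1') {
                int groupLen = (int) n + 1;
                n = Long.parseLong(stream.substring(pos, pos + groupLen), 2);
                pos += groupLen;
            }
            pos++; // consume the terminating 0
            values.add(n);
        }
        return values;
    }
}
```

With this sketch, 1 encodes as a single bit while 16 takes 11 bits, which is the small-versus-large trade-off in miniature.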

Answer

recursive · Feb 2, 2010

I am thinking of encoding a bit sequence like this:

head  | value
------+------------------
00001 | 0110100111000011

Head has variable length. Its end is marked by the first occurrence of a 1. Count the number of zeroes in head. The length of the value field will be 2 ^ zeroes (in the example, 4 zeroes give a 16-bit value). Since the length of value is known, this encoding can be repeated. Since the size of head is about log2 of the length of value, the overhead of head converges to 0% as the size of the encoded value increases.
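
A rough sketch of the head/value scheme, under a few assumptions of mine: bits are modelled as strings of '0'/'1' characters, the class name HeadValueCodec is made up, and values whose length is not a power of two are simply rejected, since nothing above says how they would be padded.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for the head | value layout.
class HeadValueCodec {
    // Encodes a value of exactly 2^k bits as: k zeroes, a 1, then the value.
    static String encode(String value) {
        int n = value.length();
        if (n == 0 || Integer.bitCount(n) != 1) {
            throw new IllegalArgumentException("length must be a power of two: " + n);
        }
        int zeroes = Integer.numberOfTrailingZeros(n); // log2(n) for a power of two
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < zeroes; i++) {
            out.append('0');
        }
        return out.append('1').append(value).toString();
    }

    // Decodes a concatenation of encoded values, preserving their order.
    // Assumes the stream is well formed.
    static List<String> decodeAll(String stream) {
        List<String> values = new ArrayList<>();
        int pos = 0;
        while (pos < stream.length()) {
            int zeroes = 0;
            while (stream.charAt(pos) == '0') {
                zeroes++;
                pos++;
            }
            pos++; // skip the 1 that ends head
            int len = 1 << zeroes; // value length is 2 ^ zeroes
            values.add(stream.substring(pos, pos + len));
            pos += len;
        }
        return values;
    }
}
```

Because head tells the decoder exactly where each value ends, encoded values can be concatenated with no separator between them.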

Addendum

If you want to fine-tune the length of value more, you can add another field that stores the exact length of value. The length of the length field could be determined by the length of head. Here is an example with a 9-bit value: head has 4 zeroes, so the length field is 4 bits, and 1001 (9 in binary) is the length of value.

head  | length | value
------+--------+-----------
00001 | 1001   | 011011001
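
The three-field layout could be sketched along these lines. I am assuming that the number of zeroes in head equals the width of the length field, which matches the example but is only one possible reading. The class name HeadLengthValueCodec and the string-of-bits representation are also my own choices, and using int for the length caps it in a way a real unbounded encoder would not.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for the head | length | value layout.
class HeadLengthValueCodec {
    // Encodes a value as: k zeroes, a 1, the value's length in k bits, the value.
    // An empty value needs no length bits, so it becomes a bare "1".
    static String encode(String value) {
        int n = value.length();
        String lengthBits = (n == 0) ? "" : Integer.toBinaryString(n);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lengthBits.length(); i++) {
            out.append('0');
        }
        return out.append('1').append(lengthBits).append(value).toString();
    }

    // Decodes a concatenation of encoded values, preserving their order.
    // Assumes the stream is well formed.
    static List<String> decodeAll(String stream) {
        List<String> values = new ArrayList<>();
        int pos = 0;
        while (pos < stream.length()) {
            int zeroes = 0;
            while (stream.charAt(pos) == '0') {
                zeroes++;
                pos++;
            }
            pos++; // skip the 1 that ends head
            int len = (zeroes == 0)
                    ? 0
                    : Integer.parseInt(stream.substring(pos, pos + zeroes), 2);
            pos += zeroes;
            values.add(stream.substring(pos, pos + len));
            pos += len;
        }
        return values;
    }
}
```

Under this reading the overhead is about 2 * log2(length) + 1 bits per value, and no padding to a power of two is needed.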