Binary run length encoding

avramov picture avramov · Sep 29, 2011 · Viewed 14.5k times · Source

I have a web form, for the contents of which I would like to generate a short representation in Base64. The form, among other things, contains a list of 264 binary values, the greater part of which are going to be 0 at any single time. (They represent regions on a geographical map). Even in Base64, this 264-bit number generates a long, intimidating string. I want to implement run-length encoding, as efficiently as possible. Can you help me with this? I've googled binary RLE, but have found nothing of use.

What I've tried this far - running RLE on the binary string using decimal counts and "A" as a separator denoting a change between 0 and 1, then converting the result from base 11 to base 64. For example:

00000000001111111000000010000000000000000000000001111111110001111010101000000000000000000000000000000000000111111111110111000000000000111111100000001000000000000000000000000111111111000111101010100000000000000000000000000000000000011111111111011100

becomes

10A5A5AA22A7A1A2AAAAAAA34A9AA1A10A5A5AA22A7A1A2AAAAAAA34A9AA1A

which in turn becomes

CNnbr/FxkgbbOw0LNAKgk65P8SdvaTG+t74o

or, in base 62,

6imo7zq1pqr2mqglTHzXwJRAksm7fvHZHWQK

It's better, but I still can't help but doubt if I'm doing something wrong - is using the digit "A" as a separator is the best way to do this?

And another update:

Thanks to @comingstorm, I have shortened the compressed string some more.

ILHHASCAASBYwwccDASYgAEgWDI=

As I mentioned it in the comments, real usage cases would generally result in an even shorter string.

Answer

comingstorm picture comingstorm · Sep 30, 2011

Since you're coding bits, you probably want to use a bit-based RLE instead of a byte-based one. In this context, you should consider Elias gamma coding (or some variant thereof) to efficiently encode your run lengths.

A reasonable first approximation for your encoding format might be:

  • first bit = same as the first bit of the uncompressed string (to set initial polarity)
  • remaining bits: Elias coded lengths of successive bit runs (alternating 1 and 0)

Since you know how many bits are in your uncompressed string, you don't need a termination code; you can just add any necessary binary padding as arbitrary bits.

Note that it is always possible for the run-length "compression" to expand your bit string; if you're concerned about this, you can add another initial bit to indicate whether your data is in compressed or uncompressed format, limiting your compression overhead to 1 bit.