How to find a good/optimal dictionary for zlib 'setDictionary' when processing a given set of data?

Miro picture Miro · Jan 6, 2010 · Viewed 7.9k times · Source

I have a (huge) set of similar data files. The set is constantly growing. The size of a single file is about 10K. Each file must be compressed on its own. The compression is done with the zlib library, which is used by the java.util.zip.Deflater class. When passing a dictionary to the Deflate algorithm using setDictionary, I can improve the compression ratio.

Is there a way (algorithm) to find the 'optimal' dictionary, i.e. a dictionary with the overall optimal compression ratio?

See zlib manual

Answer

alecco picture alecco · Feb 28, 2010

John Reiser explained on comp.compression:

For the dictionary: make a histogram of short substrings, sort by payoff (number of occurrences times number of bits saved when compressed) and put the highest-payoff substrings into the dictionary. For example, if k is the length of the shortest substring that can be compressed (usually 3==k or 2==k), then make a histogram of all the substrings of lengths k, 1+k, 2+k, and 3+k. Of course there is some art to placing those substrings into the dictionary, taking advantage of substrings, overlapping, short strings nearer to the high-address end, etc.

The Linux kernel uses a similar technique to compress names of symbols that are used for printing backtraces of the subroutine calling stack. See the file scripts/kallsyms.c. For instance, https://code.woboq.org/linux/linux/scripts/kallsyms.c.html

The zlib manual recommends to place the most common ocurrences at the end of the dictionary.

The dictionary should consist of strings (byte sequences) that are likely to be encountered later in the data to be compressed, with the most commonly used strings preferably put towards the end of the dictionary. Using a dictionary is most useful when the data to be compressed is short and can be predicted with good accuracy; the data can then be compressed better than with the default empty dictionary.

This is because LZ77 has a sliding window algorithm, so the later substrings will be reachable further on your stream of data than the first few.

I'd play with generating the dictionary with a higher level language with good support of strings. A crude JavaScript example:

var str = "The dictionary should consist of strings (byte sequences) that"
    + " are likely to be encountered later in the data to be compressed,"
    + " with the most commonly used strings preferably put towards the "
    + "end of the dictionary. Using a dictionary is most useful when the"
    + " data to be compressed is short and can be predicted with good"
    + " accuracy; the data can then be compressed better than with the "
    + "default empty dictionary.";
// Extract words, remove punctuation (extra: replace(/\s/g, " "))
var words = str.replace(/[,\;.:\(\)]/g, "").split(" ").sort();
var  wcnt = [], w = "", cnt = 0; // pairs, current word, current word count
for (var i = 0, cnt = 0, w = ""; i < words.length; i++) {
    if (words[i] === w) {
        cnt++; // another match
    } else {
        if (w !== "")
            wcnt.push([cnt, w]); // Push a pair (count, word)
        cnt = 1; // Start counting for this word
        w = words[i]; // Start counting again
    }
}
if (w !== "")
    wcnt.push([cnt, w]); // Push last word
wcnt.sort(); // Greater matches at the end
for (var i in wcnt)
    wcnt[i] = wcnt[i][1]; // Just take the words
var dict = wcnt.join("").slice(-70); // Join the words, take last 70 chars

Then dict is a string of 70 chars with:

rdsusedusefulwhencanismostofstringscompresseddatatowithdictionarybethe

You can try it copy-paste-run here (add: "print(dict)")

That's just whole words, not substrings. Also there are ways to overlap common substrings to save space on the dictionary.