Which algorithm is used in standard ZIP?

user1131997 · Apr 18, 2012

I have searched Google and Wikipedia and read the RFC for ZIP, but I can't find any information about the exact algorithm that ZIP uses.

I have found a claim that ZIP == TAR + GZIP.

But this claim confuses me.

As far as I remember, GZIP uses the LZW algorithm and TAR uses LZMA, so I can't see how ZIP == TAR + GZIP could hold (LZMA + LZW?).

Could you help me with finding the algorithm of ZIP? I want to implement it.

Answer

Jerry Coffin · Apr 18, 2012

Zip provides capabilities roughly equivalent to the combination of tar with gzip.

tar just collects a number of files together into a single file, preserving information about the original files (e.g., paths, dates). Contrary to the statement in the question, it does no compression by itself.

gzip just takes a single file and compresses it.
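To make the division of labour concrete, here is a minimal sketch using Python's standard `tarfile` and `gzip` modules. The file names, contents, and the `make_tar` helper are made up for illustration; the point is only that the tar step bundles and the gzip step compresses.

```python
import gzip
import io
import tarfile


def make_tar(files):
    """Bundle {name: bytes} into an uncompressed tar archive held in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Hypothetical sample data, chosen to be highly compressible.
files = {"a.txt": b"hello " * 1000, "b.txt": b"world " * 1000}

tar_bytes = make_tar(files)            # step 1: tar only bundles, no compression
tgz_bytes = gzip.compress(tar_bytes)   # step 2: gzip compresses the one tar stream

print(len(tar_bytes), len(tgz_bytes))
```

The tar output is at least as large as the combined inputs, and only the gzip step makes it smaller.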

Zip does both of those: it stores a number of constituent files in an archive (again preserving things like paths and dates) and compresses them. Unlike tar + gzip, it compresses each file individually and leaves the "directory" information about the constituent files uncompressed. This makes it easy to work with individual files in the archive (insert, delete, decompress, etc.), but it also means the overall compression is usually not as good.
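The per-file behaviour can be seen with Python's standard `zipfile` module. This is only a sketch with made-up file names and contents: it lists the archive from its directory and then pulls out a single member without touching the other one.

```python
import io
import zipfile

# Hypothetical sample data.
files = {"a.txt": b"hello " * 1000, "b.txt": b"world " * 1000}

buf = io.BytesIO()
with zipfile.ZipFile(buf, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data)  # each member is compressed on its own

with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    # The listing is read from the archive's directory records;
    # no member data is decompressed to produce it.
    infos = zf.infolist()
    for info in infos:
        print(info.filename, info.file_size, info.compress_size, info.compress_type)
    # Decompress one member only.
    b_data = zf.read("b.txt")
```

Each entry carries its own compressed size and compression method, which is what makes inserting or extracting a single file cheap.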

The compression method used by nearly all zip files is DEFLATE (LZ77 combined with Huffman coding), which is also what gzip uses, not LZW. Rather than re-implementing it, you're almost certainly better off downloading the code (extremely portable, very liberal license) from the zlib web site. The zlib web site also has a fairly reasonable explanation of the algorithms. If you really insist on doing this yourself, you probably also want to look at RFC 1950 (the zlib format), RFC 1951 (DEFLATE), and RFC 1952 (the gzip format).
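As a rough illustration that zlib really does handle the data inside a zip file, the sketch below pulls the still-compressed bytes of one member out of an archive and inflates them with Python's `zlib` binding. The `raw_member_bytes` helper and the sample file are hypothetical, and the header parsing assumes a plain, unencrypted member stored with the DEFLATE method.

```python
import io
import struct
import zipfile
import zlib


def raw_member_bytes(archive, name):
    """Return the still-compressed bytes of one member (illustrative helper)."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        info = zf.getinfo(name)
    # Local file header: 30 fixed bytes, followed by the file name and extra field.
    header = archive[info.header_offset:info.header_offset + 30]
    name_len, extra_len = struct.unpack("<HH", header[26:30])
    start = info.header_offset + 30 + name_len + extra_len
    return archive[start:start + info.compress_size]


# Hypothetical sample data.
payload = b"hello zip " * 500
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("hello.txt", payload)

raw = raw_member_bytes(buf.getvalue(), "hello.txt")
# A negative wbits value selects raw DEFLATE with no zlib or gzip wrapper,
# which is how member data is stored inside a zip archive.
restored = zlib.decompress(raw, -15)
print(len(payload), len(raw), restored == payload)
```

The zlib (RFC 1950) and gzip (RFC 1952) formats wrap the same DEFLATE stream in different headers and checksums, so only the container differs between them and zip.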