Compressing a folder with many duplicated files

user972014 · Dec 13, 2014

I have a pretty big folder (~10GB) that contains many duplicated files throughout its directory tree. Many of these files are duplicated up to 10 times. The duplicated files don't reside side by side, but within different sub-directories.

How can I compress the folder to make it small enough?

I tried to use WinRAR in "Best" mode, but it didn't compress it at all. (Pretty strange)

Will zip\tar\cab\7z\ any other compression tool do a better job?

I don't mind letting the tool work for a few hours - but not more.

I'd rather not do it programmatically myself.

Answer

Ara Saahov · Oct 12, 2018

The best option in your case is 7-Zip. Here are the options:

7za a -r -t7z -m0=lzma2 -mx=9 -mfb=273 -md=29 -ms=8g -mmt=off -mmtf=off -mqs=on -bt -bb3 archive_file_name.7z /path/to/files

a - add files to archive

-r - Recurse subdirectories

-t7z - Set type of archive (7z in your case)

-m0=lzma2 - Set compression method to LZMA2. LZMA is the default and general compression method of the 7z format. The main features of the LZMA method:

  • High compression ratio
  • Variable dictionary size (up to 4 GB)
  • Compressing speed: about 1 MB/s on 2 GHz CPU
  • Decompressing speed: about 10-20 MB/s on 2 GHz CPU
  • Small memory requirements for decompressing (depends on the dictionary size)
  • Small code size for decompressing: about 5 KB
  • Supporting multi-threading and P4's hyper-threading

-mx=9 - Sets level of compression. x=0 means Copy mode (no compression). x=9 - Ultra

-mfb=273 - Sets number of fast bytes for LZMA. It can be in the range from 5 to 273. The default value is 32 for normal mode and 64 for maximum and ultra modes. Usually, a big number gives a little bit better compression ratio and slower compression process.

-md=29 - Sets the dictionary size for LZMA. You must specify the size in bytes, kilobytes, or megabytes. The maximum value for the dictionary size is 1536 MB, but the 32-bit version of 7-Zip allows a dictionary of up to 128 MB. Default values for LZMA are 24 (16 MB) in normal mode, 25 (32 MB) in maximum mode (-mx=7) and 26 (64 MB) in ultra mode (-mx=9). If you do not specify any symbol from the set [b|k|m|g], the dictionary size will be calculated as DictionarySize = 2^Size bytes. To decompress a file compressed by the LZMA method with dictionary size N, you need about N bytes of memory (RAM) available.
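To make the 2^Size rule concrete, here is a small sketch that converts an -md exponent into megabytes. The helper name dict_size_mb is made up for illustration and is not part of 7-Zip.

```shell
# Convert a 7-Zip -md exponent (no unit suffix) into a dictionary size in MB.
# dict_size_mb is a hypothetical helper, not a 7-Zip command.
dict_size_mb() {
  # 2^N bytes, divided down to megabytes
  echo $(( (1 << $1) / 1024 / 1024 ))
}

# Exponents mentioned in this answer: the three defaults and the value used here.
for n in 24 25 26 29; do
  echo "-md=$n -> $(dict_size_mb "$n") MB"
done
```

So -md=29 asks for a 512 MB dictionary, eight times the ultra-mode default of 64 MB.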

I use -md=29 because my server has only 16 GB of RAM available. With these settings 7-Zip uses only about 5 GB of memory, regardless of the size of the directory being archived. If I use a bigger dictionary size, the system starts swapping.

-ms=8g - Sets the solid block size to 8 GB. The -ms switch also enables or disables solid mode; the default is s=on. In solid mode, files are grouped together. Usually, compressing in solid mode improves the compression ratio. In your case it is very important to make the solid block size as big as possible.

Limiting the solid block size usually decreases the compression ratio. Updating solid .7z archives can be slow, since it can require some recompression.

-mmt=off - Sets multithreading mode to OFF. You need to switch it off because we need similar or identical files to be processed by the same 7-Zip thread in one solid block. The drawback is slow archiving, no matter how many CPUs or cores your system has.

-mmtf=off - Set multithreading mode for filters to OFF.

-myx=9 - Sets level of file analysis to maximum, analysis of all files (Delta and executable filters). Note that this switch is not part of the command above; add it if you want it.

-mqs=on - Sort files by type in solid archives. To store identical files together.

-bt - show execution time statistics

-bb3 - set output log level
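Before spending hours on compression, it can help to check how much of the tree is really duplicated. The following is a rough sketch, not part of 7-Zip: count_duplicate_groups is a made-up helper name, and it assumes a Unix-like system with find, sha256sum, awk, sort and uniq available.

```shell
# Count how many distinct file contents occur more than once under a directory.
# count_duplicate_groups is a hypothetical helper; it assumes sha256sum is installed.
count_duplicate_groups() {
  # Hash every regular file, keep only the digest column,
  # then count the digests that appear at least twice.
  find "$1" -type f -exec sha256sum {} + \
    | awk '{print $1}' \
    | sort \
    | uniq -d \
    | wc -l \
    | tr -d ' '
}

# Usage (the path is the same placeholder as in the 7za command):
#   count_duplicate_groups /path/to/files
```

A large number here suggests that the big dictionary and solid block settings are worth the time. Keep in mind that a copy can only be matched against an earlier one if both fall within the dictionary distance inside the same solid block.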