General Computing
compression zip tar 7-zip archiving
Updated Wed, 18 May 2022 04:51:05 GMT

Do 7z archives compress each file individually or compress everything together as one?


I read somewhere (I can't remember where) that compressing files into a .tar.xz gives better overall compression ratios than .7z archives. The reasoning was that archive+compression formats like zip, rar, and 7z compress each file individually, whereas if you create a tar archive and then compress it with a single-stream compressor like gzip, bzip2, or xz, the compression algorithm runs over the whole combined set of data, which lets it better de-duplicate data shared between multiple files.

Since I have a bunch of folders with many duplicated files that I need to compress and store somewhere, I was wondering to what extent this anecdote was actually true, and what the best format to use for this type of situation is in general.




Solution

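Yes: as long as you use the 7z format, 7-Zip by default creates what is known as a "solid" archive rather than a group of separately compressed files. Strictly speaking, the default is solid blocks rather than one fully solid stream, and a .7z archive can also be non-solid (see the comments).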
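To make the solid vs. per-file difference concrete, here is a minimal Python sketch. It uses the standard `lzma` module as a stand-in for 7-Zip's compressor (an assumption for illustration; it does not read or write .7z files), and the test data is made up: five identical files of random bytes, which is the best case for solid compression.

```python
import lzma
import random

def make_files(copies: int = 5, size: int = 20_000) -> list[bytes]:
    """Build `copies` identical files of otherwise incompressible data."""
    rng = random.Random(0)  # fixed seed so the numbers are repeatable
    payload = bytes(rng.getrandbits(8) for _ in range(size))
    return [payload] * copies

def per_file_size(files: list[bytes]) -> int:
    """Non-solid: every file gets its own independent compressed stream."""
    return sum(len(lzma.compress(f)) for f in files)

def solid_size(files: list[bytes]) -> int:
    """Solid: all files are concatenated and compressed as one stream."""
    return len(lzma.compress(b"".join(files)))

files = make_files()
print("per-file:", per_file_size(files), "bytes")
print("solid:   ", solid_size(files), "bytes")
```

The per-file total should come out at roughly five times the solid size, because each independent stream has to store the same random payload again, while the solid stream can refer back to the first copy.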

In fact, 7-Zip (the program) can, if configured, go a step further than formats such as tar.bz2: it can sort the files being compressed so that those with the same extension are grouped together, since they are more likely to contain similar data. As a result, compression can be slightly better than with tar.bz2, because tar simply concatenates files in the order it finds them. That order may place similar files far apart, which matters most when the compression dictionary is small.

From the "Why 7z archives created by new version of 7-Zip can be larger than archives created by old version of 7-Zip?" section of the 7-Zip FAQ:

New versions of 7-Zip (starting from version 15.06) use another file sorting order by default for solid 7z archives.

Old version of 7-Zip (before version 15.06) used file sorting "by type" ("by extension").

New version of 7-Zip supports two sorting orders:

  • sorting by name - default order.
  • sorting by type, if 'qs' is specified in Parameters field in "Add to archive" window, (or -mqs switch for command line version).

You can get big difference in compression ratio for different sorting methods, if dictionary size is smaller than total size of files. If there are similar files in different folders, the sorting "by type" can provide better compression ratio in some cases.
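The effect described in that FAQ entry can be reproduced in miniature. The sketch below is only an illustration under stated assumptions: it uses Python's `lzma` module with a deliberately tiny 64 KiB dictionary rather than 7-Zip itself, the file tree is hypothetical, and "by name" / "by type" are simplified to sorting by path and by extension.

```python
import lzma
import os.path
import random

DICT_SIZE = 64 * 1024  # deliberately tiny dictionary (assumption for the demo)
FILE_SIZE = 48 * 1024  # one file fits in the dictionary, two files do not

def random_bytes(seed: int, size: int) -> bytes:
    """Deterministic, incompressible filler data."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(size))

# Hypothetical tree: the same two files duplicated in two folders.
image = random_bytes(1, FILE_SIZE)
report = random_bytes(2, FILE_SIZE)
tree = {
    "dir1/image.bin": image,
    "dir1/report.txt": report,
    "dir2/image.bin": image,
    "dir2/report.txt": report,
}

def solid_size(paths: list[str]) -> int:
    """Compress the files as one stream, in the given order."""
    data = b"".join(tree[p] for p in paths)
    filters = [{"id": lzma.FILTER_LZMA2, "dict_size": DICT_SIZE}]
    return len(lzma.compress(data, format=lzma.FORMAT_RAW, filters=filters))

by_name = sorted(tree)  # dir1/*, then dir2/*
by_type = sorted(tree, key=lambda p: (os.path.splitext(p)[1], p))  # *.bin, then *.txt

print("by name:", solid_size(by_name), "bytes")
print("by type:", solid_size(by_type), "bytes")
```

Sorted by name, each duplicate sits 96 KiB after its twin, which is out of reach of the 64 KiB dictionary, so nothing is de-duplicated. Sorted by type, the duplicates are adjacent and the second copy of each file compresses to almost nothing. With a dictionary larger than the total data, the order would make little difference.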

As DanielB mentioned in a comment, you can also configure the solid block size to suit your data. You can turn solid mode off to get an old-style "non-solid" archive of individually compressed files with no interdependence between them, make the archive fully solid, or pick a block size in between. In the GUI this is the "Solid block size" setting in the "Add to archive" window.
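For the command-line version, the equivalent settings go through the `-ms` (solid mode) and `-mqs` (sort by type) method switches. Below is a small Python sketch that only builds the argument list and does not run anything. The helper name `build_7z_command` and the archive and folder names are hypothetical, and the switch values reflect my reading of the 7-Zip `-m` switch documentation, so check them against your installed version.

```python
def build_7z_command(archive, paths, solid="on", sort_by_type=False):
    """Return the argument list for a `7z a` call (nothing is executed).

    solid: "off" for a non-solid archive, "on" to enable solid mode with
           7-Zip's default limits, or a block size such as "64m".
    sort_by_type: add -mqs=on to group files by extension.
    """
    cmd = ["7z", "a", "-t7z", f"-ms={solid}"]
    if sort_by_type:
        cmd.append("-mqs=on")
    return cmd + [archive, *paths]

# Hypothetical archive and folder names, for illustration only.
print(build_7z_command("backup.7z", ["folder1", "folder2"], solid="off"))
print(build_7z_command("backup.7z", ["folder1", "folder2"], solid="64m"))
print(build_7z_command("backup.7z", ["folder1", "folder2"], sort_by_type=True))
```

If `7z` is on your PATH, the resulting list can be passed straight to `subprocess.run(cmd, check=True)`.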





Comments (2)

  • +1 – A minor correction: .7z does not imply solid compression. It can be solid, solid blocks or non-solid. The default is solid blocks. — May 02, 2019 at 08:26  
  • +0 – @DanielB thanks for the prod. I had originally thought that it was configurable but my quick Google drew a blank as I was using the wrong search terms and you gave me the right one. I've added the information on how to set the block size to my answer. — May 02, 2019 at 08:38  

