wimlib
the open source Windows Imaging (WIM) library
- Compression algorithms
wimlib supports compression and decompression in all of the compression formats known to be used in WIM archives:
- XPRESS
- LZX
- LZMS
wimlib's compressors for these formats usually outperform and outcompress their closed-source Microsoft equivalents. See the benchmarks.
Since these formats are not well known outside of Microsoft, some extra information is provided below.
wimlib's XPRESS and LZX decompressors are also reused in the NTFS-3G system compression plugin by the same author, which extends NTFS-3G with support for reading files compressed with the "system compression" feature Microsoft introduced in Windows 10.
XPRESS
The Huffman variant of the XPRESS compression format uses LZ77-style dictionary compression combined with Huffman coding. It is designed for fast compression and decompression with a small dictionary size (typically 4096 to 65,536 bytes). The uncompressed data is represented as a sequence of literals and matches. A literal is a single byte, whereas a match consists of a (length, offset) pair which tells the decompressor to copy 3 or more bytes from data it has already decompressed. A single alphabet of 512 symbols combines literals (256 possible values) and match headers (256 more possible values) and is Huffman-coded.
Example uses of XPRESS:
- WIM archives. To create a WIM archive that uses XPRESS compression, use: /compress:fast with DISM, or --compress=fast or --compress=xpress with wimlib-imagex.
- System compression (a.k.a. Windows 10 file compression or "CompactOS")
- Windows compression API
wimlib's XPRESS compressor supports different compression levels (the default is level 50):
- Very fast (level < 30): Greedy parsing with matches found using an HC4 matchfinder (hash chains with 4-byte hashing, plus a table with 3-byte hashing). The compression ratio is similar to that of Microsoft's XPRESS compressor.
- Fast (30 ≤ level < 60): Lazy parsing with matches found using an HC4 matchfinder (hash chains with 4-byte hashing, plus a table with 3-byte hashing). The compression ratio is better than that of Microsoft's XPRESS compressor.
- Normal/slow (level ≥ 60): "Near-optimal" parsing with matches found using a BT4 matchfinder (binary trees with 4-byte hashing, plus a table with 3-byte hashing). The compression ratio is much better than that of Microsoft's XPRESS compressor. The techniques used are similar to those used in wimlib's LZX compressor; see LZX for more details.
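The difference between these parsing strategies comes down to how eagerly matches are accepted. A greedy parse, as used at the fastest levels, can be sketched as follows (a single simplified hash chain rather than wimlib's actual HC4 structure; the hash constant, chain depth, and function names are all illustrative):

```python
MIN_MATCH = 3  # XPRESS matches are 3 bytes or longer

def hash4(data, i):
    # Multiplicative hash of the 4 bytes at position i (constants illustrative).
    v = int.from_bytes(data[i:i + 4], "little")
    return (v * 2654435761) >> 18 & 0x3FFF

def greedy_parse(data, max_depth=8):
    head = {}   # hash -> most recent position with that hash
    prev = {}   # position -> previous position with the same hash (the chain)
    tokens, i = [], 0
    while i < len(data):
        best_len, best_off = 0, 0
        if i + 4 <= len(data):
            h = hash4(data, i)
            cand, depth = head.get(h), 0
            # Walk the chain from most recent to oldest candidate position.
            while cand is not None and depth < max_depth:
                n = 0
                while i + n < len(data) and data[cand + n] == data[i + n]:
                    n += 1
                if n > best_len:
                    best_len, best_off = n, i - cand
                cand, depth = prev.get(cand), depth + 1
            prev[i], head[h] = head.get(h), i
        if best_len >= MIN_MATCH:
            tokens.append((best_len, best_off))
            i += best_len   # greedy: accept the match immediately
        else:
            tokens.append(data[i])
            i += 1
    return tokens

assert greedy_parse(b"abcabcabc") == [97, 98, 99, (6, 3)]
```

A lazy parser differs by peeking at the next position before committing: if position i+1 offers a longer match than position i, it emits a literal and takes the later match instead.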
LZX
The LZX format is basically a slightly improved version of the more commonly used DEFLATE format (the format used in ZIP files, and by zlib and gzip). It combines LZ77-style dictionary compression with Huffman coding and is intended to be used with a small-to-medium dictionary size (32,768 to 2,097,152 bytes). Uncompressed data is represented as a sequence of literals and matches. Literals and match headers are combined into a single Huffman-coded "main alphabet". The lengths of long matches are Huffman-coded using the "length alphabet". The compressor can choose to divide the input into multiple "blocks", each of which uses separate Huffman codes. In addition, the LZX format includes repeat offset codes, the option to entropy-encode the bottom three bits of match offsets, and x86 machine code preprocessing. These characteristics allow "binary" data to be compressed better with LZX than with DEFLATE. Other improvements over DEFLATE are: no coding space is wasted on an end-of-block symbol; and the main alphabet captures length-offset correlation.
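The repeat offset codes refer to a small queue of the three most recently used match offsets, which can be re-encoded much more cheaply than an explicit offset. A sketch of the LZX-style queue update rule (simplified; the function name and return convention are illustrative):

```python
def lzx_repeat_update(q, offset):
    """Given the queue q = [R0, R1, R2] (most recent first) and a match
    offset, return (repeat_code_or_None, new_queue)."""
    if offset == q[0]:          # reuse of the most recent offset: no change
        return 0, q
    if offset == q[1]:          # swap R1 to the front
        return 1, [q[1], q[0], q[2]]
    if offset == q[2]:          # swap R2 to the front
        return 2, [q[2], q[1], q[0]]
    # Explicit offset: push to the front, evicting the oldest entry.
    return None, [offset, q[0], q[1]]

assert lzx_repeat_update([10, 20, 30], 20) == (1, [20, 10, 30])
assert lzx_repeat_update([10, 20, 30], 40) == (None, [40, 10, 20])
```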
Example uses of LZX:
- WIM archives. To create a WIM archive that uses LZX compression, use: /compress:maximum with DISM, or --compress=maximum or --compress=lzx with wimlib-imagex, or simply take advantage of the fact that wimlib-imagex defaults to LZX compression.
- Windows 10 "compact mode" (System compression)
- Cabinet files
- CHM files
- Xbox Live avatars
wimlib's LZX compressor supports different compression levels (the default is level 50):
- Fast (level < 35): Lazy parsing with matches found using an HC4 matchfinder (hash chains with 4-byte hashing, plus a table with 3-byte hashing) and repeat offset match searching. The compression ratio is worse than that of Microsoft's LZX compressor, but the speed is quite fast, almost as fast as wimlib's fast mode for XPRESS.
- Normal/slow (level ≥ 35): "Near-optimal" parsing with matches found using a BT4 matchfinder (binary trees with 4-byte hashing, plus tables with 3- and 2-byte hashing), along with repeat offset match searching. The compression ratio is better than that of Microsoft's LZX compressor, while still being significantly faster. Because this is the default compression mode of wimlib-imagex, this is the most carefully engineered compressor available in wimlib. However, the LZX format does not allow as high a compression ratio as LZMS if a sufficiently large dictionary is used.
The compressor starts by preprocessing the input buffer. Then, it advances through the buffer and finds LZ77 matches. For each byte position, the 2, 3, and 4-byte sequences beginning at that position are hashed using a multiplicative hash function. The 2 and 3-byte hashes identify entries in the hash2 and hash3 tables, respectively. If either entry contains a position value, that position is checked for a length 2 or 3 match with the current position, respectively; then the entry is replaced with the current position. Assuming no hash collisions, this finds the closest length 2 and length 3 matches. Then, the 4-byte hash identifies a binary tree containing prior sequences (or positions). The sequence beginning with the 4-byte sub-sequence is searched for in the tree, and matches are reported along the search path. At the same time, the binary tree is re-rooted at the current sequence, which ensures that small offsets remain close to the root of the tree.
At some point, the compressor will decide it is time to end the current block. This happens when either the end of the input is reached, a maximum block size is reached, or when a heuristic determines that the type of data being compressed has changed significantly. Once this decision is made, the compressor runs a minimum-cost path graph search algorithm to find a concise compressed representation of the block. The (conceptual) graph being searched consists of a node per uncompressed byte and an edge per match or literal which can represent the sequence of uncompressed data between the two connected nodes. The cost of each edge is the number of bits required to represent that match or literal. A path through this graph therefore represents a valid compressed representation of the block's data with some cost, and we want the one with the minimum cost (or at least a low cost). The minimum-cost path algorithm can be thought of as similar to Dijkstra's algorithm, but it does not need to be as general (no priority queue is needed) because the graph is already topologically sorted by virtue of nodes being associated with byte positions.
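Under simplifying assumptions, this graph search reduces to a forward dynamic program over byte positions; the sketch below uses fixed bit costs in place of the real adaptive, Huffman-derived costs, and all names are illustrative:

```python
def min_cost_parse(n, matches, literal_cost=9, match_cost=20):
    """Find the cheapest path from position 0 to n.
    matches: dict mapping position -> list of (length, offset) edges.
    Returns (total_cost_in_bits, chosen token sequence)."""
    INF = float("inf")
    cost = [INF] * (n + 1)
    cost[0] = 0
    choice = [None] * (n + 1)
    # No priority queue needed: positions are already in topological order.
    for i in range(n):
        if cost[i] == INF:
            continue
        # Literal edge: advances one byte.
        if cost[i] + literal_cost < cost[i + 1]:
            cost[i + 1] = cost[i] + literal_cost
            choice[i + 1] = (i, "lit")
        # Match edges: each advances `length` bytes.
        for length, offset in matches.get(i, []):
            c = cost[i] + match_cost
            if c < cost[i + length]:
                cost[i + length] = c
                choice[i + length] = (i, ("match", length, offset))
    # Walk back from the end to recover the chosen path.
    tokens, i = [], n
    while i > 0:
        i, tok = choice[i]
        tokens.append(tok)
    tokens.reverse()
    return cost[n], tokens

# 9 bytes with one length-6 match available at position 3:
# three literals plus the match (27 + 20 = 47 bits) beats nine literals (81).
assert min_cost_parse(9, {3: [(6, 3)]}) == (47, ["lit", "lit", "lit", ("match", 6, 3)])
```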
At the default compression level (50), the minimum-cost path algorithm is actually run twice. The first pass uses edge costs derived from a mix of static default values and statistics collected about the block, whereas the second pass uses edge costs derived from the Huffman codes that were computed after the first pass. At higher compression levels, more than two passes are used for greater accuracy. This is effective in practice but is not optimal because the final Huffman codeword lengths cannot be predicted exactly, and more generally it is unknown which codeword lengths would actually produce the optimal solution.
Repeat offset matches are handled specially. Because the state of the recent offsets queue cannot be predicted in advance, repeat offset matches are searched for during the graph search step, not during the main matchfinding step. This also implies that the graph search must proceed in the forwards direction rather than in the backwards direction, since some edges aren't known in advance.
The compressor uses various heuristics to avoid an exhaustive search. For example, it only considers the smallest offset at which each distinct match length is available (e.g. if there is a length 3 match at offsets 100 and 1000, then it will only consider the one at offset 100). The compressor is also subject to cut-offs in cases where there are too many matches or the matches are very long.
Once the compressor has chosen the match/literal sequence for each block, it outputs the actual compressed bits, using Huffman encoding when required.
LZMS
LZMS is an undocumented compression format that Microsoft released in 2012 or 2013. It perhaps could be described as Microsoft's answer to LZMA (i.e. the format used in 7z and xz files), although LZMS usually produces a worse compression ratio than LZMA. Like LZMA, LZMS is an LZ77-based algorithm. It achieves a relatively high compression ratio by relying on a large LZ77 dictionary size (up to 67,108,864 bytes) and statistically modelling the LZ77 stream of literals and matches. Unlike LZMA but like some LZMA competitors such as LZHAM, LZMS uses Huffman coding in addition to the more concise arithmetic coding, presumably to make decompression faster. The Huffman codes are rebuilt periodically and are not stored with the compressed data. The format includes a preprocessing step for x86 and x86_64 machine code. It does not include a "delta" filter for multimedia data but rather allows a special "delta" match type in addition to the traditional LZ77 match type.
Example uses of LZMS:
- Solid WIM archives (ESD files), introduced in Windows 8. All the file data in these archives is concatenated together, divided into 64 MiB chunks, and compressed using LZMS compression. To create such a file with DISM, use: /compress:recovery. To create such a file with wimlib-imagex, use: --solid.
- Windows compression API. Developers can use LZMS compression in their applications. The API doesn't seem to support sliding-window compression — like in solid WIM archives, you're supposed to divide your data into blocks, and compress each block independently. The LZ77 dictionary size is the same as the block size.
wimlib's LZMS compressor supports different compression levels but is primarily optimized for the default level (50). Similar to commonly used LZMA compressors, wimlib's LZMS compressor performs a heuristic graph search over small regions of the uncompressed data to choose a concise compressed representation, considering many different possibilities and choosing the one which requires the fewest bits. LZ77-style matches are found using a string search algorithm which handles very large buffers (tens or hundreds of megabytes in size) relatively efficiently. First, the string search algorithm builds the suffix array of the input data using libdivsufsort by Yuta Mori. Second, the string search algorithm uses the suffix array to build a compact representation of the lcp-interval tree. Third, for each input buffer position in sequential order, the string search algorithm reports a match for each lcp-interval that contains that position's rank, was previously visited, and has the greatest lcp value among all lcp-intervals under consideration which were last visited equally recently. The reported match length is the lcp value of the lcp-interval, and the reported match offset is the distance to the position at which that lcp-interval was last visited. More information can be found in the source code, Kasai et al. (2001), Abouelhoda et al. (2004), and Chen et al. (2008). LZMS "delta" matches are found using a simpler data structure (a hash table).
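The first two ingredients of this string search, the suffix array and the LCP values it is paired with, can be illustrated in miniature (a naive comparison-based suffix sort stands in for libdivsufsort, and the LCP array is computed with Kasai's algorithm; the lcp-interval tree traversal itself is omitted):

```python
def suffix_array(s):
    # Naive O(n^2 log n) construction; fine for a demo, not for 64 MiB buffers.
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    """Kasai's algorithm: lcp[r] is the length of the longest common prefix
    of the suffixes at ranks r-1 and r in the suffix array."""
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):           # process suffixes in position order
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix ranked just before suffix i
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1           # the next LCP can shrink by at most 1
        else:
            h = 0
    return lcp

sa = suffix_array(b"banana")          # suffixes: a, ana, anana, banana, na, nana
assert sa == [5, 3, 1, 0, 4, 2]
assert lcp_array(b"banana", sa) == [0, 1, 3, 0, 0, 2]
```

Consecutive runs of large LCP values correspond to the lcp-intervals from which the matchfinder reports (length, offset) pairs.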
Benchmarks
As mentioned, wimlib's compressors for XPRESS, LZX, and LZMS usually outperform and outcompress their Microsoft equivalents. In addition, more compression settings are provided. Although results will vary depending on the data being compressed, the table below shows results for a common use case: creating an x86 Windows PE image ("boot.wim"). Each row shows the compression type, the size of the resulting WIM archive in bytes, and the time it took to create the file. When possible, the results with the Microsoft equivalent are included.
Compression | wimlib (v1.10.0) | WIMGAPI (Windows 10) |
---|---|---|
LZMS (solid) [1] | 88,114,666 in 122.2s | 88,772,520 in 201.4s |
LZMS (non-solid) [2] | 116,147,378 in 60.0s | N/A |
LZX (slow) [3] | 125,514,977 in 48.4s | N/A |
LZX (normal) [4] | 125,929,467 in 27.5s | 127,295,034 in 45.9s |
LZX (quick) [5] | 129,997,269 in 4.5s | N/A |
XPRESS (slow) [6] | 135,147,924 in 20.8s | N/A |
XPRESS [7] | 137,953,928 in 3.6s | 140,459,585 in 8.5s |
"WIMBoot" (slow) [8] | 165,005,928 in 18.5s | N/A |
"WIMBoot" [9] | 166,893,119 in 3.7s | 169,111,375 in 10.5s |
None [10] | 361,315,118 in 1.4s | 361,315,144 in 5.4s |
Notes:
1. --solid for wimlib-imagex; /compress:recovery for DISM. Compression chunk size in solid resources defaults to 67,108,864 bytes in both cases.
2. --compress=LZMS for wimlib-imagex. Compression chunk size defaults to 131,072 bytes.
3. --compress=LZX:100 for wimlib-imagex. Compression chunk size defaults to 32,768 bytes.
4. --compress=LZX:50 or --compress=LZX or no option for wimlib-imagex; /compress:maximum for DISM. Compression chunk size defaults to 32,768 bytes in both cases.
5. --compress=LZX:20 for wimlib-imagex. Compression chunk size defaults to 32,768 bytes.
6. --compress=XPRESS:80 for wimlib-imagex. Compression chunk size defaults to 32,768 bytes.
7. --compress=XPRESS:50 or --compress=XPRESS for wimlib-imagex; /compress:fast for DISM. Compression chunk size defaults to 32,768 bytes in both cases.
8. --wimboot --compress=XPRESS:80 for wimlib-imagex. Same format as [9], but trying harder to get a good compression ratio.
9. --wimboot for wimlib-imagex; /wimboot for DISM. Equivalent to --compress=XPRESS --chunk-size=4096.
10. --compress=none for wimlib-imagex; /compress:none for DISM.
wimlib-imagex's --compress option also accepts the "fast", "maximum", and "recovery" aliases for XPRESS, LZX, and LZMS, respectively.
Testing environment:
- 64 bit binaries
- Windows 10 virtual machine running on Linux with VT-x
- 2 CPUs and 3 GB memory given to virtual machine
- SSD-backed virtual disk
- All tests done with page cache warmed
The compression ratio provided by wimlib is also competitive with commonly used archive formats. Below are file sizes that result when the Canterbury corpus is compressed with wimlib (v1.10.0), WIMGAPI (Windows 10), and some other formats/programs:
Format | Size (bytes) |
---|---|
7z (7-zip, -9) | 483,239 |
7z (7-zip, default) | 484,700 |
tar.xz (xz, -9) | 486,904 |
tar.xz (xz, default) | 486,916 |
WIM (wimlib, LZMS solid) | 515,800 |
WIM (WIMGAPI, LZMS solid) | 521,366 |
tar.bz2 (bzip2, -9) | 565,008 |
tar.bz2 (bzip2, default) | 565,008 |
WIM (wimlib, LZMS non-solid) | 581,046 |
WIM (wimlib, LZX slow) | 618,766 |
WIM (wimlib, LZX normal) | 621,898 |
WIM (WIMGAPI, LZX) | 651,866 |
WIM (wimlib, LZX quick) | 686,420 |
ZIP (Info-ZIP, -9) | 732,297 |
tar.gz (gzip, -9) | 733,971 |
ZIP (Info-ZIP, default) | 735,334 |
tar.gz (gzip, default) | 738,796 |
WIM (wimlib, XPRESS) | 787,356 |
WIM (WIMGAPI, XPRESS) | 825,536 |
WIM (wimlib, None) | 2,814,216 |
WIM (WIMGAPI, None) | 2,814,254 |
tar | 2,826,240 |
WIM does even better on directory trees containing duplicate files, which the Canterbury corpus doesn't have.
wimlib.net (website) Copyright 2015-2023 Eric Biggers
wimlib (software) Copyright 2012-2023 Eric Biggers