wimlib
the open source Windows Imaging (WIM) library
- Compression algorithms
wimlib supports compression and decompression in all of the compression formats known to be used in WIM archives:
- XPRESS
- LZX
- LZMS
wimlib's compressors for these formats usually outperform and outcompress their closed-source Microsoft equivalents. See the benchmarks.
Since these formats are not well known outside of Microsoft, some extra information is provided below.
wimlib's XPRESS and LZX decompressors are also reused in the NTFS-3G system compression plugin by the same author, which extends NTFS-3G with support for reading files compressed with the "system compression" feature Microsoft introduced in Windows 10.
XPRESS
The Huffman variant of the XPRESS compression format uses LZ77-style dictionary compression combined with Huffman coding. It is designed for fast compression and decompression with a small dictionary size (typically 4096 to 65,536 bytes). The uncompressed data is represented as a sequence of literals and matches. A literal is a single byte, whereas a match consists of a (length, offset) pair which tells the decompressor to copy 3 or more bytes from data it has already decompressed. A single alphabet of 512 symbols combines literals (256 possible values) and match headers (256 more possible values) and is Huffman-coded.
Example uses of XPRESS:
- WIM archives. To create a WIM archive that uses XPRESS compression, use: /compress:fast with DISM, or --compress=fast or --compress=xpress with wimlib-imagex.
- System compression (a.k.a. Windows 10 file compression or "CompactOS")
- Windows compression API
wimlib's XPRESS compressor supports different compression levels (the default is level 50):
- Very fast (level < 30): Greedy parsing with matches found using an HC4 matchfinder (hash chains with 4-byte hashing, plus a table with 3-byte hashing). The compression ratio is similar to that of Microsoft's XPRESS compressor.
- Fast (30 ≤ level < 60): Lazy parsing with matches found using an HC4 matchfinder (hash chains with 4-byte hashing, plus a table with 3-byte hashing). The compression ratio is better than that of Microsoft's XPRESS compressor.
- Normal/slow (level ≥ 60): "Near-optimal" parsing with matches found using a BT4 matchfinder (binary trees with 4-byte hashing, plus a table with 3-byte hashing). The compression ratio is much better than that of Microsoft's XPRESS compressor. The techniques used are similar to those used in wimlib's LZX compressor; see LZX for more details.
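The difference between these parsing strategies comes down to how eagerly matches are accepted. A greedy parse, as used at the fastest levels, can be sketched as follows (a single simplified hash chain rather than wimlib's actual HC4 structure; the hash constant, chain depth, and function names are all illustrative):

```python
MIN_MATCH = 3  # XPRESS matches are 3 bytes or longer

def hash4(data, i):
    # Multiplicative hash of the 4 bytes at position i (constants illustrative).
    v = int.from_bytes(data[i:i + 4], "little")
    return (v * 2654435761) >> 18 & 0x3FFF

def greedy_parse(data, max_depth=8):
    head = {}   # hash -> most recent position with that hash
    prev = {}   # position -> previous position with the same hash (the chain)
    tokens, i = [], 0
    while i < len(data):
        best_len, best_off = 0, 0
        if i + 4 <= len(data):
            h = hash4(data, i)
            cand, depth = head.get(h), 0
            # Walk the chain from most recent to oldest candidate position.
            while cand is not None and depth < max_depth:
                n = 0
                while i + n < len(data) and data[cand + n] == data[i + n]:
                    n += 1
                if n > best_len:
                    best_len, best_off = n, i - cand
                cand, depth = prev.get(cand), depth + 1
            prev[i], head[h] = head.get(h), i
        if best_len >= MIN_MATCH:
            tokens.append((best_len, best_off))
            i += best_len   # greedy: accept the match immediately
        else:
            tokens.append(data[i])
            i += 1
    return tokens

assert greedy_parse(b"abcabcabc") == [97, 98, 99, (6, 3)]
```

A lazy parser differs by peeking at the next position before committing: if position i+1 offers a longer match than position i, it emits a literal and takes the later match instead.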
LZX
The LZX format is basically a slightly improved version of the more commonly used DEFLATE format (the format used in ZIP files, and by zlib and gzip). It combines LZ77-style dictionary compression with Huffman coding and is intended to be used with a small-to-medium dictionary size (32,768 to 2,097,152 bytes). Uncompressed data is represented as a sequence of literals and matches. Literals and match headers are combined into a single Huffman-coded "main alphabet". The lengths of long matches are Huffman-coded using the "length alphabet". The compressor can choose to divide the input into multiple "blocks", each of which uses separate Huffman codes. In addition, the LZX format includes repeat offset codes, the option to entropy-encode the bottom three bits of match offsets, and x86 machine code preprocessing. These characteristics allow "binary" data to be compressed better with LZX than with DEFLATE. Other improvements over DEFLATE are: no coding space is wasted on an end-of-block symbol; and the main alphabet captures length-offset correlation.
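The repeat offset codes refer to a small queue of the three most recently used match offsets, which can be re-encoded much more cheaply than an explicit offset. A sketch of the LZX-style queue update rule (simplified; the function name and return convention are illustrative):

```python
def lzx_repeat_update(q, offset):
    """Given the queue q = [R0, R1, R2] (most recent first) and a match
    offset, return (repeat_code_or_None, new_queue)."""
    if offset == q[0]:          # reuse of the most recent offset: no change
        return 0, q
    if offset == q[1]:          # swap R1 to the front
        return 1, [q[1], q[0], q[2]]
    if offset == q[2]:          # swap R2 to the front
        return 2, [q[2], q[1], q[0]]
    # Explicit offset: push to the front, evicting the oldest entry.
    return None, [offset, q[0], q[1]]

assert lzx_repeat_update([10, 20, 30], 20) == (1, [20, 10, 30])
assert lzx_repeat_update([10, 20, 30], 40) == (None, [40, 10, 20])
```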
Example uses of LZX:
- WIM archives. To create a WIM archive that uses LZX compression, use: /compress:maximum with DISM, or --compress=maximum or --compress=lzx with wimlib-imagex, or simply take advantage of the fact that wimlib-imagex defaults to LZX compression.
- Windows 10 "compact mode" (System compression)
- Cabinet files
- CHM files
- Xbox Live avatars
wimlib's LZX compressor supports different compression levels (the default is level 50):
- Fast (level < 35): Lazy parsing with matches found using an HC4 matchfinder (hash chains with 4-byte hashing, plus a table with 3-byte hashing) and repeat offset match searching. The compression ratio is worse than that of Microsoft's LZX compressor, but the speed is quite fast, almost as fast as wimlib's fast mode for XPRESS.
- Normal/slow (level ≥ 35): "Near-optimal" parsing with matches found using a BT4 matchfinder (binary trees with 4-byte hashing, plus tables with 3- and 2-byte hashing), along with repeat offset match searching. The compression ratio is better than that of Microsoft's LZX compressor, while still being significantly faster. Because this is the default compression mode of wimlib-imagex, this is the most carefully engineered compressor available in wimlib. However, the LZX format does not allow as high a compression ratio as LZMS if a sufficiently large dictionary is used.
The compressor starts by preprocessing the input buffer. Then, it advances through the buffer and finds LZ77 matches. For each byte position, the 2, 3, and 4-byte sequences beginning at that position are hashed using a multiplicative hash function. The 2 and 3-byte hashes identify entries in the hash2 and hash3 tables, respectively. If either entry contains a position value, that position is checked for a length 2 or 3 match with the current position, respectively; then the entry is replaced with the current position. Assuming no hash collisions, this finds the closest length 2 and length 3 matches. Then, the 4-byte hash identifies a binary tree containing prior sequences (or positions). The sequence beginning with the 4-byte sub-sequence is searched for in the tree, and matches are reported along the search path. At the same time, the binary tree is re-rooted at the current sequence, which ensures that small offsets remain close to the root of the tree.
At some point, the compressor will decide it is time to end the current block. This happens when either the end of the input is reached, a maximum block size is reached, or when a heuristic determines that the type of data being compressed has changed significantly. Once this decision is made, the compressor runs a minimum-cost path graph search algorithm to find a concise compressed representation of the block. The (conceptual) graph being searched consists of a node per uncompressed byte and an edge per match or literal which can represent the sequence of uncompressed data between the two connected nodes. The cost of each edge is the number of bits required to represent that match or literal. A path through this graph therefore represents a valid compressed representation of the block's data with some cost, and we want the one with the minimum cost (or at least a low cost). The minimum-cost path algorithm can be thought of as similar to Dijkstra's algorithm, but it does not need to be as general (no priority queue is needed) because the graph is already topologically sorted by virtue of nodes being associated with byte positions.
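Under simplifying assumptions, this graph search reduces to a forward dynamic program over byte positions; the sketch below uses fixed bit costs in place of the real adaptive, Huffman-derived costs, and all names are illustrative:

```python
def min_cost_parse(n, matches, literal_cost=9, match_cost=20):
    """Find the cheapest path from position 0 to n.
    matches: dict mapping position -> list of (length, offset) edges.
    Returns (total_cost_in_bits, chosen token sequence)."""
    INF = float("inf")
    cost = [INF] * (n + 1)
    cost[0] = 0
    choice = [None] * (n + 1)
    # No priority queue needed: positions are already in topological order.
    for i in range(n):
        if cost[i] == INF:
            continue
        # Literal edge: advances one byte.
        if cost[i] + literal_cost < cost[i + 1]:
            cost[i + 1] = cost[i] + literal_cost
            choice[i + 1] = (i, "lit")
        # Match edges: each advances `length` bytes.
        for length, offset in matches.get(i, []):
            c = cost[i] + match_cost
            if c < cost[i + length]:
                cost[i + length] = c
                choice[i + length] = (i, ("match", length, offset))
    # Walk back from the end to recover the chosen path.
    tokens, i = [], n
    while i > 0:
        i, tok = choice[i]
        tokens.append(tok)
    tokens.reverse()
    return cost[n], tokens

# 9 bytes with one length-6 match available at position 3:
# three literals plus the match (27 + 20 = 47 bits) beats nine literals (81).
assert min_cost_parse(9, {3: [(6, 3)]}) == (47, ["lit", "lit", "lit", ("match", 6, 3)])
```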
At the default compression level (50), the minimum-cost path algorithm is actually run twice. The first pass uses edge costs derived from a mix of static default values and statistics collected about the block, whereas the second pass uses edge costs derived from the Huffman codes that were computed after the first pass. At higher compression levels, more than two passes are used for greater accuracy. This is effective in practice but is not optimal because the final Huffman codeword lengths cannot be predicted exactly, and more generally it is unknown which codeword lengths would actually produce the optimal solution.
Repeat offset matches are handled specially. Because the state of the recent offsets queue cannot be predicted in advance, repeat offset matches are searched for during the graph search step, not during the main matchfinding step. This also implies that the graph search must proceed in the forwards direction rather than in the backwards direction, since some edges aren't known in advance.
The compressor uses various heuristics to avoid an exhaustive search. For example, it only considers the smallest offset at which each distinct match length is available (e.g. if there is a length 3 match at offsets 100 and 1000, then it will only consider the one at offset 100). The compressor is also subject to cut-offs in cases where there are too many matches or the matches are very long.
Once the compressor has chosen the match/literal sequence for each block, it outputs the actual compressed bits, using Huffman encoding when required.
LZMS
LZMS is an undocumented compression format that Microsoft released in 2012 or 2013. It perhaps could be described as Microsoft's answer to LZMA (i.e. the format used in 7z and xz files), although LZMS usually produces a worse compression ratio than LZMA. Like LZMA, LZMS is an LZ77-based algorithm. It achieves a relatively high compression ratio by relying on a large LZ77 dictionary size (up to 67,108,864 bytes) and statistically modelling the LZ77 stream of literals and matches. Unlike LZMA but like some LZMA competitors such as LZHAM, LZMS uses Huffman coding in addition to the more concise arithmetic coding, presumably to make decompression faster. The Huffman codes are rebuilt periodically and are not stored with the compressed data. The format includes a preprocessing step for x86 and x86_64 machine code. It does not include a "delta" filter for multimedia data but rather allows a special "delta" match type in addition to the traditional LZ77 match type.
Example uses of LZMS:
- Solid WIM archives (ESD files), introduced in Windows 8. All the file data in these archives is concatenated together, divided into 64 MiB chunks, and compressed using LZMS compression. To create such a file with DISM, use: /compress:recovery. To create such a file with wimlib-imagex, use: --solid.
- Windows compression API. Developers can use LZMS compression in their applications. The API doesn't seem to support sliding-window compression — like in solid WIM archives, you're supposed to divide your data into blocks, and compress each block independently. The LZ77 dictionary size is the same as the block size.
wimlib's LZMS compressor supports different compression levels but is primarily optimized for the default level (50). Similar to commonly used LZMA compressors, wimlib's LZMS compressor performs a heuristic graph search over small regions of the uncompressed data to choose a concise compressed representation, considering many different possibilities and choosing the one which requires the fewest bits. LZ77-style matches are found using a string search algorithm which handles very large buffers (tens or hundreds of megabytes in size) relatively efficiently. First, the string search algorithm builds the suffix array of the input data using libdivsufsort by Yuta Mori. Second, the string search algorithm uses the suffix array to build a compact representation of the lcp-interval tree. Third, for each input buffer position in sequential order, the string search algorithm reports a match for each lcp-interval that contains that position's rank, was previously visited, and has the greatest lcp value among all lcp-intervals under consideration which were last visited equally recently. The reported match length is the lcp value of the lcp-interval, and the reported match offset is the distance to the position at which that lcp-interval was last visited. More information can be found in the source code, Kasai et al. (2001), Abouelhoda et al. (2004), and Chen et al. (2008). LZMS "delta" matches are found using a simpler data structure (a hash table).
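The first two ingredients of this string search, the suffix array and the LCP values it is paired with, can be illustrated in miniature (a naive comparison-based suffix sort stands in for libdivsufsort, and the LCP array is computed with Kasai's algorithm; the lcp-interval tree traversal itself is omitted):

```python
def suffix_array(s):
    # Naive O(n^2 log n) construction; fine for a demo, not for 64 MiB buffers.
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    """Kasai's algorithm: lcp[r] is the length of the longest common prefix
    of the suffixes at ranks r-1 and r in the suffix array."""
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):           # process suffixes in position order
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix ranked just before suffix i
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1           # the next LCP can shrink by at most 1
        else:
            h = 0
    return lcp

sa = suffix_array(b"banana")          # suffixes: a, ana, anana, banana, na, nana
assert sa == [5, 3, 1, 0, 4, 2]
assert lcp_array(b"banana", sa) == [0, 1, 3, 0, 0, 2]
```

Consecutive runs of large LCP values correspond to the lcp-intervals from which the matchfinder reports (length, offset) pairs.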
Benchmarks
As mentioned, wimlib's compressors for XPRESS, LZX, and LZMS usually outperform and outcompress their Microsoft equivalents. In addition, more compression settings are provided. Although results will vary depending on the data being compressed, the table below shows results for a common use case: creating an x86 Windows PE image ("boot.wim"). Each row shows the compression type, the size of the resulting WIM archive in bytes, and the time it took to create the file. When possible, the results with the Microsoft equivalent are included.
Compression | wimlib (v1.10.0) | WIMGAPI (Windows 10) |
---|---|---|
LZMS (solid) [1] | 88,114,666 in 122.2s | 88,772,520 in 201.4s |
LZMS (non-solid) [2] | 116,147,378 in 60.0s | N/A |
LZX (slow) [3] | 125,514,977 in 48.4s | N/A |
LZX (normal) [4] | 125,929,467 in 27.5s | 127,295,034 in 45.9s |
LZX (quick) [5] | 129,997,269 in 4.5s | N/A |
XPRESS (slow) [6] | 135,147,924 in 20.8s | N/A |
XPRESS [7] | 137,953,928 in 3.6s | 140,459,585 in 8.5s |
"WIMBoot" (slow) [8] | 165,005,928 in 18.5s | N/A |
"WIMBoot" [9] | 166,893,119 in 3.7s | 169,111,375 in 10.5s |
None [10] | 361,315,118 in 1.4s | 361,315,144 in 5.4s |
Notes:
1. --solid for wimlib-imagex; /compress:recovery for DISM. Compression chunk size in solid resources defaults to 67,108,864 bytes in both cases.
2. --compress=LZMS for wimlib-imagex. Compression chunk size defaults to 131,072 bytes.
3. --compress=LZX:100 for wimlib-imagex. Compression chunk size defaults to 32,768 bytes.
4. --compress=LZX:50 or --compress=LZX or no option for wimlib-imagex; /compress:maximum for DISM. Compression chunk size defaults to 32,768 bytes in both cases.
5. --compress=LZX:20 for wimlib-imagex. Compression chunk size defaults to 32,768 bytes.
6. --compress=XPRESS:80 for wimlib-imagex. Compression chunk size defaults to 32,768 bytes.
7. --compress=XPRESS:50 or --compress=XPRESS for wimlib-imagex; /compress:fast for DISM. Compression chunk size defaults to 32,768 bytes in both cases.
8. --wimboot --compress=XPRESS:80 for wimlib-imagex. Same format as [9], but trying harder to get a good compression ratio.
9. --wimboot for wimlib-imagex; /wimboot for DISM. Equivalent to --compress=XPRESS --chunk-size=4096.
10. --compress=none for wimlib-imagex; /compress:none for DISM.
wimlib-imagex's --compress option also accepts the "fast", "maximum", and "recovery" aliases for XPRESS, LZX, and LZMS, respectively.
Testing environment:
- 64 bit binaries
- Windows 10 virtual machine running on Linux with VT-x
- 2 CPUs and 3 GB memory given to virtual machine
- SSD-backed virtual disk
- All tests done with page cache warmed
The compression ratio provided by wimlib is also competitive with commonly used archive formats. Below are file sizes that result when the Canterbury corpus is compressed with wimlib (v1.10.0), WIMGAPI (Windows 10), and some other formats/programs:
Format | Size (bytes) |
---|---|
7z (7-zip, -9) | 483,239 |
7z (7-zip, default) | 484,700 |
tar.xz (xz, -9) | 486,904 |
tar.xz (xz, default) | 486,916 |
WIM (wimlib, LZMS solid) | 515,800 |
WIM (WIMGAPI, LZMS solid) | 521,366 |
tar.bz2 (bzip2, -9) | 565,008 |
tar.bz2 (bzip2, default) | 565,008 |
WIM (wimlib, LZMS non-solid) | 581,046 |
WIM (wimlib, LZX slow) | 618,766 |
WIM (wimlib, LZX normal) | 621,898 |
WIM (WIMGAPI, LZX) | 651,866 |
WIM (wimlib, LZX quick) | 686,420 |
ZIP (Info-ZIP, -9) | 732,297 |
tar.gz (gzip, -9) | 733,971 |
ZIP (Info-ZIP, default) | 735,334 |
tar.gz (gzip, default) | 738,796 |
WIM (wimlib, XPRESS) | 787,356 |
WIM (WIMGAPI, XPRESS) | 825,536 |
WIM (wimlib, None) | 2,814,216 |
WIM (WIMGAPI, None) | 2,814,254 |
tar | 2,826,240 |
WIM does even better on directory trees containing duplicate files, which the Canterbury corpus doesn't have.
wimlib.net (website) Copyright 2015-2023 Eric Biggers
wimlib (software) Copyright 2012-2023 Eric Biggers