Back to main

Compression with fast context modelling - CM4

CM4 is a fast (~5 MB/sec on a modern desktop) context-modelling compressor I wrote for general-purpose data compression. Although it has no restrictions on the type of data, it is ideal for text and natural language compression. CM4 mixes low-order context modelling with an unbounded match model by using dynamic mixers (secondary symbol estimates). A standard PAQ-style range coder is used to encode input bits with calculated probabilities. Large file sizes (>4 GB) are supported.

A recursive generator is used to generate a finite state machine. Each state contains a 6-bit quantized probability of a given sequence of 1s and 0s, and the index of the next state for transitions. Context modelling is performed with a dense order-2 table and a hash table for order-4. The hash table resolves collisions using a key generated by XORing order-1 and order-0 (previous byte xor partial current byte). States are only looked up once per byte, rather than once per bit.

A match model is used to match and predict unbounded contexts beyond order-8 (practically, up to a few MB). A hash table maps hashes of the last 8 input bytes to a ring buffer. If an unexpected bit is encountered, the match is discarded to 0. There is no collision detection in the hash table; in practice, bad matches are quickly found and discarded, and the impact of short incorrect sequences is minimal. Secondary symbol estimates are used to mix the different weights together in two stages. The first stage accepts the quantized probabilities for order-2 and order-4 (or just order-2 if no order-4 is available) and outputs a single probability. Updating it based on gradient descent provides more accurate blending than with static weights. If a match is available, the predicted probability of the first stage and a match index are input into the second stage to produce a new predicted probability. The match index is based on the predicted bit and the number of bits correctly predicted up to that point. The number of bits is transformed with log2 to group distance ranges and the current bit within the byte. Longer matches have strongly biased probabilities.


64-bit exe

CM5 - 64-bit exe - A new model designed for better speeds. Compression ratios are slightly less, speed is much higher, and memory usage is much lower (~15 MB compared to 128 MB for CM4).