lzxc 0.1.0

LZX compression encoder. Produces streams decodable by the `lzxd` crate.
Documentation
  • Coverage
  • 100%
    23 out of 23 items documented4 out of 19 items with examples
  • Size
  • Source code size: 120.6 kB This is the summed size of all the files inside the crates.io package for this release.
  • Documentation size: 3.61 MB This is the summed size of all files generated by rustdoc for all configured targets
  • Ø build duration
  • this release: 1m 4s Average build duration of successful builds.
  • all releases: 1m 4s Average build duration of successful builds in releases after 2024-10-23.
  • Links
  • landaire/acceleration
    7 0 0
  • crates.io
  • Dependencies
  • Versions
  • Owners
  • landaire

lzxc

LZX compression encoder in pure Rust. Produces byte streams that round-trip through the lzxd decoder.

Targets the LZX variant used by Xbox 360 XEX2 executables, CAB archives, CHM help files, and Microsoft Patch (MS-PATCH) files.

What's in this crate

  • Encoder -- low-level chunk-at-a-time primitive. Feed up to 32 KB of input per call, get back one compressed chunk. Use this when you need custom chunk framing.
  • EncoderWriter<W: Write> -- high-level std::io::Write sink. Buffers input into 32 KB slabs, compresses each, forwards as u16 BE chunk_size | chunk_bytes frames.
  • Canonical Huffman with 16-bit length cap (falls back to length-limited package-merge when the classical merge overflows).
  • Pretree delta+RLE tree transmission with proactive symbol-19 (same-run) emission.
  • Stateful hash-chain match finder that carries window history across chunk boundaries so matches can span them.
  • E8 preprocessing for x86 call-target translation (off by default; not useful for Xbox 360 PPC payloads).

Usage

use lzxc::{EncoderWriter, WindowSize};
use std::io::Write;

let mut out = Vec::<u8>::new();
let mut enc = EncoderWriter::new(&mut out, WindowSize::KB64);
enc.write_all(b"hello world, hello world, hello world!").unwrap();
enc.finish().unwrap();
// `out` now holds `u16 BE size | compressed chunk` frames back-to-back.

For raw chunk-at-a-time output (e.g. custom framing):

use lzxc::{Encoder, WindowSize, MAX_CHUNK_SIZE};

let input: &[u8] = b"...";
let mut enc = Encoder::new(WindowSize::KB64);
for chunk in input.chunks(MAX_CHUNK_SIZE) {
    let compressed = enc.encode_chunk(chunk);
    // compressed[..] is decodable by lzxd::Lzxd::decompress_next.
}

Benchmarks

Measured with cargo bench -p lzxc on an Apple M4 Max, single thread. All corpora are generated deterministically inside the bench binary, so results reproduce on any host without shipping real executables.

Corpora:

  • text-256k: 256 KB of English-ish prose built word-at-a-time from a ~200-word vocabulary via LCG. Byte distribution and bigram frequencies look like natural language; no phrase-level periodicity.
  • text-256k-pathological: 256 KB of a single 107-byte phrase repeated end-to-end. Kept as a sanity-check ceiling on what LZ77+Huffman can do on a perfectly periodic signal — it is not representative of realistic text.
  • structured-1m: 1 MB mix of zero-runs, a repeating 64-byte "jump table" template, biased opcode-like bytes, and LCG random. Roughly mirrors the compressibility profile of a typical executable without impersonating any specific one.
  • random-256k: 256 KB of LCG output. Worst case for LZ matching.

Compression ratio is input_len / compressed_len (higher is better).

Strategy comparison (64 KB window)

Strategy Corpus Ratio Throughput
Greedy text-256k 2.50x 94 MiB/s
Greedy text-256k-pathological 271.93x 2.74 GiB/s
Greedy structured-1m 2.90x 107 MiB/s
Greedy random-256k 1.00x 57 MiB/s
LiteralOnly text-256k 1.92x 256 MiB/s
LiteralOnly text-256k-pathological 1.78x 254 MiB/s
LiteralOnly structured-1m 1.39x 313 MiB/s
LiteralOnly random-256k 1.00x 260 MiB/s
Uncompressed text-256k 1.00x 38.2 GiB/s
Uncompressed structured-1m 1.00x 45.6 GiB/s
Uncompressed random-256k 1.00x 37.0 GiB/s

Notes:

  • Greedy on realistic English prose (text-256k) lands at 2.50x, within the 2-3x range typical of LZ77+Huffman on natural language.
  • Greedy on binary-shaped data (structured-1m) reaches 2.90x; this is the workload lzxc was tuned for.
  • text-256k-pathological — one 107-byte phrase repeated for 256 KB — shows the theoretical ceiling: 272x because every match collapses to the same main-tree symbol. Useful as a bound, not a headline.
  • Random input triggers the encoder's uncompressed-block fallback inside Greedy, and compresses to 1.00x under every strategy.
  • Uncompressed is essentially memcpy with a 28-byte block header per 32 KB chunk.

Window-size sweep (Greedy)

Window controls how far back matches can reach. Ratio plateaus quickly on these corpora because the dominant match distances are well under 32 KB; larger windows then pay throughput for bigger hash-chain / history buffers that don't find new work.

Window text-256k structured-1m
32 KB 2.51x / 97 MiB/s 2.90x / 146 MiB/s
64 KB 2.50x / 94 MiB/s 2.90x / 107 MiB/s
128 KB 2.50x / 96 MiB/s 2.90x / 99 MiB/s
512 KB 2.51x / 96 MiB/s 2.90x / 85 MiB/s
1 MB 2.51x / 95 MiB/s 2.90x / 91 MiB/s
2 MB 2.51x / 93 MiB/s 2.90x / 91 MiB/s

Payloads with long-range repetition (e.g. multi-megabyte binaries with duplicated data sections) are where larger windows earn their cost; see WindowSize for the full list.

To reproduce:

cargo bench -p lzxc

The bench binary also prints one ratio: ... line per corpus/strategy pair to stderr so the numbers can be regenerated without opening the HTML reports.

Limitations

4 GiB input cap per encoder

A single Encoder / EncoderWriter can compress up to 4 GiB of input before its match finder silently degrades to literal-only output. Internal hash-chain positions are stored as u32, so once total bytes fed through one encoder exceed u32::MAX, the stale-entry guard rejects every match candidate. The output stream remains valid LZX (it decodes correctly via lzxd), but compression ratio collapses to ~1.0x for all bytes past the 4 GiB mark.

In practice XEX payloads and other LZX targets are orders of magnitude below this bound. If you need to compress larger streams, segment the input and construct a fresh Encoder per segment.

License

Licensed under either of Apache License, Version 2.0 or MIT license at your option.