lzxc
LZX compression encoder in pure Rust. Produces byte streams that
round-trip through the lzxd decoder.
Targets the LZX variant used by Xbox 360 XEX2 executables, CAB archives, CHM help files, and Microsoft Patch (MS-PATCH) files.
What's in this crate
Encoder-- low-level chunk-at-a-time primitive. Feed up to 32 KB of input per call, get back one compressed chunk. Use this when you need custom chunk framing.EncoderWriter<W: Write>-- high-levelstd::io::Writesink. Buffers input into 32 KB slabs, compresses each, forwards asu16 BE chunk_size | chunk_bytesframes.- Canonical Huffman with 16-bit length cap (falls back to length-limited package-merge when the classical merge overflows).
- Pretree delta+RLE tree transmission with proactive symbol-19 (same-run) emission.
- Stateful hash-chain match finder that carries window history across chunk boundaries so matches can span them.
- E8 preprocessing for x86 call-target translation (off by default; not useful for Xbox 360 PPC payloads).
Usage
use ;
use Write;
let mut out = Vec::new;
let mut enc = new;
enc.write_all.unwrap;
enc.finish.unwrap;
// `out` now holds `u16 BE size | compressed chunk` frames back-to-back.
For raw chunk-at-a-time output (e.g. custom framing):
use ;
let input: & = b"...";
let mut enc = new;
for chunk in input.chunks
Benchmarks
Measured with cargo bench -p lzxc on an Apple M4 Max, single thread. All
corpora are generated deterministically inside the bench binary, so results
reproduce on any host without shipping real executables.
Corpora:
text-256k: 256 KB of English-ish prose built word-at-a-time from a ~200-word vocabulary via LCG. Byte distribution and bigram frequencies look like natural language; no phrase-level periodicity.text-256k-pathological: 256 KB of a single 107-byte phrase repeated end-to-end. Kept as a sanity-check ceiling on what LZ77+Huffman can do on a perfectly periodic signal — it is not representative of realistic text.structured-1m: 1 MB mix of zero-runs, a repeating 64-byte "jump table" template, biased opcode-like bytes, and LCG random. Roughly mirrors the compressibility profile of a typical executable without impersonating any specific one.random-256k: 256 KB of LCG output. Worst case for LZ matching.
Compression ratio is input_len / compressed_len (higher is better).
Strategy comparison (64 KB window)
| Strategy | Corpus | Ratio | Throughput |
|---|---|---|---|
| Greedy | text-256k | 2.50x | 94 MiB/s |
| Greedy | text-256k-pathological | 271.93x | 2.74 GiB/s |
| Greedy | structured-1m | 2.90x | 107 MiB/s |
| Greedy | random-256k | 1.00x | 57 MiB/s |
| LiteralOnly | text-256k | 1.92x | 256 MiB/s |
| LiteralOnly | text-256k-pathological | 1.78x | 254 MiB/s |
| LiteralOnly | structured-1m | 1.39x | 313 MiB/s |
| LiteralOnly | random-256k | 1.00x | 260 MiB/s |
| Uncompressed | text-256k | 1.00x | 38.2 GiB/s |
| Uncompressed | structured-1m | 1.00x | 45.6 GiB/s |
| Uncompressed | random-256k | 1.00x | 37.0 GiB/s |
Notes:
Greedyon realistic English prose (text-256k) lands at 2.50x, within the 2-3x range typical of LZ77+Huffman on natural language.Greedyon binary-shaped data (structured-1m) reaches 2.90x; this is the workloadlzxcwas tuned for.text-256k-pathological— one 107-byte phrase repeated for 256 KB — shows the theoretical ceiling: 272x because every match collapses to the same main-tree symbol. Useful as a bound, not a headline.- Random input triggers the encoder's uncompressed-block fallback inside
Greedy, and compresses to 1.00x under every strategy. Uncompressedis essentiallymemcpywith a 28-byte block header per 32 KB chunk.
Window-size sweep (Greedy)
Window controls how far back matches can reach. Ratio plateaus quickly on these corpora because the dominant match distances are well under 32 KB; larger windows then pay throughput for bigger hash-chain / history buffers that don't find new work.
| Window | text-256k | structured-1m |
|---|---|---|
| 32 KB | 2.51x / 97 MiB/s | 2.90x / 146 MiB/s |
| 64 KB | 2.50x / 94 MiB/s | 2.90x / 107 MiB/s |
| 128 KB | 2.50x / 96 MiB/s | 2.90x / 99 MiB/s |
| 512 KB | 2.51x / 96 MiB/s | 2.90x / 85 MiB/s |
| 1 MB | 2.51x / 95 MiB/s | 2.90x / 91 MiB/s |
| 2 MB | 2.51x / 93 MiB/s | 2.90x / 91 MiB/s |
Payloads with long-range repetition (e.g. multi-megabyte binaries with
duplicated data sections) are where larger windows earn their cost; see
WindowSize for
the full list.
To reproduce:
cargo bench -p lzxc
The bench binary also prints one ratio: ... line per corpus/strategy
pair to stderr so the numbers can be regenerated without opening the HTML
reports.
Limitations
4 GiB input cap per encoder
A single Encoder / EncoderWriter can compress up to 4 GiB of input
before its match finder silently degrades to literal-only output. Internal
hash-chain positions are stored as u32, so once total bytes fed through
one encoder exceed u32::MAX, the stale-entry guard rejects every match
candidate. The output stream remains valid LZX (it decodes correctly via
lzxd), but compression ratio collapses to ~1.0x for all bytes past the
4 GiB mark.
In practice XEX payloads and other LZX targets are orders of magnitude
below this bound. If you need to compress larger streams, segment the
input and construct a fresh Encoder per segment.
License
Licensed under either of Apache License, Version 2.0 or MIT license at your option.