# lzxc
LZX compression **encoder** in pure Rust. Produces byte streams that
round-trip through the [`lzxd`](https://crates.io/crates/lzxd) decoder.
Targets the LZX variant used by Xbox 360 XEX2 executables, CAB archives,
CHM help files, and Microsoft Patch (MS-PATCH) files.
## What's in this crate
- **`Encoder`** -- low-level chunk-at-a-time primitive. Feed up to 32 KB
of input per call, get back one compressed chunk. Use this when you
need custom chunk framing.
- **`EncoderWriter<W: Write>`** -- high-level `std::io::Write` sink.
Buffers input into 32 KB slabs, compresses each, forwards as
`u16 BE chunk_size | chunk_bytes` frames.
- **Canonical Huffman** with 16-bit length cap (falls back to
length-limited package-merge when the classical merge overflows).
- **Pretree delta+RLE** tree transmission with proactive symbol-19
(same-run) emission.
- **Stateful hash-chain match finder** that carries window history across
chunk boundaries so matches can span them.
- **E8 preprocessing** for x86 call-target translation (off by default;
not useful for Xbox 360 PPC payloads).
## Usage
```rust,no_run
use lzxc::{EncoderWriter, WindowSize};
use std::io::Write;
let mut out = Vec::<u8>::new();
let mut enc = EncoderWriter::new(&mut out, WindowSize::KB64);
enc.write_all(b"hello world, hello world, hello world!").unwrap();
enc.finish().unwrap();
// `out` now holds `u16 BE size | compressed chunk` frames back-to-back.
```
For raw chunk-at-a-time output (e.g. custom framing):
```rust,no_run
use lzxc::{Encoder, WindowSize, MAX_CHUNK_SIZE};
let input: &[u8] = b"...";
let mut enc = Encoder::new(WindowSize::KB64);
for chunk in input.chunks(MAX_CHUNK_SIZE) {
let compressed = enc.encode_chunk(chunk);
// compressed[..] is decodable by lzxd::Lzxd::decompress_next.
}
```
## Benchmarks
Measured with `cargo bench -p lzxc` on an Apple M4 Max, single thread. All
corpora are generated deterministically inside the bench binary, so results
reproduce on any host without shipping real executables.
Corpora:
- `text-256k`: 256 KB of English-ish prose built word-at-a-time from a
~200-word vocabulary via LCG. Byte distribution and bigram frequencies
look like natural language; no phrase-level periodicity.
- `text-256k-pathological`: 256 KB of a single 107-byte phrase repeated
end-to-end. Kept as a sanity-check ceiling on what LZ77+Huffman can do
on a perfectly periodic signal — it is **not** representative of
realistic text.
- `structured-1m`: 1 MB mix of zero-runs, a repeating 64-byte "jump table"
template, biased opcode-like bytes, and LCG random. Roughly mirrors the
compressibility profile of a typical executable without impersonating
any specific one.
- `random-256k`: 256 KB of LCG output. Worst case for LZ matching.
Compression ratio is `input_len / compressed_len` (higher is better).
### Strategy comparison (64 KB window)
| Greedy | text-256k | 2.50x | 94 MiB/s |
| Greedy | text-256k-pathological | 271.93x | 2.74 GiB/s |
| Greedy | structured-1m | 2.90x | 107 MiB/s |
| Greedy | random-256k | 1.00x | 57 MiB/s |
| LiteralOnly | text-256k | 1.92x | 256 MiB/s |
| LiteralOnly | text-256k-pathological | 1.78x | 254 MiB/s |
| LiteralOnly | structured-1m | 1.39x | 313 MiB/s |
| LiteralOnly | random-256k | 1.00x | 260 MiB/s |
| Uncompressed | text-256k | 1.00x | 38.2 GiB/s |
| Uncompressed | structured-1m | 1.00x | 45.6 GiB/s |
| Uncompressed | random-256k | 1.00x | 37.0 GiB/s |
Notes:
- `Greedy` on realistic English prose (`text-256k`) lands at 2.50x, within
the 2-3x range typical of LZ77+Huffman on natural language.
- `Greedy` on binary-shaped data (`structured-1m`) reaches 2.90x; this is
the workload `lzxc` was tuned for.
- `text-256k-pathological` — one 107-byte phrase repeated for 256 KB —
shows the theoretical ceiling: 272x because every match collapses to
the same main-tree symbol. Useful as a bound, not a headline.
- Random input triggers the encoder's uncompressed-block fallback inside
`Greedy`, and compresses to 1.00x under every strategy.
- `Uncompressed` is essentially `memcpy` with a 28-byte block header per
32 KB chunk.
### Window-size sweep (Greedy)
Window controls how far back matches can reach. Ratio plateaus quickly on
these corpora because the dominant match distances are well under 32 KB;
larger windows then pay throughput for bigger hash-chain / history
buffers that don't find new work.
| 32 KB | 2.51x / 97 MiB/s | 2.90x / 146 MiB/s |
| 64 KB | 2.50x / 94 MiB/s | 2.90x / 107 MiB/s |
| 128 KB | 2.50x / 96 MiB/s | 2.90x / 99 MiB/s |
| 512 KB | 2.51x / 96 MiB/s | 2.90x / 85 MiB/s |
| 1 MB | 2.51x / 95 MiB/s | 2.90x / 91 MiB/s |
| 2 MB | 2.51x / 93 MiB/s | 2.90x / 91 MiB/s |
Payloads with long-range repetition (e.g. multi-megabyte binaries with
duplicated data sections) are where larger windows earn their cost; see
[`WindowSize`](https://docs.rs/lzxc/latest/lzxc/enum.WindowSize.html) for
the full list.
To reproduce:
```
cargo bench -p lzxc
```
The bench binary also prints one `ratio: ...` line per corpus/strategy
pair to stderr so the numbers can be regenerated without opening the HTML
reports.
## Limitations
### 4 GiB input cap per encoder
A single `Encoder` / `EncoderWriter` can compress **up to 4 GiB of input**
before its match finder silently degrades to literal-only output. Internal
hash-chain positions are stored as `u32`, so once total bytes fed through
one encoder exceed `u32::MAX`, the stale-entry guard rejects every match
candidate. The output stream remains valid LZX (it decodes correctly via
`lzxd`), but compression ratio collapses to ~1.0x for all bytes past the
4 GiB mark.
In practice XEX payloads and other LZX targets are orders of magnitude
below this bound. If you need to compress larger streams, segment the
input and construct a fresh `Encoder` per segment.
## License
Licensed under either of Apache License, Version 2.0 or MIT license at
your option.