Byte Punch - Profile-Aware CML Compression
Byte Punch is a lossless compression library designed specifically for CML (Content Markup Language) documents. It uses profile-aware semantic tokenization to achieve 40-70% compression ratios while maintaining 100% fidelity.
Overview
Byte Punch provides:
- 🗜️ 40-70% compression - Predictable, profile-specific compression
- ✅ 100% fidelity - Perfect round-trip: compress → decompress → identical
- 🚀 Fast - <100ms for 1MB documents
- 📋 Profile-aware - Dictionaries tuned for each CML profile
- 🔄 Bi-directional - Same dictionary for compression and decompression
- 📝 Git-friendly - Deterministic output, meaningful diffs
Quick Start
use ;
// Load profile dictionary
let dict = from_file?;
// Create compressor
let compressor = new;
// Compress
let cml_xml = r#"<cml version="0.1" encoding="utf-8" profile="code:api">...</cml>"#;
let compressed = compressor.compress?;
println!;
println!;
println!;
// Decompress
let decompressor = new;
let decompressed = decompressor.decompress?;
assert_eq!(cml_xml, decompressed); // Perfect fidelity!
How It Works
Byte Punch uses a multi-level tokenization strategy:
1. Token Levels
| Encoding | Token Size | Capacity | Use Case |
|---|---|---|---|
| UTF-8 | 2-byte | 256 tokens | Core elements, common attributes |
| UTF-16 | 4-byte | 65,536 tokens | Profile-specific elements |
| UTF-32 | 8-byte | 4B tokens | Common phrases, boilerplate |
2. Compression Process
Input CML → Tokenizer → Replace with tokens → Compressed binary
    |                                                |
    v                                                v
<cml version="0.1">                         [0xF001][0xF002]...
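The pipeline above can be illustrated with a toy substitution codec. This is only a sketch under stated assumptions, not Byte Punch's actual implementation: the function names, the `ESCAPE` marker value and the greedy longest-match strategy are all hypothetical. Each token is two bytes (a marker byte plus a one-byte id), which matches the 256-entry capacity of the first row in the token table. The marker 0xFF is chosen here because that byte never occurs in valid UTF-8, so tokens cannot be confused with document text.

```rust
/// Marker byte that introduces a token (assumption: 0xFF, which never
/// appears in valid UTF-8 text).
pub const ESCAPE: u8 = 0xFF;

/// Replace dictionary entries with two-byte tokens `[ESCAPE, id]`.
/// The position of an entry in `dict` is its token id (hypothetical layout).
pub fn compress(input: &str, dict: &[&str]) -> Vec<u8> {
    assert!(dict.len() <= 256, "one-byte token ids only");
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        // Pick the longest dictionary entry that matches at position i.
        let best = dict
            .iter()
            .enumerate()
            .filter(|(_, e)| !e.is_empty() && bytes[i..].starts_with(e.as_bytes()))
            .max_by_key(|(_, e)| e.len());
        match best {
            Some((id, entry)) => {
                out.push(ESCAPE);
                out.push(id as u8);
                i += entry.len();
            }
            None => {
                // No entry matches: copy the byte through unchanged.
                out.push(bytes[i]);
                i += 1;
            }
        }
    }
    out
}

/// Expand tokens back into text. Returns `None` on a truncated token,
/// an unknown token id, or output that is not valid UTF-8.
pub fn decompress(data: &[u8], dict: &[&str]) -> Option<String> {
    let mut out = Vec::with_capacity(data.len());
    let mut i = 0;
    while i < data.len() {
        if data[i] == ESCAPE {
            let id = *data.get(i + 1)? as usize;
            out.extend_from_slice(dict.get(id)?.as_bytes());
            i += 2;
        } else {
            out.push(data[i]);
            i += 1;
        }
    }
    String::from_utf8(out).ok()
}
```

Because the same slice drives both directions, compressing and then decompressing with an unchanged dictionary returns the original text byte for byte.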
3. Dictionary Structure
Each profile has a JSON dictionary that defines its token mappings.
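The dictionary schema itself is not reproduced here, so the following is only a sketch of the in-memory shape such a mapping might take once loaded. `ToyDictionary` and its methods are hypothetical names, not the crate's `Dictionary` API. It keeps one table per direction so the same entries serve compression (text to token) and decompression (token to text), and it rejects duplicate entries so the mapping stays unambiguous.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory dictionary with lookups in both directions.
#[derive(Debug)]
pub struct ToyDictionary {
    by_text: HashMap<String, u16>, // text -> token id (used when compressing)
    by_id: Vec<String>,            // token id -> text (used when decompressing)
}

impl ToyDictionary {
    /// Build a dictionary; an entry's position becomes its token id.
    pub fn new(entries: &[&str]) -> Result<Self, String> {
        if entries.len() > u16::MAX as usize + 1 {
            return Err("too many entries for 16-bit token ids".to_string());
        }
        let mut by_text = HashMap::new();
        let mut by_id = Vec::with_capacity(entries.len());
        for (id, text) in entries.iter().enumerate() {
            // A duplicate would make the text -> id direction ambiguous.
            if by_text.insert(text.to_string(), id as u16).is_some() {
                return Err(format!("duplicate entry: {}", text));
            }
            by_id.push(text.to_string());
        }
        Ok(Self { by_text, by_id })
    }

    /// Token id for an exact text match, if the entry exists.
    pub fn token_for(&self, text: &str) -> Option<u16> {
        self.by_text.get(text).copied()
    }

    /// Original text for a token id, if the id is in range.
    pub fn text_for(&self, id: u16) -> Option<&str> {
        self.by_id.get(id as usize).map(String::as_str)
    }
}
```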
Profiles and Compression Ratios
code:api
Target: 50-60% compression
Optimized for:
- Rust keywords: pub, fn, struct, enum, impl
- Common types: Vec, String, Option, Result
- Documentation patterns: "Returns", "Panics", "Examples"
Example:
let dict = from_file?;
let compressor = new;
let cml = r#"
<cml version="0.1" encoding="utf-8" profile="code:api">
<body>
<struct id="std.vec.Vec" name="Vec">
<method name="push">
<signature>pub fn push(&mut self, value: T)</signature>
</method>
</struct>
</body>
</cml>
"#;
let compressed = compressor.compress?;
// Typical ratio: 50-60%
legal:constitution
Target: 60-70% compression
Optimized for:
- Legal terms: "article", "section", "clause", "amendment"
- Common phrases: "Congress shall", "United States", "shall not"
- Metadata: "ratified", "num", "id"
Example:
let dict = from_file?;
let compressor = new;
let cml = r#"
<cml version="0.1" encoding="utf-8" profile="legal:constitution">
<body>
<article num="I" title="Legislative Branch" id="article-1">
<section num="1" id="article-1-section-1">
<clause num="1" id="article-1-section-1-clause-1">
All legislative Powers herein granted shall be vested in a Congress...
</clause>
</section>
</article>
</body>
</cml>
"#;
let compressed = compressor.compress?;
// Typical ratio: 60-70%
bookstack:wiki
Target: 50-60% compression
Optimized for:
- Wiki structure: "book", "chapter", "page", "shelf"
- Content formats: "markdown", "html"
- Common patterns: "# ", "## ", "```rust"
Example:
let dict = from_file?;
let compressor = new;
let cml = r#"
<cml version="0.1" encoding="utf-8" profile="bookstack:wiki">
<body>
<book id="book-1" title="Rust Guide">
<chapter id="ch-1" title="Getting Started">
<page id="page-1" title="Setup">
<content format="markdown"><![CDATA[
# Development Environment Setup
...
]]></content>
</page>
</chapter>
</book>
</body>
</cml>
"#;
let compressed = compressor.compress?;
// Typical ratio: 50-60%
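To check a profile against its target, it helps to compute the numbers directly from the input and output sizes. The helpers below are a minimal sketch with hypothetical names. They report both the size ratio (compressed divided by original) and the percentage saved, since a figure such as "50-60%" could be read either way.

```rust
/// Compressed size divided by original size (smaller is better).
/// Returns `None` for an empty original, where the ratio is undefined.
pub fn size_ratio(original_len: usize, compressed_len: usize) -> Option<f64> {
    if original_len == 0 {
        return None;
    }
    Some(compressed_len as f64 / original_len as f64)
}

/// Percentage of the original size that compression removed.
pub fn savings_percent(original_len: usize, compressed_len: usize) -> Option<f64> {
    size_ratio(original_len, compressed_len).map(|r| (1.0 - r) * 100.0)
}
```

For example, a 1,000-byte document that compresses to 400 bytes has a size ratio of 0.4 and a saving of 60%.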
Compression Format
┌─────────────────────────────────────────────────────┐
│ Header (70 bytes)                                   │
├─────────────────────────────────────────────────────┤
│ Magic: 0x42 0x50 0x43 0x4D (BPCM) │
│ Version: 0x01 0x00 │
│ Profile: UTF-8 string (32 bytes) │
│ Dictionary Hash: SHA-256 (32 bytes) │
├─────────────────────────────────────────────────────┤
│ Token Table (variable) │
├─────────────────────────────────────────────────────┤
│ Entry count: u32 │
│ Entries: [token_id, original_length, token_value] │
├─────────────────────────────────────────────────────┤
│ Compressed Content (variable) │
└─────────────────────────────────────────────────────┘
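As an illustration of the header layout, the sketch below writes and parses the listed fields in order. It is not the crate's own serializer. The function names and error variants are hypothetical, the profile is assumed to be zero-padded to 32 bytes, and the sketch simply concatenates the fields as listed (4 + 2 + 32 + 32 bytes) without modeling any extra padding or alignment. The dictionary hash is passed in as raw bytes because computing SHA-256 needs an external crate.

```rust
pub const MAGIC: [u8; 4] = [0x42, 0x50, 0x43, 0x4D]; // "BPCM"
pub const VERSION: [u8; 2] = [0x01, 0x00];
const PROFILE_LEN: usize = 32;
const HASH_LEN: usize = 32;
pub const HEADER_LEN: usize = 4 + 2 + PROFILE_LEN + HASH_LEN;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum HeaderError {
    Truncated,
    BadMagic,
    BadVersion,
    BadProfile,
}

#[derive(Debug, PartialEq)]
pub struct Header {
    pub profile: String,
    pub dict_hash: [u8; HASH_LEN],
}

/// Serialize the header fields in the order shown in the diagram.
pub fn write_header(profile: &str, dict_hash: &[u8; HASH_LEN]) -> Result<Vec<u8>, HeaderError> {
    if profile.len() > PROFILE_LEN || profile.contains('\0') {
        return Err(HeaderError::BadProfile);
    }
    let mut out = Vec::with_capacity(HEADER_LEN);
    out.extend_from_slice(&MAGIC);
    out.extend_from_slice(&VERSION);
    out.extend_from_slice(profile.as_bytes());
    out.resize(4 + 2 + PROFILE_LEN, 0); // zero-pad the profile field (assumed)
    out.extend_from_slice(dict_hash);
    Ok(out)
}

/// Parse and validate a header from the start of `data`.
pub fn parse_header(data: &[u8]) -> Result<Header, HeaderError> {
    if data.len() < HEADER_LEN {
        return Err(HeaderError::Truncated);
    }
    if data[..4] != MAGIC {
        return Err(HeaderError::BadMagic);
    }
    if data[4..6] != VERSION {
        return Err(HeaderError::BadVersion);
    }
    let raw = &data[6..6 + PROFILE_LEN];
    // The profile ends at the first padding byte, or fills the whole field.
    let end = raw.iter().position(|&b| b == 0).unwrap_or(PROFILE_LEN);
    let profile = std::str::from_utf8(&raw[..end])
        .map_err(|_| HeaderError::BadProfile)?
        .to_string();
    let mut dict_hash = [0u8; HASH_LEN];
    dict_hash.copy_from_slice(&data[6 + PROFILE_LEN..HEADER_LEN]);
    Ok(Header { profile, dict_hash })
}
```

Checking the magic bytes and the dictionary hash before decoding lets a reader reject a file that was produced with a different dictionary instead of silently emitting wrong text.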
API Reference
Compressor
Decompressor
Dictionary
Error Handling
use ;
// Compression errors
match compressor.compress
// Decompression errors
match decompressor.decompress
Testing
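# Run all tests
cargo test
# Run with output
cargo test -- --nocapture
# Run specific test
cargo test <test_name>
# Run integration tests
cargo test --test <integration_test_name>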
Test Coverage:
- 22/22 tests passing ✅
- Unit tests for compression/decompression
- Dictionary loading and validation
- Round-trip tests for all profiles
- Integration tests with CML documents
Benchmarks
# Run benchmarks
cargo bench
# Example results on modern CPU:
# compress_1mb_code: ~80ms
# decompress_1mb_code: ~60ms
# compress_1mb_legal: ~75ms
# decompress_1mb_legal: ~55ms
Creating Custom Dictionaries
- Analyze your content:
# Count common words in your CML documents
- Create dictionary JSON:
- Test compression:
let dict = from_file?;
let compressor = new;
let stats = compressor.stats;
println!;
- Iterate to optimize:
Adjust token assignments based on compression stats.
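The analysis step can also be done in Rust. The function below is a rough sketch with a hypothetical name: it splits text on non-alphanumeric characters, counts each word, and returns the most frequent ones as candidates for dictionary entries. Real CML may call for a smarter tokenizer (for example, one that keeps whole tags or attribute names together), so treat the output as a starting point.

```rust
use std::collections::HashMap;

/// Return up to `limit` words of at least `min_len` bytes, most frequent first.
/// Ties are broken alphabetically so the result is deterministic.
pub fn top_candidates(text: &str, min_len: usize, limit: usize) -> Vec<(String, usize)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for word in text
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty() && w.len() >= min_len)
    {
        *counts.entry(word).or_insert(0) += 1;
    }
    let mut ranked: Vec<(String, usize)> = counts
        .into_iter()
        .map(|(word, n)| (word.to_string(), n))
        .collect();
    // Sort by count (descending), then by word (ascending).
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked.truncate(limit);
    ranked
}
```

Long, frequent words save the most bytes when replaced by a short token, so they are usually the first candidates to add.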
Integration with CML
Byte Punch is designed to work seamlessly with SAM CML:
use ;
use ;
// Parse CML
let cml = parse_cml?;
// Generate XML
let generator = CmlGenerator;
let xml = generator.generate_cml?;
// Compress
let dict = from_file?;
let compressor = new;
let compressed = compressor.compress?;
// Store compressed binary
write?;
Properties
Deterministic
The same input and dictionary always produce the same output:
let compressed1 = compressor.compress?;
let compressed2 = compressor.compress?;
assert_eq!(compressed1, compressed2);
Editable
Decompress, edit, recompress:
let decompressed = decompressor.decompress?;
let edited = decompressed.replace;
let recompressed = compressor.compress?;
Git-Friendly
Dictionaries are versioned and diffs are meaningful:
# Dictionary changes are tracked
# Compressed files can be versioned
Performance Tips
- Reuse compressor instances:
let compressor = new; // Load once
for file in files
- Batch compress:
let files: = load_files?;
let compressed: = files
.par_iter // Use rayon for parallelism
.map
.?;
- Monitor compression ratios:
let stats = compressor.stats;
if stats.ratio > 0.7
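The batch tip can be tried without extra dependencies. The sketch below uses `std::thread::scope` from the standard library as a stand-in for rayon, and takes the compression step as a closure so that it does not assume anything about the crate's `Compressor` API. The function name and the `workers` parameter are hypothetical.

```rust
use std::thread;

/// Compress every document using up to `workers` threads.
/// Outputs are returned in the same order as the inputs.
pub fn compress_batch<F>(docs: &[String], workers: usize, compress: F) -> Vec<Vec<u8>>
where
    F: Fn(&str) -> Vec<u8> + Sync,
{
    let workers = workers.max(1);
    // Ceiling division so every document lands in exactly one chunk.
    let chunk_len = ((docs.len() + workers - 1) / workers).max(1);
    let compress = &compress;
    thread::scope(|scope| {
        let handles: Vec<_> = docs
            .chunks(chunk_len)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .map(|doc| compress(doc.as_str()))
                        .collect::<Vec<Vec<u8>>>()
                })
            })
            .collect();
        // Joining in spawn order keeps outputs aligned with inputs.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect::<Vec<Vec<u8>>>()
    })
}
```

Sharing one compression closure across threads mirrors the first tip: the dictionary is loaded once and reused for every file rather than rebuilt per document.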
Related Projects
- sam-cml - CML document library (sister crate)
- sam-engram - Engram packaging with Byte Punch compression
Documentation
- MASTER_PLAN.md - Implementation plan
- STATUS.md - Project status
- Dictionaries - Profile dictionaries
License
MIT OR Apache-2.0