mismall - Streaming Huffman Compression Library
A sophisticated Rust library for file compression and decompression built around canonical Huffman coding with streaming architecture. Designed to handle arbitrarily large files with bounded memory usage and optional AES-256-GCM encryption.
🚀 Library Quick Start
Add this to your Cargo.toml:
```toml
[dependencies]
mismall = "2.0"
```
Highlights
- Streaming Architecture: Bounded memory usage (16MB default) with chunked I/O for unlimited file size support
- AES-256-GCM Encryption: Optional password-based encryption with authenticated data integrity
- Archive Support: Pack multiple files into single `.small` containers with metadata
- Memory Efficient: Uses temporary files for intermediate processing, never loads entire files into RAM
- Raw-Store Heuristic: Automatically stores uncompressed data if compression would expand file size
- Configurable Chunk Sizes: Users can adjust memory usage from 64KB to 1GB+ with the `--chunk-size` flag
- Deterministic Output: Lossless round-trip verified with SHA-256 during processing
Basic Library Usage
```rust
use mismall::{compress_stream, decompress_stream};
use std::io::Cursor;
```
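Building on those imports, a minimal round-trip sketch. The signatures of `compress_stream` and `decompress_stream` (reader in, writer out, `Result` back) are assumptions for illustration, not the crate's verbatim API:

```rust
fn roundtrip() -> Result<(), Box<dyn std::error::Error>> {
    let original = b"hello, streaming huffman!".to_vec();

    // Compress from an in-memory reader into an in-memory writer.
    // Assumed shape: compress_stream(&mut impl Read, &mut impl Write) -> Result<_, MismallError>.
    let mut compressed = Vec::new();
    compress_stream(&mut Cursor::new(&original), &mut compressed)?;

    // Decompress and confirm the round trip is lossless.
    let mut restored = Vec::new();
    decompress_stream(&mut Cursor::new(&compressed), &mut restored)?;
    assert_eq!(original, restored);
    Ok(())
}
```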
📦 Feature Flags
- `compression` (default): Compression and decompression functionality
- `archives` (default): Multi-file archive operations
- `encryption` (default): AES-256-GCM encryption support
- `cli`: Command-line interface (enables all other features)
```toml
[dependencies]
mismall = { version = "2.0", default-features = false, features = ["compression", "encryption"] }
```
🎯 Core Library APIs
Simple API
- `compress_stream()` - Compress data streams with custom settings
- `decompress_stream()` - Decompress data streams with custom settings
Builder API
- `CompressionBuilder` - Advanced compression with options (see the sketch below)
- `DecompressionBuilder` - Advanced decompression with options
- `ArchiveBuilder` - Create multi-file archives
- `ArchiveExtractor` - Extract from archives with options
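A sketch of driving the builder types. Only the type names above come from the crate; method names like `chunk_size`, `password`, and `compress` are hypothetical, shown to illustrate the pattern:

```rust
use std::fs::File;
use mismall::CompressionBuilder;

fn compress_file() -> Result<(), Box<dyn std::error::Error>> {
    let mut input = File::open("notes.txt")?;
    let mut output = File::create("notes.small")?;

    // Hypothetical builder methods, for illustration only.
    CompressionBuilder::new()
        .chunk_size(1024 * 1024)   // bound memory at ~1 MiB per chunk
        .password("s3cret")        // opt in to AES-256-GCM encryption
        .compress(&mut input, &mut output)?;
    Ok(())
}
```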
Streaming API
- `stream_reader()` - Read from compressed streams
- `stream_writer()` - Write to compressed streams (see the sketch below)
- `Compressor` - Stateful streaming compression
- `Decompressor` - Stateful streaming decompression
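And a sketch of the adapter style the streaming API suggests; the exact shapes of `stream_writer` and `stream_reader` are assumptions:

```rust
use std::io::{Read, Write};
use mismall::{stream_reader, stream_writer};

fn stream_roundtrip() -> std::io::Result<String> {
    // Assumed: stream_writer wraps any Write and compresses bytes on the fly.
    let mut compressed = Vec::new();
    {
        let mut writer = stream_writer(&mut compressed);
        writer.write_all(b"streamed through the compressor")?;
        writer.flush()?; // flush bit-level buffers before dropping
    }

    // Assumed: stream_reader wraps any Read and decompresses on the fly.
    let mut restored = String::new();
    stream_reader(compressed.as_slice()).read_to_string(&mut restored)?;
    Ok(restored)
}
```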
🛠️ Library Examples
The examples/ directory contains comprehensive library examples:
- `simple_compress.rs` - Basic compression and decompression
- `advanced_compression.rs` - Compression with encryption and custom settings
- `archive_operations.rs` - Multi-file archive creation and extraction
- `streaming.rs` - Real-time streaming compression/decompression
- `performance.rs` - Performance comparison and benchmarks
Run examples with:
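```sh
cargo run --example simple_compress
cargo run --example streaming
```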
📈 Performance Tips
For comprehensive performance optimization guidance, see PERFORMANCE.md:
- Memory usage optimization for different system configurations
- Chunk size selection strategies
- Data type-specific recommendations
- Encryption performance considerations
- Streaming best practices
- Benchmarking templates
- Common pitfalls to avoid
🔧 Error Handling
All library functions return `Result<T, MismallError>`, where `MismallError` provides detailed error information with context for troubleshooting:

```rust
// Sketch: handling a compression failure. Assumes MismallError implements
// Display (reasonable for an error type, but not verified here).
match compress_stream(&mut reader, &mut writer) {
    Ok(_) => println!("compression succeeded"),
    Err(e) => eprintln!("compression failed: {e}"),
}
```
CLI Tool Usage
The mismall library also includes a command-line interface. Install and use as follows:
Install
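Assuming the crate is published to crates.io, Cargo's standard install command works; note that the `cli` feature is non-default, per the feature flags above:

```sh
cargo install mismall --features cli
```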
Single File Operations
- Compress (with optional encryption and ratio display):
  - If `OUTPUT_BASENAME` is omitted: output is `<INPUT>.small`
  - If provided: output is `<OUTPUT_BASENAME>.small`
  - `--chunk-size`: Memory usage (default 16MB, min 64KB recommended)
  - `-p`: Optional password for AES-256-GCM encryption
- Decompress:
  - If `OUTPUT_NAME` is omitted: restores original filename from header
  - `--chunk-size`: Memory usage for decryption operations
Archive Operations
- Create archive from directory
- List archive contents
- Extract from archive

Illustrative invocations for all three appear in the Examples section below.
Memory Usage Guidelines
- Low memory systems (1GB RAM): `--chunk-size 65536` (64KB)
- Standard systems (8GB+ RAM): Default 16MB (16,777,216 bytes)
- High-memory systems (32GB+ RAM): `--chunk-size 1073741824` (1GB)
How it works
- Pass 1: Stream input file in configurable chunks to compute symbol frequencies and checksum (sketched after this list)
- Codebook Generation: Build canonical Huffman tree and generate optimal code table
- Pass 2: Stream input again, encoding data using bit-level packing with 4KB buffers
- Encryption (optional): Apply AES-256-GCM with chunked processing and per-chunk authentication
- Archive Creation: Combine multiple compressed files with metadata into single container
- Decoding: Reverse process with streaming decryption and bit-level expansion
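As an illustration of Pass 1, a chunked, bounded-memory frequency count can be written with nothing but the standard library. This is a sketch of the described design, not the library's internal code:

```rust
use std::io::{self, Read};

/// Stream `reader` in fixed-size chunks and tally byte frequencies,
/// keeping memory bounded by `chunk_size` regardless of input length.
fn byte_frequencies<R: Read>(mut reader: R, chunk_size: usize) -> io::Result<[u64; 256]> {
    let mut freqs = [0u64; 256];
    let mut buf = vec![0u8; chunk_size];
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // EOF: frequencies (and, in the real pass, the checksum) are complete
        }
        for &b in &buf[..n] {
            freqs[b as usize] += 1;
        }
    }
    Ok(freqs)
}
```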
Performance Characteristics
Memory Usage
- Bounded: Maximum memory usage = `chunk-size` + small overhead (~50KB)
- Scalable: Handles arbitrarily large files with constant memory footprint
- Temporary Storage: Uses OS temporary files for intermediate processing
Compression Performance
- Text Files: 20-35% size reduction, linear time complexity
- Source Code: 25-40% size reduction, fast encoding/decoding
- Already-Compressed Media: Stored raw (no expansion), minimal overhead
Encryption Performance
- AES-256-GCM: Hardware-accelerated on modern CPUs
- Per-Chunk Authentication: Detects corruption early in the stream
- Password-Based Security: PBKDF2 key derivation with a random salt (sketched below)
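For orientation, here is how a derive-then-seal step looks with the RustCrypto `pbkdf2` and `aes-gcm` crates. This is a generic sketch: the iteration count, salt handling, nonce layout, and framing are illustrative, not mismall's on-disk format:

```rust
use aes_gcm::{
    aead::{Aead, KeyInit},
    Aes256Gcm, Nonce,
};
use pbkdf2::pbkdf2_hmac;
use sha2::Sha256;

/// Derive a 256-bit key from a password and random salt, then seal one chunk
/// with AES-256-GCM (which appends a 16-byte authentication tag).
fn seal_chunk(password: &[u8], salt: &[u8; 16], chunk: &[u8]) -> Vec<u8> {
    let mut key = [0u8; 32];
    // Iteration count is illustrative, not mismall's actual parameter.
    pbkdf2_hmac::<Sha256>(password, salt, 100_000, &mut key);

    let cipher = Aes256Gcm::new(&key.into());
    let nonce_bytes = [0u8; 12]; // in practice: unique per chunk, never reused
    let nonce = Nonce::from_slice(&nonce_bytes);
    cipher.encrypt(nonce, chunk).expect("AES-GCM encryption failed")
}
```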
Performance Snapshot (Intel i7, 16GB RAM)
Text / Structured Data
- HTML (~4.5 MiB): ratio 73% (to 3.3 MiB); encode 92 ms, decode 80 ms
- Source file (~4.4 KiB): ratio 63% (to 2.8 KiB); sub-millisecond times
Small / Medium Binaries
- Binary (~5.5 MiB): ratio 82% (to 4.5 MiB); encode 108 ms, decode 99 ms
- Binary (~82 MiB): ratio 80% (to 65 MiB); encode 1.6 s, decode 1.46 s
Archive Operations
- Multi-file archive: Linear scaling with total compressed size
- Extraction: Constant time per file, regardless of archive size
- Encryption overhead: ~16 bytes per 16MB chunk + 28 bytes header
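For example, a 1 GiB file at the default 16 MiB chunk size is 64 chunks, so encryption adds 64 × 16 = 1,024 bytes of authentication tags plus the 28-byte header, roughly 1 KiB in total.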
Encryption Performance
- AES-256-GCM: ~500 MB/s on modern CPUs with hardware acceleration
- Memory overhead: Configurable chunk size (default 16MB)
- Authentication: Per-chunk tags enable early corruption detection
Integrity
- All tested files round-tripped successfully under SHA-256 verification
- Chunk-level authentication: Detects corruption during streaming
- Memory bounds: No buffer overflows or integer overflows in 66 tests
Limitations
- Streaming I/O Required: Not designed for in-memory only operations (feature, not bug)
- Huffman-Only Compression: Less effective on already-compressed media than DEFLATE/LZ77
- No Parallel Processing: Single-threaded for simplicity and determinism
Testing
Mismall ships with a comprehensive test suite (66 tests) covering:
- Core Logic: Huffman encoding/decoding with streaming architecture
- Cryptographic Operations: Key derivation, encryption, decryption, authentication
- Archive Management: Multi-file operations and metadata handling
- Error Handling: Corrupted data, wrong passwords, edge cases
- Memory Safety: Bounded memory usage under all conditions
- I/O Operations: Bit-level reading/writing with proper padding
- Integration: End-to-end compress/decompress/extract workflows
Run all tests with:
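```sh
cargo test
```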
Examples
The invocations below are illustrative: the subcommand names are hypothetical, while `--chunk-size`, `-p`, and the `.small` naming rules match the documentation above.

```sh
# Basic compression with ratio
mismall compress notes.txt
# Compress with encryption and custom chunk size
mismall compress notes.txt -p "s3cret" --chunk-size 1048576
# Decompress with password
mismall decompress notes.txt.small -p "s3cret"
# Create archive from directory
mismall pack ./project project.small
# Extract specific file from archive
mismall extract project.small src/main.rs
# List archive contents
mismall list project.small
```
License
MIT — do whatever you want, just don't claim you wrote it.
🔧 Legacy CLI (Version 1.0.0)
The original hand-crafted CLI implementation remains available as the legacy version.
Access Legacy CLI
Option A: Checkout Directly
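From a local clone, check out the last hand-written commit (hash from the Development History below):

```sh
git checkout f44054c
```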
Option B: Version Pinning
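A sketch for Cargo.toml; the repository URL is a placeholder and the pinning style is an assumption:

```toml
[dependencies]
mismall = { git = "<repo-url>", rev = "f44054c" }  # pin to the legacy commit
```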
Option C: Use legacy-cli Branch
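From a local clone, switch to the dedicated branch:

```sh
git checkout legacy-cli
```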
Repository Structure
- Main Branch: Shows the original hand-crafted CLI work (commit `f44054c`)
- AI Branch: Modern library transformation (`ai-library-transformation`)
- Cargo Integration: Points to the AI branch automatically via Cargo.toml
This means:
- GitHub visitors see the original CLI work first
- Cargo users get the modern library automatically
- Legacy access remains available through branches/commits
Development History
- Original Implementation: Hand-crafted CLI by Josiah Morris (up to commit f44054c)
- Library Transformation: AI-assisted development (OpenAI/opencode) transforming the CLI into a production-ready library
- Current State: Both versions accessible, library as primary focus
The transformation preserved all original concepts while adding comprehensive library capabilities.