mismall - Streaming Huffman Compression Library
A sophisticated Rust library for file compression and decompression built around canonical Huffman coding with streaming architecture. Designed to handle arbitrarily large files with bounded memory usage and optional AES-256-GCM encryption.
🚀 Library Quick Start
Add this to your Cargo.toml:
```toml
[dependencies]
mismall = "2.0"
```
Highlights
- Streaming Architecture: Bounded memory usage (16MB default) with chunked I/O for unlimited file size support
- AES-256-GCM Encryption: Optional password-based encryption with authenticated data integrity
- Archive Support: Pack multiple files into single `.small` containers with metadata
- Memory Efficient: Uses temporary files for intermediate processing, never loads entire files into RAM
- Raw-Store Heuristic: Automatically stores uncompressed data if compression would expand file size
- Configurable Chunk Sizes: Users can adjust memory usage from 64KB to 1GB+ with the `--chunk-size` flag
- Deterministic Output: Lossless round-trip verified with SHA-256 during processing
Basic Library Usage
```rust
use mismall::{compress_stream, decompress_stream};
use std::io::Cursor;
```
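Building on those imports, a minimal round-trip sketch. The signatures of `compress_stream` and `decompress_stream` (reader in, writer out, `Result` back) are assumptions for illustration, not the crate's verbatim API:

```rust
fn roundtrip() -> Result<(), Box<dyn std::error::Error>> {
    let original = b"hello, streaming huffman!".to_vec();

    // Compress from an in-memory reader into an in-memory writer.
    // Assumed shape: compress_stream(&mut impl Read, &mut impl Write) -> Result<_, MismallError>.
    let mut compressed = Vec::new();
    compress_stream(&mut Cursor::new(&original), &mut compressed)?;

    // Decompress and confirm the round trip is lossless.
    let mut restored = Vec::new();
    decompress_stream(&mut Cursor::new(&compressed), &mut restored)?;
    assert_eq!(original, restored);
    Ok(())
}
```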
📦 Feature Flags
- `compression` (default): Compression and decompression functionality
- `archives` (default): Multi-file archive operations
- `encryption` (default): AES-256-GCM encryption support
- `cli`: Command-line interface (enables all other features)
```toml
[dependencies]
mismall = { version = "2.0", default-features = false, features = ["compression", "encryption"] }
```
🎯 Core Library APIs
Simple API
- `compress_stream()` - Compress data streams with custom settings
- `decompress_stream()` - Decompress data streams with custom settings
Builder API
- `CompressionBuilder` - Advanced compression with options (see the sketch below)
- `DecompressionBuilder` - Advanced decompression with options
- `ArchiveBuilder` - Create multi-file archives
- `ArchiveExtractor` - Extract from archives with options
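A sketch of driving the builder types. Only the type names above come from the crate; method names like `chunk_size`, `password`, and `compress` are hypothetical, shown to illustrate the pattern:

```rust
use std::fs::File;
use mismall::CompressionBuilder;

fn compress_file() -> Result<(), Box<dyn std::error::Error>> {
    let mut input = File::open("notes.txt")?;
    let mut output = File::create("notes.small")?;

    // Hypothetical builder methods, for illustration only.
    CompressionBuilder::new()
        .chunk_size(1024 * 1024)   // bound memory at ~1 MiB per chunk
        .password("s3cret")        // opt in to AES-256-GCM encryption
        .compress(&mut input, &mut output)?;
    Ok(())
}
```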
Streaming API
- `stream_reader()` - Read from compressed streams
- `stream_writer()` - Write to compressed streams (see the sketch below)
- `Compressor` - Stateful streaming compression
- `Decompressor` - Stateful streaming decompression
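And a sketch of the adapter style the streaming API suggests; the exact shapes of `stream_writer` and `stream_reader` are assumptions:

```rust
use std::io::{Read, Write};
use mismall::{stream_reader, stream_writer};

fn stream_roundtrip() -> std::io::Result<String> {
    // Assumed: stream_writer wraps any Write and compresses bytes on the fly.
    let mut compressed = Vec::new();
    {
        let mut writer = stream_writer(&mut compressed);
        writer.write_all(b"streamed through the compressor")?;
        writer.flush()?; // flush bit-level buffers before dropping
    }

    // Assumed: stream_reader wraps any Read and decompresses on the fly.
    let mut restored = String::new();
    stream_reader(compressed.as_slice()).read_to_string(&mut restored)?;
    Ok(restored)
}
```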
🛠️ Library Examples
The examples/ directory contains comprehensive library examples:
- `simple_compress.rs` - Basic compression and decompression
- `advanced_compression.rs` - Compression with encryption and custom settings
- `archive_operations.rs` - Multi-file archive creation and extraction
- `streaming.rs` - Real-time streaming compression/decompression
- `performance.rs` - Performance comparison and benchmarks
Run examples with:
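```sh
cargo run --example simple_compress
cargo run --example streaming
```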
📈 Performance Tips
For comprehensive performance optimization guidance, see PERFORMANCE.md:
- Memory usage optimization for different system configurations
- Chunk size selection strategies
- Data type-specific recommendations
- Encryption performance considerations
- Streaming best practices
- Benchmarking templates
- Common pitfalls to avoid
🔧 Error Handling
All library functions return `Result<T, MismallError>`, where `MismallError` provides detailed error information with context for troubleshooting:

```rust
// Sketch: handling a compression failure. Assumes MismallError implements
// Display (reasonable for an error type, but not verified here).
match compress_stream(&mut reader, &mut writer) {
    Ok(_) => println!("compression succeeded"),
    Err(e) => eprintln!("compression failed: {e}"),
}
```
CLI Tool Usage
The mismall library also includes a command-line interface. Install and use as follows:
Install
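Assuming the crate is published to crates.io, Cargo's standard install command works; note that the `cli` feature is non-default, per the feature flags above:

```sh
cargo install mismall --features cli
```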
Single File Operations
- Compress (with optional encryption and ratio display):
  - If `OUTPUT_BASENAME` is omitted: output is `<INPUT>.small`
  - If provided: output is `<OUTPUT_BASENAME>.small`
  - `--chunk-size`: Memory usage (default 16MB, min 64KB recommended)
  - `-p`: Optional password for AES-256-GCM encryption
- Decompress:
  - If `OUTPUT_NAME` is omitted: restores original filename from header
  - `--chunk-size`: Memory usage for decryption operations
Archive Operations
- Create archive from directory
- List archive contents
- Extract from archive

Illustrative invocations for all three appear in the Examples section below.
Memory Usage Guidelines
- Low memory systems (1GB RAM): `--chunk-size 65536` (64KB)
- Standard systems (8GB+ RAM): Default 16MB (16,777,216 bytes)
- High-memory systems (32GB+ RAM): `--chunk-size 1073741824` (1GB)
How it works
- Pass 1: Stream input file in configurable chunks to compute symbol frequencies and checksum (sketched after this list)
- Codebook Generation: Build canonical Huffman tree and generate optimal code table
- Pass 2: Stream input again, encoding data using bit-level packing with 4KB buffers
- Encryption (optional): Apply AES-256-GCM with chunked processing and per-chunk authentication
- Archive Creation: Combine multiple compressed files with metadata into single container
- Decoding: Reverse process with streaming decryption and bit-level expansion
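As an illustration of Pass 1, a chunked, bounded-memory frequency count can be written with nothing but the standard library. This is a sketch of the described design, not the library's internal code:

```rust
use std::io::{self, Read};

/// Stream `reader` in fixed-size chunks and tally byte frequencies,
/// keeping memory bounded by `chunk_size` regardless of input length.
fn byte_frequencies<R: Read>(mut reader: R, chunk_size: usize) -> io::Result<[u64; 256]> {
    let mut freqs = [0u64; 256];
    let mut buf = vec![0u8; chunk_size];
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // EOF: frequencies (and, in the real pass, the checksum) are complete
        }
        for &b in &buf[..n] {
            freqs[b as usize] += 1;
        }
    }
    Ok(freqs)
}
```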
Performance Characteristics
Memory Usage
- Bounded: Maximum memory usage = `chunk-size` + small overhead (~50KB)
- Scalable: Handles arbitrarily large files with constant memory footprint
- Temporary Storage: Uses OS temporary files for intermediate processing
Compression Performance
- Text Files: 20-35% size reduction, linear time complexity
- Source Code: 25-40% size reduction, fast encoding/decoding
- Already-Compressed Media: Stored raw (no expansion), minimal overhead
Encryption Performance
- AES-256-GCM: Hardware-accelerated on modern CPUs
- Per-Chunk Authentication: Detects corruption early in the stream
- Password-Based Security: PBKDF2 key derivation with a random salt (sketched below)
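For orientation, here is how a derive-then-seal step looks with the RustCrypto `pbkdf2` and `aes-gcm` crates. This is a generic sketch: the iteration count, salt handling, nonce layout, and framing are illustrative, not mismall's on-disk format:

```rust
use aes_gcm::{
    aead::{Aead, KeyInit},
    Aes256Gcm, Nonce,
};
use pbkdf2::pbkdf2_hmac;
use sha2::Sha256;

/// Derive a 256-bit key from a password and random salt, then seal one chunk
/// with AES-256-GCM (which appends a 16-byte authentication tag).
fn seal_chunk(password: &[u8], salt: &[u8; 16], chunk: &[u8]) -> Vec<u8> {
    let mut key = [0u8; 32];
    // Iteration count is illustrative, not mismall's actual parameter.
    pbkdf2_hmac::<Sha256>(password, salt, 100_000, &mut key);

    let cipher = Aes256Gcm::new(&key.into());
    let nonce_bytes = [0u8; 12]; // in practice: unique per chunk, never reused
    let nonce = Nonce::from_slice(&nonce_bytes);
    cipher.encrypt(nonce, chunk).expect("AES-GCM encryption failed")
}
```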
Performance Snapshot (Intel i7, 16GB RAM)
Text / Structured Data
- HTML (~4.5 MiB): ratio 73% (to 3.3 MiB); encode 92 ms, decode 80 ms
- Source file (~4.4 KiB): ratio 63% (to 2.8 KiB); sub-millisecond times
Small / Medium Binaries
- Binary (~5.5 MiB): ratio 82% (to 4.5 MiB); encode 108 ms, decode 99 ms
- Binary (~82 MiB): ratio 80% (to 65 MiB); encode 1.6 s, decode 1.46 s
Archive Operations
- Multi-file archive: Linear scaling with total compressed size
- Extraction: Constant time per file, regardless of archive size
- Encryption overhead: ~16 bytes per 16MB chunk + 28 bytes header
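For example, a 1 GiB file at the default 16 MiB chunk size is 64 chunks, so encryption adds 64 × 16 = 1,024 bytes of authentication tags plus the 28-byte header, roughly 1 KiB in total.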
Encryption Performance
- AES-256-GCM: ~500 MB/s on modern CPUs with hardware acceleration
- Memory overhead: Configurable chunk size (default 16MB)
- Authentication: Per-chunk tags enable early corruption detection
Integrity
- All tested files round-tripped successfully under SHA-256 verification
- Chunk-level authentication: Detects corruption during streaming
- Memory bounds: No buffer overflows or integer overflows in 66 tests
Limitations
- Streaming I/O Required: Not designed for in-memory only operations (feature, not bug)
- Huffman-Only Compression: Less effective on already-compressed media than DEFLATE/LZ77
- No Parallel Processing: Single-threaded for simplicity and determinism
Testing
Mismall ships with a comprehensive test suite (66 tests) covering:
- Core Logic: Huffman encoding/decoding with streaming architecture
- Cryptographic Operations: Key derivation, encryption, decryption, authentication
- Archive Management: Multi-file operations and metadata handling
- Error Handling: Corrupted data, wrong passwords, edge cases
- Memory Safety: Bounded memory usage under all conditions
- I/O Operations: Bit-level reading/writing with proper padding
- Integration: End-to-end compress/decompress/extract workflows
Run all tests with:
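```sh
cargo test
```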
Examples
The invocations below are illustrative: the subcommand names are hypothetical, while `--chunk-size`, `-p`, and the `.small` naming rules match the documentation above.

```sh
# Basic compression with ratio
mismall compress notes.txt
# Compress with encryption and custom chunk size
mismall compress notes.txt -p "s3cret" --chunk-size 1048576
# Decompress with password
mismall decompress notes.txt.small -p "s3cret"
# Create archive from directory
mismall pack ./project project.small
# Extract specific file from archive
mismall extract project.small src/main.rs
# List archive contents
mismall list project.small
```
License
MIT — do whatever you want, just don't claim you wrote it.
🔧 Legacy CLI (Version 1.0.0)
The original hand-crafted CLI implementation remains available as the legacy version.
Access Legacy CLI
Option A: Checkout Directly
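From a local clone, check out the last hand-written commit (hash from the Development History below):

```sh
git checkout f44054c
```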
Option B: Version Pinning
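A sketch for Cargo.toml; the repository URL is a placeholder and the pinning style is an assumption:

```toml
[dependencies]
mismall = { git = "<repo-url>", rev = "f44054c" }  # pin to the legacy commit
```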
Option C: Use legacy-cli Branch
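From a local clone, switch to the dedicated branch:

```sh
git checkout legacy-cli
```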
Repository Structure
- Main Branch: Shows the original hand-crafted CLI work (commit `f44054c`)
- AI Branch: Modern library transformation (`ai-library-transformation`)
- Cargo Integration: Points to the AI branch automatically via Cargo.toml
This means:
- GitHub visitors see the original CLI work first
- Cargo users get the modern library automatically
- Legacy access remains available through branches/commits
Development History
- Original Implementation: Hand-crafted CLI by Josiah Morris (up to commit f44054c)
- Library Transformation: AI-assisted development (OpenAI/opencode) transforming the CLI into a production-ready library
- Current State: Both versions accessible, library as primary focus
The transformation preserved all original concepts while adding comprehensive library capabilities.