# crush-gpu

GPU-accelerated tile-based compression engine with 32-way parallel decompression via GDeflate.
## Overview
crush-gpu is the GPU acceleration crate for the Crush compression toolkit. It implements a GDeflate-inspired compression format designed for massively parallel GPU decompression.
Key design principles:
- 64 KB independent tiles enable parallel processing and random access
- 32-way sub-stream parallelism matches GPU warp/wavefront width
- Batched GPU dispatch minimizes host-GPU synchronization overhead
- Automatic CPU fallback when no GPU is available or decompression fails
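The tiling model behind the first principle can be sketched in a few lines. This is illustrative only; the real tile handling lives in the `engine` module:

```rust
/// Default tile size: 64 KB, matching the format's independent-tile design.
const TILE_SIZE: usize = 64 * 1024;

/// Split input into fixed-size tiles; each tile can be compressed or
/// decompressed independently, which is what enables parallel processing
/// and random access.
fn tiles(data: &[u8]) -> Vec<&[u8]> {
    data.chunks(TILE_SIZE).collect()
}
```

Because no tile depends on another, tiles can be dispatched to GPU threads in any order and individually re-decoded for random access.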
## Architecture

```
┌─────────────────────────────────┐
│            engine.rs            │
│    compress() / decompress()    │
└──────┬──────────────┬───────────┘
       │              │
┌──────▼──────┐ ┌─────▼──────────┐
│ gdeflate.rs │ │    backend/    │
│ CPU compress│ │ GPU decompress │
│ CPU fallback│ │  (wgpu/CUDA)   │
└─────────────┘ └─────┬──────────┘
                      │
                ┌─────▼──────────┐
                │    shader/     │
                │  WGSL + CUDA   │
                └────────────────┘
```
## Modules

| Module | Purpose |
|---|---|
| `engine` | Top-level compress/decompress API, tile management, random access |
| `backend` | GPU abstraction layer (`ComputeBackend` trait), wgpu implementation |
| `format` | CGPU binary file format (headers, tile index, footer) |
| `gdeflate` | GDeflate compression (CPU) and decompression (CPU fallback) |
| `scorer` | GPU eligibility scoring (file size, entropy, GPU availability) |
| `entropy` | Shannon entropy calculation for compressibility assessment |
| `vectorize` | Heuristic for text-heavy data detection |
| `lz77` | LZ77 match finding for the v1 sub-stream format |
## Performance

### GPU Decompression Throughput

Test environment:

- GPU: NVIDIA GeForce RTX 3060 Ti (8 GB VRAM, 38 SMs)
- Rust: 1.93.1 (stable), release mode

Benchmark results (Criterion, wgpu Vulkan backend):
| Corpus | GPU Throughput | CPU Throughput | Winner |
|---|---|---|---|
| log-1MB | 141 MiB/s | 319 MiB/s | CPU (small data) |
| binary-1MB | 85 MiB/s | 126 MiB/s | CPU (small data) |
| mixed-1MB | 87 MiB/s | 176 MiB/s | CPU (small data) |
| mixed-10MB | 355 MiB/s | 179 MiB/s | GPU 1.98x |
GPU decompression outperforms CPU at larger data sizes where the per-tile dispatch overhead is amortized. The crossover point is around 2-4 MB.
### Real-World Large File Performance (1.8 GB compressed, ~62K tiles)
| Backend | Throughput | vs CPU |
|---|---|---|
| CUDA (multi-block) | 560 MiB/s | 3.1x |
| wgpu (Vulkan) | 386 MiB/s | 2.2x |
| CPU (all cores) | 179 MiB/s | baseline |
CUDA's multi-block kernel launch (grid_dim = num_tiles) distributes tiles across all 38 SMs simultaneously. For large files with thousands of tiles, CUDA achieves 1.45x over wgpu and 3.1x over CPU.
### Multi-Block vs Per-Tile Dispatch
The CUDA backend uses a single multi-block kernel launch per batch (up to 512 tiles). Each CUDA block (32 threads) processes one tile, enabling all SMs to work in parallel:
| Dispatch Mode | Throughput (large file) | Improvement |
|---|---|---|
| Per-tile sequential | ~10 MiB/s | baseline |
| Multi-block (current) | 560 MiB/s | 56x |
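The batch split described above can be sketched as follows. The 512-tile cap comes from this section; the function name is illustrative:

```rust
/// Maximum tiles per kernel launch, per the batching policy above.
const MAX_TILES_PER_LAUNCH: usize = 512;

/// Group tiles into kernel launches of at most 512 tiles each.
/// Returns (first_tile, tile_count) per launch; within a launch,
/// each tile maps to one 32-thread block (grid_dim = tile_count).
fn launches(num_tiles: usize) -> Vec<(usize, usize)> {
    (0..num_tiles)
        .step_by(MAX_TILES_PER_LAUNCH)
        .map(|start| (start, MAX_TILES_PER_LAUNCH.min(num_tiles - start)))
        .collect()
}
```

For the ~62K-tile file above, this yields on the order of 120 launches instead of 62,000 per-tile dispatches, which is where the 56x improvement comes from.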
### Compression Throughput (CPU)
| Corpus | Throughput |
|---|---|
| log-text-1MB | 179 MiB/s |
| binary-1MB | 69 MiB/s |
| mixed-1MB | 98 MiB/s |
| mixed-10MB | 100 MiB/s |
## File Format (CGPU)

```
┌──────────────────────┐
│ GpuFileHeader (64B)  │  Magic "CGPU", version, tile_size, tile_count
├──────────────────────┤
│ Tile 0               │  TileHeader (32B) + compressed payload
│ (128-byte aligned)   │
├──────────────────────┤
│ Tile 1               │
├──────────────────────┤
│ ...                  │
├──────────────────────┤
│ Tile Index           │  O(1) random access to any tile
├──────────────────────┤
│ GpuFileFooter (24B)  │  Index offset, checksum, magic
└──────────────────────┘
```
- Tile size: 64 KB (default), independently decompressible
- Alignment: 128-byte boundaries for GPU memory coalescing
- Checksums: Optional per-tile CRC32 integrity verification
- Random access: Tile index enables O(1) decompression of any tile (~1 ms)
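The alignment rule can be illustrated with a small offset calculation. Header and tile-header sizes are taken from the diagram above; the real `format` module may differ in detail:

```rust
/// Sizes from the CGPU layout diagram.
const HEADER_SIZE: u64 = 64;      // GpuFileHeader
const TILE_HEADER_SIZE: u64 = 32; // TileHeader
const TILE_ALIGN: u64 = 128;      // GPU memory-coalescing boundary

/// Round an offset up to the next multiple of `align`.
fn align_up(off: u64, align: u64) -> u64 {
    (off + align - 1) / align * align
}

/// Compute the aligned start offset of each tile given its compressed
/// payload size (illustrative; the real index is stored in the file).
fn tile_offsets(payload_sizes: &[u64]) -> Vec<u64> {
    let mut off = HEADER_SIZE;
    payload_sizes
        .iter()
        .map(|&sz| {
            off = align_up(off, TILE_ALIGN);
            let start = off;
            off += TILE_HEADER_SIZE + sz;
            start
        })
        .collect()
}
```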
## GDeflate Algorithm
GDeflate distributes DEFLATE across 32 parallel sub-streams:
- LZ77 match finding (greedy, 3-byte hash chains)
- Round-robin distribution of symbols across 32 sub-streams
- Fixed Huffman encoding (BTYPE=01) per sub-stream
- Interleaved serialization for GPU-friendly memory access
On decompression, 32 GPU threads each decode one sub-stream in parallel, then reconstruct the original data.
Reference: IETF draft draft-uralsky-gdeflate-00
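The round-robin distribution and its inverse can be sketched as follows. The symbol type is simplified to `u16` here; real GDeflate symbols are Huffman-coded literals and length/distance pairs:

```rust
/// Sub-stream count matches GPU warp/wavefront width.
const SUBSTREAMS: usize = 32;

/// Distribute symbols round-robin across 32 sub-streams: symbol i goes
/// to stream i % 32.
fn distribute(symbols: &[u16]) -> Vec<Vec<u16>> {
    let mut streams: Vec<Vec<u16>> = vec![Vec::new(); SUBSTREAMS];
    for (i, &s) in symbols.iter().enumerate() {
        streams[i % SUBSTREAMS].push(s);
    }
    streams
}

/// Inverse mapping used after 32 threads decode their streams in
/// parallel: symbol i is found at streams[i % 32][i / 32].
fn reassemble(streams: &[Vec<u16>], n: usize) -> Vec<u16> {
    (0..n).map(|i| streams[i % SUBSTREAMS][i / SUBSTREAMS]).collect()
}
```

The fixed mapping is what makes the decode side embarrassingly parallel: each thread's read position depends only on its own stream, not on its neighbors.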
## Usage

### As a Library

```rust
use crush_gpu::{compress, decompress, EngineConfig};
use std::sync::atomic::AtomicBool;

// Argument lists below are illustrative; see the crate docs for exact signatures.
let cancel = AtomicBool::new(false);
let config = EngineConfig::default();

// Compress (always on CPU)
let data = b"Hello, GPU compression!".repeat(1000);
let compressed = compress(&data, &config, &cancel).expect("compression failed");

// Decompress (GPU if available, CPU fallback)
let decompressed = decompress(&compressed, &config, &cancel).expect("decompression failed");
assert_eq!(data, decompressed);
```
### Configuration

```rust
use crush_gpu::EngineConfig;

// Start from the defaults and override fields as needed
// (specific field names are documented on EngineConfig).
let config = EngineConfig {
    ..EngineConfig::default()
};
```
### Random Access

```rust
use crush_gpu::{decompress_tile_by_index, load_tile_index, EngineConfig};

// Argument lists below are illustrative; see the crate docs for exact signatures.
let config = EngineConfig::default();
let archive: &[u8] = &compressed_data;

// Load the tile index (O(1) per tile)
let index = load_tile_index(archive).expect("failed to load tile index");

// Decompress only the tile you need
let tile_data = decompress_tile_by_index(archive, &index, 42, &config)
    .expect("failed to decompress tile");
```
### As a crush-core Plugin

crush-gpu auto-registers as a crush-core plugin via `linkme`:

```rust
use crush_core::{init_plugins, list_plugins};

// Call shapes are illustrative; see crush-core's plugin docs.
init_plugins().expect("plugin initialization failed");
for plugin in list_plugins() {
    println!("{:?}", plugin);
}
```
## GPU Backend

### Backends

| Backend | API | Feature flag | GPU vendor |
|---|---|---|---|
| wgpu | Vulkan 1.2+ / Metal 2+ / DX12 | (default) | Any supported GPU |
| CUDA | CUDA 13.x via nvrtc | `cuda` | NVIDIA (compute capability 7.0+) |

Backend selection priority (with `--gpu-backend auto`, the default):

1. CUDA (if the `cuda` feature is enabled and an NVIDIA GPU is present)
2. wgpu (Vulkan on Windows/Linux, Metal on macOS)

Override at runtime: `crush compress --gpu-backend cuda` or `--gpu-backend wgpu`.
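The priority order amounts to a simple cascade, sketched here with illustrative names (the real selection logic lives in the `backend` module):

```rust
#[derive(Debug, PartialEq)]
enum Backend {
    Cuda,
    Wgpu,
    Cpu, // fallback when no GPU backend is usable
}

/// Pick the highest-priority usable backend, mirroring the list above:
/// CUDA first, then wgpu, then the CPU fallback.
fn select_backend(cuda_feature: bool, nvidia_present: bool, wgpu_ok: bool) -> Backend {
    if cuda_feature && nvidia_present {
        Backend::Cuda
    } else if wgpu_ok {
        Backend::Wgpu
    } else {
        Backend::Cpu
    }
}
```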
### Requirements
- 2 GB+ VRAM (discrete GPU recommended)
- Vulkan, Metal, or DX12 driver
- CUDA backend additionally requires NVIDIA CUDA Toolkit 13.x and Visual Studio 2022 Build Tools (Windows)
## GPU Eligibility
The scorer module automatically determines whether GPU acceleration benefits a given workload:
| Criterion | Threshold |
|---|---|
| File size | > 100 MB |
| GPU available | Yes |
| Shannon entropy | < 7.5 bits/byte |
All criteria must pass for GPU dispatch. High-entropy (incompressible) data is rejected to avoid wasting GPU resources.
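A minimal sketch of the eligibility check, using the thresholds from the table above (the real `scorer` module may weigh additional signals):

```rust
/// Shannon entropy in bits per byte, as computed by the `entropy` module's
/// approach: 8.0 means incompressible, 0.0 means trivially compressible.
fn shannon_entropy(data: &[u8]) -> f64 {
    let mut counts = [0u64; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let n = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// All three criteria from the table must pass for GPU dispatch.
fn gpu_eligible(file_size: u64, gpu_available: bool, entropy: f64) -> bool {
    file_size > 100 * 1024 * 1024 && gpu_available && entropy < 7.5
}
```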
## Running Benchmarks

```sh
# Throughput benchmarks (GPU vs CPU decompression) and
# compression ratio benchmarks (select a single target
# with `cargo bench --bench <name>`)
cargo bench -p crush-gpu

# All crush-gpu tests
cargo test -p crush-gpu
```
## Development

```sh
# Build
cargo build -p crush-gpu

# Build with CUDA support
cargo build -p crush-gpu --features cuda

# Test (standard — no GPU required for CI)
cargo test -p crush-gpu

# Clippy
cargo clippy -p crush-gpu --all-targets -- -D warnings

# Docs
cargo doc -p crush-gpu --no-deps
```
## CUDA Testing

CUDA tests are feature-gated behind `#[cfg(feature = "cuda")]` and require:

- An NVIDIA GPU with >= 2 GB VRAM
- CUDA Toolkit nvrtc DLLs on `PATH`

On Windows with CUDA 13.1, add the runtime libraries to your shell:

```sh
# Git Bash / MSYS2
export PATH="/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v13.1/bin/x64:$PATH"
```

```powershell
# PowerShell
$env:PATH = "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1\bin\x64;$env:PATH"
```

Then run:

```sh
cargo test -p crush-gpu --features cuda
```

This executes 10 CUDA-specific tests:
- LZ77 roundtrip (small, 64 KB, multi-tile)
- GDeflate roundtrip (multiple sizes, batch)
- Cancellation (pre-set and mid-batch)
- CUDA vs CPU output parity
- Backend info validation
## Full Local GPU Verification

```sh
# All crush-gpu tests (wgpu + CUDA)
cargo test -p crush-gpu --features cuda

# Clippy (CUDA mode)
cargo clippy -p crush-gpu --features cuda --all-targets -- -D warnings
```
## CI Notes

GitHub Actions' free tier has no GPU. CI runs `cargo clippy --all-targets -- -D warnings` and `cargo test` without the `cuda` feature. GPU and CUDA tests are local-only.

Note: `nvcc` does not need to be on `PATH` at build time. The `cudarc` crate uses the explicit `cuda-13010` feature flag instead of auto-detecting the CUDA version.
## License
MIT