crush-gpu 0.1.1

GPU-accelerated tile-based compression engine with 32-way parallel decompression via GDeflate.

Overview

crush-gpu is the GPU acceleration crate for the Crush compression toolkit. It implements a GDeflate-inspired compression format designed for massively parallel GPU decompression.

Key design principles:

  • 64 KB independent tiles enable parallel processing and random access
  • 32-way sub-stream parallelism matches GPU warp/wavefront width
  • Batched GPU dispatch minimizes host-GPU synchronization overhead
  • Automatic CPU fallback when no GPU is available or decompression fails
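The 64 KB tiling scheme above can be sketched in a few lines. This is an illustrative example, not the crate's actual API; `TILE_SIZE` matches the default described here, but `partition_into_tiles` is a name of our own.

```rust
/// Illustrative only: partition input into independent 64 KB tiles,
/// as described above. Not the crate's actual API.
const TILE_SIZE: usize = 64 * 1024;

fn partition_into_tiles(data: &[u8]) -> Vec<&[u8]> {
    // Each chunk is independently compressible/decompressible,
    // which is what enables parallel dispatch and random access.
    data.chunks(TILE_SIZE).collect()
}

fn main() {
    let data = vec![0u8; 150 * 1024]; // 150 KB input
    let tiles = partition_into_tiles(&data);
    assert_eq!(tiles.len(), 3); // 64 KB + 64 KB + 22 KB
    assert_eq!(tiles[2].len(), 22 * 1024);
    println!("{} tiles", tiles.len());
}
```

Because tiles never reference each other, each one can be handed to a separate GPU block or decoded on its own for random access.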

Architecture

                    ┌─────────────────────────────────┐
                    │            engine.rs            │
                    │    compress() / decompress()    │
                    └───────┬───────────────┬─────────┘
                            │               │
                   ┌────────▼───────┐ ┌─────▼──────────┐
                   │  gdeflate.rs   │ │  backend/      │
                   │  CPU compress  │ │ GPU decompress │
                   │  CPU fallback  │ │  (wgpu/CUDA)   │
                   └────────────────┘ └─────┬──────────┘
                                            │
                                      ┌─────▼──────────┐
                                      │  shader/       │
                                      │  WGSL + CUDA   │
                                      └────────────────┘

Modules

Module      Purpose
engine      Top-level compress/decompress API, tile management, random access
backend     GPU abstraction layer (ComputeBackend trait), wgpu implementation
format      CGPU binary file format (headers, tile index, footer)
gdeflate    GDeflate compression (CPU) and decompression (CPU fallback)
scorer      GPU eligibility scoring (file size, entropy, GPU availability)
entropy     Shannon entropy calculation for compressibility assessment
vectorize   Heuristic for text-heavy data detection
lz77        LZ77 match finding for the v1 sub-stream format

Performance

GPU Decompression Throughput

Test Environment:

  • GPU: NVIDIA GeForce RTX 3060 Ti (8 GB VRAM, 38 SMs)
  • Rust: 1.93.1 (stable), release mode

Benchmark Results (Criterion, wgpu Vulkan backend)

Corpus       GPU Throughput   CPU Throughput   Winner
log-1MB      141 MiB/s        319 MiB/s        CPU (small data)
binary-1MB    85 MiB/s        126 MiB/s        CPU (small data)
mixed-1MB     87 MiB/s        176 MiB/s        CPU (small data)
mixed-10MB   355 MiB/s        179 MiB/s        GPU (1.98x)

GPU decompression outperforms CPU at larger data sizes where the per-tile dispatch overhead is amortized. The crossover point is around 2-4 MB.

Real-World Large File Performance (1.8 GB compressed, ~62K tiles)

Backend              Throughput   vs CPU
CUDA (multi-block)   560 MiB/s    3.1x
wgpu (Vulkan)        386 MiB/s    2.2x
CPU (all cores)      179 MiB/s    baseline

CUDA's multi-block kernel launch (grid_dim = num_tiles) distributes tiles across all 38 SMs simultaneously. For large files with thousands of tiles, CUDA achieves 1.45x over wgpu and 3.1x over CPU.

Multi-Block vs Per-Tile Dispatch

The CUDA backend uses a single multi-block kernel launch per batch (up to 512 tiles). Each CUDA block (32 threads) processes one tile, enabling all SMs to work in parallel:

Dispatch Mode           Throughput (large file)   Improvement
Per-tile sequential     ~10 MiB/s                 baseline
Multi-block (current)   560 MiB/s                 56x

Compression Throughput (CPU)

Corpus         Throughput
log-text-1MB   179 MiB/s
binary-1MB      69 MiB/s
mixed-1MB       98 MiB/s
mixed-10MB     100 MiB/s

File Format (CGPU)

┌──────────────────────┐
│  GpuFileHeader (64B) │  Magic "CGPU", version, tile_size, tile_count
├──────────────────────┤
│  Tile 0              │  TileHeader (32B) + compressed payload
│  (128-byte aligned)  │
├──────────────────────┤
│  Tile 1              │
├──────────────────────┤
│  ...                 │
├──────────────────────┤
│  Tile Index          │  O(1) random access to any tile
├──────────────────────┤
│  GpuFileFooter (24B) │  Index offset, checksum, magic
└──────────────────────┘
  • Tile size: 64 KB (default), independently decompressible
  • Alignment: 128-byte boundaries for GPU memory coalescing
  • Checksums: Optional per-tile CRC32 integrity verification
  • Random access: Tile index enables O(1) decompression of any tile (~1 ms)
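The 128-byte alignment rule above amounts to rounding each tile's start offset up to the next multiple of 128. A minimal sketch, with a hypothetical helper name that is not part of the CGPU format code:

```rust
/// Sketch of the 128-byte alignment rule described above.
/// `align_up` is a hypothetical helper, not the crate's API.
const ALIGNMENT: u64 = 128;

fn align_up(offset: u64) -> u64 {
    // Round up to the next 128-byte boundary; works because
    // ALIGNMENT is a power of two, so !(ALIGNMENT - 1) masks
    // off the low 7 bits after the bump.
    (offset + ALIGNMENT - 1) & !(ALIGNMENT - 1)
}

fn main() {
    assert_eq!(align_up(0), 0);
    assert_eq!(align_up(1), 128);
    assert_eq!(align_up(128), 128);
    assert_eq!(align_up(129), 256);
}
```

Aligned starts let the GPU issue coalesced loads for each tile's payload without straddling cache lines.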

GDeflate Algorithm

GDeflate distributes DEFLATE across 32 parallel sub-streams:

  1. LZ77 match finding (greedy, 3-byte hash chains)
  2. Round-robin distribution of symbols across 32 sub-streams
  3. Fixed Huffman encoding (BTYPE=01) per sub-stream
  4. Interleaved serialization for GPU-friendly memory access

On decompression, 32 GPU threads each decode one sub-stream in parallel, then reconstruct the original data.

Reference: IETF draft draft-uralsky-gdeflate-00
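Step 2 above, the round-robin split, can be sketched together with the inverse reassembly that the 32 decoding threads effectively perform. This is an illustration under our own names (`distribute`, `reassemble`), not the crate's sub-stream code, and it works on abstract symbols rather than Huffman-coded bits:

```rust
/// Illustrative sketch of round-robin distribution of symbols across
/// 32 sub-streams (step 2 above) and the inverse reassembly.
const SUB_STREAMS: usize = 32;

fn distribute(symbols: &[u32]) -> Vec<Vec<u32>> {
    let mut streams = vec![Vec::new(); SUB_STREAMS];
    for (i, &sym) in symbols.iter().enumerate() {
        // Symbol i goes to sub-stream i mod 32.
        streams[i % SUB_STREAMS].push(sym);
    }
    streams
}

fn reassemble(streams: &[Vec<u32>]) -> Vec<u32> {
    // Symbol i sits at position i / 32 within stream i mod 32,
    // so each of 32 threads can decode its stream independently
    // and scatter results back to the right offsets.
    let total: usize = streams.iter().map(|s| s.len()).sum();
    (0..total)
        .map(|i| streams[i % SUB_STREAMS][i / SUB_STREAMS])
        .collect()
}

fn main() {
    let symbols: Vec<u32> = (0..100).collect();
    let streams = distribute(&symbols);
    assert_eq!(reassemble(&streams), symbols); // roundtrip preserves order
}
```

The key property is that stream membership and in-stream position are both pure functions of the symbol index, so no coordination between threads is needed during decode.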

Usage

As a Library

use crush_gpu::engine::{compress, decompress, EngineConfig};
use std::sync::atomic::AtomicBool;

let cancel = AtomicBool::new(false);
let config = EngineConfig::default();

// Compress (always on CPU)
let data = b"Hello, GPU compression!".repeat(1000);
let compressed = compress(&data, &config, &cancel).expect("compress");

// Decompress (GPU if available, CPU fallback)
let decompressed = decompress(&compressed, &config, &cancel).expect("decompress");
assert_eq!(data.as_slice(), decompressed.as_slice());

Configuration

use crush_gpu::engine::EngineConfig;

let config = EngineConfig {
    tile_size: 65536,          // 64 KB tiles (default)
    sub_stream_count: 32,      // Matches GPU warp width
    enable_checksums: true,    // Per-tile CRC32
    force_cpu: false,          // Allow GPU decompression
};

Random Access

use crush_gpu::engine::{load_tile_index, decompress_tile_by_index, EngineConfig};

let config = EngineConfig::default();
// `compressed_data` is a full CGPU archive produced by a prior compress()
let archive: &[u8] = &compressed_data;

// Load tile index (O(1) per tile)
let index = load_tile_index(archive).expect("load index");

// Decompress only the tile you need
let tile_data = decompress_tile_by_index(archive, 42, &index, &config)
    .expect("decompress tile");

As a crush-core Plugin

crush-gpu auto-registers as a crush-core plugin via linkme:

use crush_core::{init_plugins, list_plugins};

init_plugins().expect("init");

for plugin in list_plugins() {
    println!("{}: {} MB/s", plugin.name, plugin.throughput);
    // Prints: gpu-deflate: 2000 MB/s
}

GPU Backend

Backends

Backend   API                             Feature flag   GPU vendor
wgpu      Vulkan 1.2+ / Metal 2+ / DX12   (default)      Any supported GPU
CUDA      CUDA 13.x via nvrtc             cuda           NVIDIA (compute capability 7.0+)

Backend selection priority (with --gpu-backend auto, the default):

  1. CUDA (if cuda feature enabled and NVIDIA GPU present)
  2. wgpu (Vulkan on Windows/Linux, Metal on macOS)

Override at runtime: crush compress --gpu-backend cuda or --gpu-backend wgpu.
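The priority order above boils down to a simple cascade. A hedged sketch with our own enum and probe flags (the crate's real types and probing logic will differ):

```rust
/// Hedged sketch of the auto-selection order described above;
/// the names here are illustrative, not the crate's API.
#[derive(Debug, PartialEq)]
enum Backend {
    Cuda,
    Wgpu,
    Cpu,
}

fn select_backend(cuda_feature: bool, nvidia_present: bool, wgpu_adapter: bool) -> Backend {
    if cuda_feature && nvidia_present {
        Backend::Cuda // 1. CUDA when compiled in and an NVIDIA GPU is found
    } else if wgpu_adapter {
        Backend::Wgpu // 2. wgpu (Vulkan / Metal / DX12)
    } else {
        Backend::Cpu // fallback: CPU decompression
    }
}

fn main() {
    assert_eq!(select_backend(true, true, true), Backend::Cuda);
    assert_eq!(select_backend(false, true, true), Backend::Wgpu);
    assert_eq!(select_backend(false, false, false), Backend::Cpu);
}
```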

Requirements

  • 2 GB+ VRAM (discrete GPU recommended)
  • Vulkan, Metal, or DX12 driver
  • CUDA backend additionally requires NVIDIA CUDA Toolkit 13.x and Visual Studio 2022 Build Tools (Windows)

GPU Eligibility

The scorer module automatically determines whether GPU acceleration benefits a given workload:

Criterion         Threshold
File size         > 100 MB
GPU available     Yes
Shannon entropy   < 7.5 bits/byte

All criteria must pass for GPU dispatch. High-entropy (incompressible) data is rejected to avoid wasting GPU resources.
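The check above can be sketched as a Shannon entropy computation plus a threshold test. The thresholds match the table; the function names are illustrative, not the scorer module's API:

```rust
/// Sketch of the eligibility check described above. Thresholds match
/// the table; `shannon_entropy` and `gpu_eligible` are our names,
/// not the scorer API.
fn shannon_entropy(data: &[u8]) -> f64 {
    // Count byte frequencies, then sum -p * log2(p) over observed bytes.
    let mut counts = [0u64; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let n = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

fn gpu_eligible(size: u64, gpu_available: bool, entropy: f64) -> bool {
    // All three criteria from the table must pass.
    size > 100 * 1024 * 1024 && gpu_available && entropy < 7.5
}

fn main() {
    let uniform = vec![b'a'; 4096]; // single symbol: 0 bits/byte
    assert!(shannon_entropy(&uniform).abs() < 1e-12);
    // All 256 byte values equally likely: 8 bits/byte (incompressible).
    let random: Vec<u8> = (0u8..=255).cycle().take(4096).collect();
    assert!((shannon_entropy(&random) - 8.0).abs() < 1e-9);
    assert!(!gpu_eligible(10 * 1024 * 1024, true, 3.0)); // too small
    assert!(gpu_eligible(200 * 1024 * 1024, true, 3.0));
}
```

High-entropy data is near the 8 bits/byte ceiling, so the < 7.5 threshold filters out payloads that would compress poorly regardless of backend.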

Running Benchmarks

# Throughput benchmarks (GPU vs CPU decompression)
cargo bench --package crush-gpu --bench throughput

# Compression ratio benchmarks
cargo bench --package crush-gpu --bench ratio

# All crush-gpu tests
cargo test --package crush-gpu

Development

# Build
cargo build -p crush-gpu

# Build with CUDA support
cargo build -p crush-gpu --features cuda

# Test (standard — no GPU required for CI)
cargo test -p crush-gpu

# Clippy
cargo clippy -p crush-gpu --all-targets -- -D warnings

# Docs
cargo doc -p crush-gpu --no-deps

CUDA Testing

CUDA tests are feature-gated behind #[cfg(feature = "cuda")] and require:

  1. An NVIDIA GPU with >= 2 GB VRAM
  2. CUDA Toolkit nvrtc DLLs on PATH

On Windows with CUDA 13.1, add the runtime libraries to your shell:

# Git Bash / MSYS2
export PATH="/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v13.1/bin/x64:$PATH"
# PowerShell
$env:PATH = "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1\bin\x64;$env:PATH"

Then run:

cargo test -p crush-gpu --features cuda -- cuda --nocapture

This executes 10 CUDA-specific tests:

  • LZ77 roundtrip (small, 64 KB, multi-tile)
  • GDeflate roundtrip (multiple sizes, batch)
  • Cancellation (pre-set and mid-batch)
  • CUDA vs CPU output parity
  • Backend info validation

Full Local GPU Verification

export PATH="/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v13.1/bin/x64:$PATH"

# All crush-gpu tests (wgpu + CUDA)
cargo test -p crush-gpu --features cuda --nocapture

# Clippy (CUDA mode)
cargo clippy -p crush-gpu --features cuda --all-targets -- -D warnings

CI Notes

GitHub Actions free tier has no GPU. CI runs cargo clippy --all-targets -- -D warnings and cargo test without the cuda feature. GPU and CUDA tests are local-only.

Note: nvcc does not need to be on PATH at build time. The cudarc crate uses the explicit cuda-13010 feature flag instead of auto-detecting the CUDA version.

License

MIT