# crush-gpu

GPU-accelerated tile-based compression engine with 32-way parallel decompression via GDeflate.
## Overview
crush-gpu is the GPU acceleration crate for the Crush compression toolkit. It implements a GDeflate-inspired compression format designed for massively parallel GPU decompression.
Key design principles:
- 64 KB independent tiles enable parallel processing and random access
- 32-way sub-stream parallelism matches GPU warp/wavefront width
- Batched GPU dispatch minimizes host-GPU synchronization overhead
- Automatic CPU fallback when no GPU is available or decompression fails
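The tiling model behind the first principle can be sketched in a few lines. This is illustrative only; the real tile handling lives in the `engine` module:

```rust
/// Default tile size: 64 KB, matching the format's independent-tile design.
const TILE_SIZE: usize = 64 * 1024;

/// Split input into fixed-size tiles; each tile can be compressed or
/// decompressed independently, which is what enables parallel processing
/// and random access.
fn tiles(data: &[u8]) -> Vec<&[u8]> {
    data.chunks(TILE_SIZE).collect()
}
```

Because no tile depends on another, tiles can be dispatched to GPU threads in any order and individually re-decoded for random access.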
## Architecture

```
┌─────────────────────────────────┐
│            engine.rs            │
│    compress() / decompress()    │
└──────┬──────────────┬───────────┘
       │              │
┌──────▼──────┐ ┌─────▼──────────┐
│ gdeflate.rs │ │    backend/    │
│ CPU compress│ │ GPU decompress │
│ CPU fallback│ │  (wgpu/CUDA)   │
└─────────────┘ └─────┬──────────┘
                      │
                ┌─────▼──────────┐
                │    shader/     │
                │  WGSL + CUDA   │
                └────────────────┘
```
## Modules

| Module | Purpose |
|---|---|
| `engine` | Top-level compress/decompress API, tile management, random access |
| `backend` | GPU abstraction layer (`ComputeBackend` trait), wgpu implementation |
| `format` | CGPU binary file format (headers, tile index, footer) |
| `gdeflate` | GDeflate compression (CPU) and decompression (CPU fallback) |
| `scorer` | GPU eligibility scoring (file size, entropy, GPU availability) |
| `entropy` | Shannon entropy calculation for compressibility assessment |
| `vectorize` | Heuristic for text-heavy data detection |
| `lz77` | LZ77 match finding for the v1 sub-stream format |
## Performance

### GPU Decompression Throughput

Test environment:

- GPU: NVIDIA GeForce RTX 3060 Ti (8 GB VRAM, 38 SMs)
- Rust: 1.93.1 (stable), release mode

Benchmark results (Criterion, wgpu Vulkan backend):
| Corpus | GPU Throughput | CPU Throughput | Winner |
|---|---|---|---|
| log-1MB | 141 MiB/s | 319 MiB/s | CPU (small data) |
| binary-1MB | 85 MiB/s | 126 MiB/s | CPU (small data) |
| mixed-1MB | 87 MiB/s | 176 MiB/s | CPU (small data) |
| mixed-10MB | 355 MiB/s | 179 MiB/s | GPU 1.98x |
GPU decompression outperforms CPU at larger data sizes where the per-tile dispatch overhead is amortized. The crossover point is around 2-4 MB.
### Real-World Large File Performance (1.8 GB compressed, ~62K tiles)
| Backend | Throughput | vs CPU |
|---|---|---|
| CUDA (multi-block) | 560 MiB/s | 3.1x |
| wgpu (Vulkan) | 386 MiB/s | 2.2x |
| CPU (all cores) | 179 MiB/s | baseline |
CUDA's multi-block kernel launch (grid_dim = num_tiles) distributes tiles across all 38 SMs simultaneously. For large files with thousands of tiles, CUDA achieves 1.45x over wgpu and 3.1x over CPU.
### Multi-Block vs Per-Tile Dispatch
The CUDA backend uses a single multi-block kernel launch per batch (up to 512 tiles). Each CUDA block (32 threads) processes one tile, enabling all SMs to work in parallel:
| Dispatch Mode | Throughput (large file) | Improvement |
|---|---|---|
| Per-tile sequential | ~10 MiB/s | baseline |
| Multi-block (current) | 560 MiB/s | 56x |
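The batch split described above can be sketched as follows. The 512-tile cap comes from this section; the function name is illustrative:

```rust
/// Maximum tiles per kernel launch, per the batching policy above.
const MAX_TILES_PER_LAUNCH: usize = 512;

/// Group tiles into kernel launches of at most 512 tiles each.
/// Returns (first_tile, tile_count) per launch; within a launch,
/// each tile maps to one 32-thread block (grid_dim = tile_count).
fn launches(num_tiles: usize) -> Vec<(usize, usize)> {
    (0..num_tiles)
        .step_by(MAX_TILES_PER_LAUNCH)
        .map(|start| (start, MAX_TILES_PER_LAUNCH.min(num_tiles - start)))
        .collect()
}
```

For the ~62K-tile file above, this yields on the order of 120 launches instead of 62,000 per-tile dispatches, which is where the 56x improvement comes from.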
### Compression Throughput (CPU)
| Corpus | Throughput |
|---|---|
| log-text-1MB | 179 MiB/s |
| binary-1MB | 69 MiB/s |
| mixed-1MB | 98 MiB/s |
| mixed-10MB | 100 MiB/s |
## File Format (CGPU)

```
┌──────────────────────┐
│ GpuFileHeader (64B)  │  Magic "CGPU", version, tile_size, tile_count
├──────────────────────┤
│ Tile 0               │  TileHeader (32B) + compressed payload
│ (128-byte aligned)   │
├──────────────────────┤
│ Tile 1               │
├──────────────────────┤
│ ...                  │
├──────────────────────┤
│ Tile Index           │  O(1) random access to any tile
├──────────────────────┤
│ GpuFileFooter (24B)  │  Index offset, checksum, magic
└──────────────────────┘
```
- Tile size: 64 KB (default), independently decompressible
- Alignment: 128-byte boundaries for GPU memory coalescing
- Checksums: Optional per-tile CRC32 integrity verification
- Random access: Tile index enables O(1) decompression of any tile (~1 ms)
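The alignment rule can be illustrated with a small offset calculation. Header and tile-header sizes are taken from the diagram above; the real `format` module may differ in detail:

```rust
/// Sizes from the CGPU layout diagram.
const HEADER_SIZE: u64 = 64;      // GpuFileHeader
const TILE_HEADER_SIZE: u64 = 32; // TileHeader
const TILE_ALIGN: u64 = 128;      // GPU memory-coalescing boundary

/// Round an offset up to the next multiple of `align`.
fn align_up(off: u64, align: u64) -> u64 {
    (off + align - 1) / align * align
}

/// Compute the aligned start offset of each tile given its compressed
/// payload size (illustrative; the real index is stored in the file).
fn tile_offsets(payload_sizes: &[u64]) -> Vec<u64> {
    let mut off = HEADER_SIZE;
    payload_sizes
        .iter()
        .map(|&sz| {
            off = align_up(off, TILE_ALIGN);
            let start = off;
            off += TILE_HEADER_SIZE + sz;
            start
        })
        .collect()
}
```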
## GDeflate Algorithm
GDeflate distributes DEFLATE across 32 parallel sub-streams:
- LZ77 match finding (greedy, 3-byte hash chains)
- Round-robin distribution of symbols across 32 sub-streams
- Fixed Huffman encoding (BTYPE=01) per sub-stream
- Interleaved serialization for GPU-friendly memory access
On decompression, 32 GPU threads each decode one sub-stream in parallel, then reconstruct the original data.
Reference: IETF draft draft-uralsky-gdeflate-00
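The round-robin distribution and its inverse can be sketched as follows. The symbol type is simplified to `u16` here; real GDeflate symbols are Huffman-coded literals and length/distance pairs:

```rust
/// Sub-stream count matches GPU warp/wavefront width.
const SUBSTREAMS: usize = 32;

/// Distribute symbols round-robin across 32 sub-streams: symbol i goes
/// to stream i % 32.
fn distribute(symbols: &[u16]) -> Vec<Vec<u16>> {
    let mut streams: Vec<Vec<u16>> = vec![Vec::new(); SUBSTREAMS];
    for (i, &s) in symbols.iter().enumerate() {
        streams[i % SUBSTREAMS].push(s);
    }
    streams
}

/// Inverse mapping used after 32 threads decode their streams in
/// parallel: symbol i is found at streams[i % 32][i / 32].
fn reassemble(streams: &[Vec<u16>], n: usize) -> Vec<u16> {
    (0..n).map(|i| streams[i % SUBSTREAMS][i / SUBSTREAMS]).collect()
}
```

The fixed mapping is what makes the decode side embarrassingly parallel: each thread's read position depends only on its own stream, not on its neighbors.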
## Usage

### As a Library

```rust
use crush_gpu::{compress, decompress, EngineConfig};
use std::sync::atomic::AtomicBool;

// Argument lists below are illustrative; see the crate docs for exact signatures.
let cancel = AtomicBool::new(false);
let config = EngineConfig::default();

// Compress (always on CPU)
let data = b"Hello, GPU compression!".repeat(1000);
let compressed = compress(&data, &config, &cancel).expect("compression failed");

// Decompress (GPU if available, CPU fallback)
let decompressed = decompress(&compressed, &config, &cancel).expect("decompression failed");
assert_eq!(data, decompressed);
```
### Configuration

```rust
use crush_gpu::EngineConfig;

// Start from the defaults and override fields as needed
// (specific field names are documented on EngineConfig).
let config = EngineConfig {
    ..EngineConfig::default()
};
```
### Random Access

```rust
use crush_gpu::{decompress_tile_by_index, load_tile_index, EngineConfig};

// Argument lists below are illustrative; see the crate docs for exact signatures.
let config = EngineConfig::default();
let archive: &[u8] = &compressed_data;

// Load the tile index (O(1) per tile)
let index = load_tile_index(archive).expect("failed to load tile index");

// Decompress only the tile you need
let tile_data = decompress_tile_by_index(archive, &index, 42, &config)
    .expect("failed to decompress tile");
```
### As a crush-core Plugin

crush-gpu auto-registers as a crush-core plugin via `linkme`:

```rust
use crush_core::{init_plugins, list_plugins};

// Call shapes are illustrative; see crush-core's plugin docs.
init_plugins().expect("plugin initialization failed");
for plugin in list_plugins() {
    println!("{:?}", plugin);
}
```
## GPU Backend

### Backends

| Backend | API | Feature flag | GPU vendor |
|---|---|---|---|
| wgpu | Vulkan 1.2+ / Metal 2+ / DX12 | (default) | Any supported GPU |
| CUDA | CUDA 13.x via nvrtc | `cuda` | NVIDIA (compute capability 7.0+) |

Backend selection priority (with `--gpu-backend auto`, the default):

1. CUDA (if the `cuda` feature is enabled and an NVIDIA GPU is present)
2. wgpu (Vulkan on Windows/Linux, Metal on macOS)

Override at runtime: `crush compress --gpu-backend cuda` or `--gpu-backend wgpu`.
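The priority order amounts to a simple cascade, sketched here with illustrative names (the real selection logic lives in the `backend` module):

```rust
#[derive(Debug, PartialEq)]
enum Backend {
    Cuda,
    Wgpu,
    Cpu, // fallback when no GPU backend is usable
}

/// Pick the highest-priority usable backend, mirroring the list above:
/// CUDA first, then wgpu, then the CPU fallback.
fn select_backend(cuda_feature: bool, nvidia_present: bool, wgpu_ok: bool) -> Backend {
    if cuda_feature && nvidia_present {
        Backend::Cuda
    } else if wgpu_ok {
        Backend::Wgpu
    } else {
        Backend::Cpu
    }
}
```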
### Requirements
- 2 GB+ VRAM (discrete GPU recommended)
- Vulkan, Metal, or DX12 driver
- CUDA backend additionally requires NVIDIA CUDA Toolkit 13.x and Visual Studio 2022 Build Tools (Windows)
## GPU Eligibility
The scorer module automatically determines whether GPU acceleration benefits a given workload:
| Criterion | Threshold |
|---|---|
| File size | > 100 MB |
| GPU available | Yes |
| Shannon entropy | < 7.5 bits/byte |
All criteria must pass for GPU dispatch. High-entropy (incompressible) data is rejected to avoid wasting GPU resources.
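A minimal sketch of the eligibility check, using the thresholds from the table above (the real `scorer` module may weigh additional signals):

```rust
/// Shannon entropy in bits per byte, as computed by the `entropy` module's
/// approach: 8.0 means incompressible, 0.0 means trivially compressible.
fn shannon_entropy(data: &[u8]) -> f64 {
    let mut counts = [0u64; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let n = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// All three criteria from the table must pass for GPU dispatch.
fn gpu_eligible(file_size: u64, gpu_available: bool, entropy: f64) -> bool {
    file_size > 100 * 1024 * 1024 && gpu_available && entropy < 7.5
}
```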
## Running Benchmarks

```sh
# Throughput benchmarks (GPU vs CPU decompression) and
# compression ratio benchmarks (select a single target
# with `cargo bench --bench <name>`)
cargo bench -p crush-gpu

# All crush-gpu tests
cargo test -p crush-gpu
```
## Development

```sh
# Build
cargo build -p crush-gpu

# Build with CUDA support
cargo build -p crush-gpu --features cuda

# Test (standard — no GPU required for CI)
cargo test -p crush-gpu

# Clippy
cargo clippy -p crush-gpu --all-targets -- -D warnings

# Docs
cargo doc -p crush-gpu --no-deps
```
## CUDA Testing

CUDA tests are feature-gated behind `#[cfg(feature = "cuda")]` and require:

- An NVIDIA GPU with >= 2 GB VRAM
- CUDA Toolkit nvrtc DLLs on `PATH`

On Windows with CUDA 13.1, add the runtime libraries to your shell:

```sh
# Git Bash / MSYS2
export PATH="/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v13.1/bin/x64:$PATH"
```

```powershell
# PowerShell
$env:PATH = "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1\bin\x64;$env:PATH"
```

Then run:

```sh
cargo test -p crush-gpu --features cuda
```

This executes 10 CUDA-specific tests:
- LZ77 roundtrip (small, 64 KB, multi-tile)
- GDeflate roundtrip (multiple sizes, batch)
- Cancellation (pre-set and mid-batch)
- CUDA vs CPU output parity
- Backend info validation
## Full Local GPU Verification

```sh
# All crush-gpu tests (wgpu + CUDA)
cargo test -p crush-gpu --features cuda

# Clippy (CUDA mode)
cargo clippy -p crush-gpu --features cuda --all-targets -- -D warnings
```
## CI Notes

GitHub Actions' free tier has no GPU. CI runs `cargo clippy --all-targets -- -D warnings` and `cargo test` without the `cuda` feature. GPU and CUDA tests are local-only.

Note: `nvcc` does not need to be on `PATH` at build time. The `cudarc` crate uses the explicit `cuda-13010` feature flag instead of auto-detecting the CUDA version.
## License
MIT