# warp-types: Type-Safe GPU Warp Programming via Linear Typestate

A type system that prevents shuffle-from-inactive-lane bugs in GPU warp programming by tracking active lane masks at compile time.

**Status:** Research prototype with real GPU execution. 317 unit + 50 example + 28 doc tests (395 total). Zero runtime overhead verified at Rust MIR, LLVM IR, and NVIDIA PTX levels. Cargo-integrated GPU compilation pipeline. C++20 header (`include/warp_types.h`) for CUDA/HIP interop.
## The Problem

GPU warp primitives like shuffle enable fast intra-warp communication, but reading from an inactive lane is undefined behavior:

```cuda
if (participate) {
    int partner = __shfl_xor_sync(0xFFFFFFFF, data, 1); // BUG
    // Reads from inactive lanes — undefined result
}
```
This compiles without warnings, may appear to work, and fails silently. NVIDIA's own cuda-samples contains this bug (#398). Their core library CUB contains a variant (CCCL#854; the reporter later noted it may have been a false positive, but the source-level pattern is ill-typed in our system regardless). The PIConGPU plasma physics simulation ran for months with undefined behavior in a divergent branch, undetected because pre-Volta hardware masked the violation (#2514). NVIDIA deprecated the entire `__shfl` API family in CUDA 9.0 to address the bug class.
We survey 21 documented bugs across 16 real-world projects. State-of-the-art persistent thread programs (e.g., the Hazy megakernel) avoid the problem by maintaining warp-uniform execution.
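To see the failure mode concretely, here is a minimal CPU model of a divergent butterfly shuffle. This is our own illustrative sketch, not the warp-types API: the `GARBAGE` constant stands in for whatever stale value real hardware happens to return on an inactive-lane read (in this model, if garbage happened to be 0, the divergent sum would even come out right, which is exactly why such bugs hide).

```rust
// CPU model of warp shuffle divergence (illustration only, NOT the library API).
const WARP: usize = 32;
const GARBAGE: i32 = 0x00DE_AD00; // stand-in for an unspecified stale value

/// Model of __shfl_xor_sync: lane `lane` reads from lane `lane ^ lane_mask`,
/// which is only well-defined when that partner lane is active.
fn shfl_xor(vals: &[i32; WARP], active: &[bool; WARP], lane: usize, lane_mask: usize) -> i32 {
    let partner = lane ^ lane_mask;
    if active[partner] { vals[partner] } else { GARBAGE }
}

/// Butterfly sum of one value per lane; correct only with all 32 lanes active.
fn butterfly_sum(active: [bool; WARP]) -> i32 {
    let mut vals = [1i32; WARP];
    let mut step = 1;
    while step < WARP {
        let prev = vals;
        for lane in 0..WARP {
            if active[lane] {
                vals[lane] = prev[lane] + shfl_xor(&prev, &active, lane, step);
            }
        }
        step *= 2;
    }
    vals[0]
}

fn main() {
    // All lanes active: the butterfly is well-defined and sums to 32.
    assert_eq!(butterfly_sum([true; WARP]), 32);

    // Divergent branch: even lanes only. Every first-step read hits an
    // inactive odd lane, so garbage propagates into the result.
    let mut evens = [false; WARP];
    for i in (0..WARP).step_by(2) { evens[i] = true; }
    assert_ne!(butterfly_sum(evens), 16); // expected active-lane count: 16
}
```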
## The Solution

Track which lanes are active in the type system. Shuffle only exists on fully-active warps:

```rust
use warp_types::*;

let warp: Warp<All> = kernel_entry();

// After diverge, shuffle is gone from the type:
let (evens, odds) = warp.diverge_even_odd();
// evens.shuffle_xor(data, 1); // COMPILE ERROR — method not found

// Merging the complementary halves restores the full warp:
let merged: Warp<All> = evens.merge(odds);
let partner = merged.shuffle_xor(data, 1); // OK
```
The type system is strictly more permissive than current best practice (which prohibits divergence) while being strictly safer than the CUDA status quo (which allows it unchecked).
Key Ideas
Warp<S>— Warps carry active set types.Warp<All>= all lanes active.- Diverge produces complements —
divergesplits a warp into two sub-warps with disjoint active sets. - Merge requires complements — Reconvergence is verified at compile time via
ComplementOf<T>trait bounds. - Method availability = safety —
shuffle_xoronly exists onWarp<All>. Not checked — absent. - Zero overhead —
Warp<S>contains onlyPhantomData<S>. Types are erased completely. - Data-dependent divergence —
diverge_dynamic(mask)handles runtime predicates. The mask is dynamic, but the complement is structural — both branches must merge before shuffle. - Cross-function inference — Generic functions take
Warp<S>withS: ActiveSet. Rust infersSat call sites.
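The ideas above condense into a few lines of plain Rust. This is an illustrative sketch, not the crate's actual definitions (the real crate seals its marker types and lowers shuffles to GPU intrinsics), but it shows how method availability, complement tracking, and zero size all fall out of the typestate pattern:

```rust
use std::marker::PhantomData;

// Marker types for active sets (sealed in the real crate).
struct All;
struct Even;
struct Odd;

// The complement relation drives merge: Even + Odd => All.
trait ComplementOf<T> { type Merged; }
impl ComplementOf<Odd> for Even { type Merged = All; }
impl ComplementOf<Even> for Odd { type Merged = All; }

// Zero-sized: only the phantom marker, fully erased at codegen.
struct Warp<S>(PhantomData<S>);

impl Warp<All> {
    fn kernel_entry() -> Self { Warp(PhantomData) }
    // shuffle_xor exists ONLY on Warp<All>. Stand-in body for illustration;
    // on a GPU this would lower to a shfl intrinsic.
    fn shuffle_xor(&self, data: u32, lane_mask: u32) -> u32 { data ^ lane_mask }
    fn diverge_even_odd(self) -> (Warp<Even>, Warp<Odd>) {
        (Warp(PhantomData), Warp(PhantomData))
    }
}

impl<S> Warp<S> {
    // Merge consumes both halves and requires them to be complements.
    fn merge<T>(self, _other: Warp<T>) -> Warp<<S as ComplementOf<T>>::Merged>
    where
        S: ComplementOf<T>,
    {
        Warp(PhantomData)
    }
}

fn main() {
    let warp = Warp::<All>::kernel_entry();
    let (evens, odds) = warp.diverge_even_odd();
    // evens.shuffle_xor(7, 1); // does not compile: no such method on Warp<Even>
    let merged: Warp<All> = evens.merge(odds);
    assert_eq!(std::mem::size_of::<Warp<All>>(), 0); // zero overhead
    let _ = merged.shuffle_xor(7, 1);
}
```

Because `Warp<Even>` simply lacks `shuffle_xor`, the error is a plain "method not found", with no lint or runtime check involved.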
## The Killer Demo

cuda-samples #398 on the same GPU (RTX 4000 Ada):

```
Untyped (buggy): sum = 1   ← silent wrong answer
Typed (fixed):   sum = 32  ← correct, AND the bug is a compile error
```

The buggy pattern literally cannot be expressed: `Warp<Lane0>` has no `shfl_down` method.
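A CPU re-enactment of the demo's arithmetic, under the stated assumption that an inactive-lane (or out-of-range) read returns 0, which is one behavior consistent with the observed `sum = 1`. The function names are ours, not the library's:

```rust
const WARP: usize = 32;

/// Model of __shfl_down_sync: lane i reads from lane i + delta if that lane
/// is active; inactive and out-of-range reads are modeled as 0 here.
fn shfl_down(vals: &[i32; WARP], active: &[bool; WARP], lane: usize, delta: usize) -> i32 {
    let src = lane + delta;
    if src < WARP && active[src] { vals[src] } else { 0 }
}

/// Tree reduction via shfl_down; lane 0 ends up with the warp-wide sum.
fn tree_reduce(active: [bool; WARP]) -> i32 {
    let mut vals = [1i32; WARP]; // each lane contributes 1
    let mut delta = WARP / 2;
    while delta > 0 {
        let prev = vals;
        for lane in 0..WARP {
            if active[lane] {
                vals[lane] = prev[lane] + shfl_down(&prev, &active, lane, delta);
            }
        }
        delta /= 2;
    }
    vals[0]
}

fn main() {
    // Typed version: the type system forces the reduction onto a full warp.
    assert_eq!(tree_reduce([true; WARP]), 32);

    // Untyped bug: shuffle issued while only lane 0 is active. Every read
    // hits an inactive lane, so lane 0 never sees anyone else's value.
    let mut lane0_only = [false; WARP];
    lane0_only[0] = true;
    assert_eq!(tree_reduce(lane0_only), 1); // silent wrong answer
}
```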
## Quick Start

### Write GPU Kernels

```rust
// my-kernels/src/lib.rs
use warp_types::*;
```

The matching `build.rs` (cross-compilation to PTX via `WarpBuilder`) and `src/main.rs` (host-side launch) are shown end-to-end in `examples/gpu-project/`.
## Claims and Evidence

| Claim | Evidence | Command |
|---|---|---|
| Type safety (diverged warp can't shuffle, merge requires complements) | 11 compile-fail doctests | `cargo test --doc` |
| Real bug caught at compile time | 7 real-bug + 1 synthetic examples (21 bugs surveyed) | `cargo test --examples` |
| Hardware reproduction | Deterministic wrong result on H200 SXM, RTX 4000 Ada | `bash reproduce/demo.sh` |
| Real GPU execution | 4 kernels PASS on H200 SXM and RTX 4000 Ada via cudarc | `cd examples/gpu-project && cargo run` |
| Shuffle semantics | Verified on H200 (sm_90), RTX 4000 Ada (sm_89), MI300X (gfx942) | `bash reproduce/runpod-h200.sh` |
| Cargo integration | `#[warp_kernel]` + `WarpBuilder` + `Kernels` struct | `cd examples/gpu-project && cargo run` |
| Zero overhead | Verified at MIR, LLVM IR, and PTX levels (search for `warp_types_zero_overhead_butterfly` in IR) | `bash reproduce/compare_ptx.sh` or `cargo rustc --release --lib -- --emit=llvm-ir` |
| Soundness (progress + preservation) | Full Lean 4 mechanization (31 named theorems), zero `sorry`, zero axioms | `cd lean && lake build` |
| CUB-equivalent primitives | Typed reduce, scan, broadcast (8 tests) | `cargo test cub` |
| Fence-divergence safety | Type-state write tracking (6 tests) | `cargo test fence` |
| Platform portability (32-lane warp via CpuSimd, 64-lane support) | `u64` masks, AMD wavefronts, `Platform` trait | `cargo test warp_size` |
| Gradual typing (`DynWarp` ↔ `Warp`) | Runtime/compile-time bridge (32 tests) | `cargo test gradual` |
| All claims | Full test suite (395 tests) | `cargo test && cargo test --examples` |
## Project Structure

```text
warp-types/
├── src/
│   ├── lib.rs           # Core exports + GpuValue trait + warp_kernel re-export
│   ├── active_set.rs    # Marker types: All, Even, Odd, LowHalf, ... (u64 masks, sealed)
│   ├── warp.rs          # Warp<S> — the core parameterized type
│   ├── data.rs          # PerLane<T>, Uniform<T>, SingleLane<T, N>
│   ├── diverge.rs       # Split warps by predicate
│   ├── merge.rs         # Rejoin complementary sub-warps
│   ├── shuffle.rs       # Shuffle/ballot/reduce (Warp<All> only) + permutation algebra
│   ├── cub.rs           # CUB-equivalent typed warp primitives
│   ├── sort.rs          # Bitonic sort (typed warp primitives)
│   ├── tile.rs          # Tile-level warp partitioning
│   ├── dynamic.rs       # Data-dependent divergence (DynDiverge)
│   ├── gpu.rs           # PTX/AMDGPU intrinsics + GpuShuffle trait
│   ├── fence.rs         # Fence-divergence type-state machine
│   ├── block.rs         # Block-level shared memory + reductions
│   ├── proof.rs         # Executable soundness sketch (3 theorems, 3 lemmas)
│   ├── platform.rs      # CpuSimd<N> / GpuWarp32 / GpuWarp64 dual-mode
│   ├── gradual.rs       # DynWarp ↔ Warp<S> gradual typing bridge
│   ├── simwarp.rs       # Multi-lane warp simulator (real shuffle semantics for testing)
│   └── research/        # 24 research exploration modules (incl. warp_size portability)
├── warp-types-macros/   # warp_sets! proc macro (active set hierarchy generation)
├── warp-types-kernel/   # #[warp_kernel] proc macro (GPU kernel entry points)
├── warp-types-builder/  # WarpBuilder (build.rs cross-compilation to PTX)
├── examples/
│   ├── nvidia_cuda_samples_398.rs   # Real NVIDIA bug, caught by types
│   ├── cub_cccl_854.rs              # CUB/CCCL sub-warp shuffle bug
│   ├── picongpu_2514.rs             # PIConGPU months of silent UB
│   ├── llvm_155682.rs               # LLVM lane-0 conditional shuffle
│   ├── opencv_12320.rs              # OpenCV warpScanInclusive deadlock
│   ├── pytorch_98157.rs             # PyTorch __activemask() misuse
│   ├── tvm_17307.rs                 # TVM LowerThreadAllreduce H100 crash
│   ├── demo_bug_that_types_catch.rs # Synthetic demonstration
│   ├── gpu-project/                 # End-to-end cargo→GPU example (Rust host)
│   └── cuda/                        # C++ host loading Rust PTX (CUDA Driver API)
├── include/
│   └── warp_types.h     # C++20 header: type-safe warp programming for CUDA/HIP
├── reproduce/
│   ├── demo.sh          # Full demonstration script
│   ├── host/            # cudarc host runner for real GPU execution
│   └── *.rs, *.cu       # PTX comparison + hardware reproduction
├── lean/                # Lean 4 formalization (31 named theorems, zero sorry)
├── paper/               # Preprint (markdown)
├── tutorial/            # Step-by-step tutorial
├── blog/                # Blog post draft
└── INTEGRATION.md       # Integration guide
```
## Limitations

- **Affine, not linear.** Rust's type system is affine (values can be dropped without use). The Lean formalization models linear semantics; in Rust, a `Warp<S>` can be silently dropped without merging. `#[must_use]` warnings catch this in practice, but it is not a hard error. See the paper's Section 6 for discussion.
- **AMD partially verified.** `u64` masks and the `warp64` feature support 64-lane wavefronts. MI300X (gfx942) is verified for mask correctness via HIP; the full GPU execution path is untested (no amdgcn Rust target yet). NVIDIA is verified on H200 SXM (compute 9.0, Hopper) and RTX 4000 Ada (compute 8.9, Ada Lovelace): shuffle semantics, zero-overhead PTX, and typed kernel execution (4/4 kernels PASS on both GPUs).
- **Nightly required for GPU kernels.** `#[warp_kernel]` requires the `abi_ptx` and `asm_experimental_arch` features. The type system itself works on stable Rust.
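The affine limitation is easy to demonstrate in plain Rust (this sketch uses our own toy `SubWarp` type, not the crate): a `#[must_use]` sub-warp can fall out of scope without ever being merged, and the compiler accepts it, at most emitting a warning.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts sub-warps that were dropped without being merged back.
static DROPPED_UNMERGED: AtomicUsize = AtomicUsize::new(0);

#[must_use = "sub-warps should be merged before the warp shuffles again"]
struct SubWarp;

impl Drop for SubWarp {
    fn drop(&mut self) {
        DROPPED_UNMERGED.fetch_add(1, Ordering::Relaxed);
    }
}

fn diverge() -> (SubWarp, SubWarp) { (SubWarp, SubWarp) }

fn main() {
    {
        let (_evens, _odds) = diverge();
        // Both halves fall out of scope here without ever being merged.
    }
    // Affine typing: the unmerged drop compiled cleanly; a linear type
    // system would have rejected this program outright.
    assert_eq!(DROPPED_UNMERGED.load(Ordering::Relaxed), 2);
}
```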
## License

MIT