Reduction Intrinsics for CUDA Code Generation
This module provides DSL intrinsics for parallel reductions that transpile to optimized CUDA code. A reduction aggregates one value per thread into a single result using an operation such as sum, min, or max.
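As a rough illustration of what aggregating values across threads means, the sketch below models a warp-level tree reduction on the CPU. The `Op` enum, the `warp_reduce` function, and the 32-lane array are assumptions made for this illustration; they are not part of this module's API, and real GPU code would exchange values with shuffle instructions rather than array indexing.

```rust
/// Hypothetical reduction operators for this sketch (not this module's `ReductionOp`).
#[derive(Clone, Copy)]
enum Op {
    Sum,
    Min,
    Max,
}

impl Op {
    /// Combine two partial results.
    fn combine(self, a: f64, b: f64) -> f64 {
        match self {
            Op::Sum => a + b,
            Op::Min => a.min(b),
            Op::Max => a.max(b),
        }
    }
}

/// CPU model of a shuffle-down tree reduction over one 32-lane warp.
/// After the loop, lane 0 holds the reduction of all 32 inputs.
fn warp_reduce(op: Op, mut lanes: [f64; 32]) -> f64 {
    let mut offset = 16;
    while offset > 0 {
        // The snapshot models all lanes reading their neighbour at the same time.
        let prev = lanes;
        for lane in 0..32 {
            // In this model, lanes whose source is out of range keep their value.
            if lane + offset < 32 {
                lanes[lane] = op.combine(prev[lane], prev[lane + offset]);
            }
        }
        offset /= 2;
    }
    lanes[0]
}

fn main() {
    let mut lanes = [0.0; 32];
    for (i, v) in lanes.iter_mut().enumerate() {
        *v = (i + 1) as f64; // 1.0, 2.0, ..., 32.0
    }
    println!("sum = {}", warp_reduce(Op::Sum, lanes));
    println!("min = {}", warp_reduce(Op::Min, lanes));
    println!("max = {}", warp_reduce(Op::Max, lanes));
}
```

The halving offsets (16, 8, 4, 2, 1) give five combine steps for 32 lanes, which is why a tree reduction is preferred over a serial loop across threads.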
§Reduction Hierarchy
┌─────────────────────────────────────────────────────────────────┐
│                      Grid-Level Reduction                       │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                          Block 0                          │  │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐          │  │
│  │  │ Warp 0  │ │ Warp 1  │ │ Warp 2  │ │ Warp N  │          │  │
│  │  │ shuffle │ │ shuffle │ │ shuffle │ │ shuffle │          │  │
│  │  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘          │  │
│  │       └───────────┴─────┬─────┴───────────┘               │  │
│  │                         │ shared memory                   │  │
│  │                         ▼                                 │  │
│  │                block_reduce_sum()                         │  │
│  │                         │                                 │  │
│  └─────────────────────────┼─────────────────────────────────┘  │
│                            ▼ atomicAdd                          │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │            Global Accumulator (mapped memory)             │  │
│  └───────────────────────────────────────────────────────────┘  │
│                            │ grid.sync() / barrier              │
│                            ▼                                    │
│              All threads read final result                      │
└─────────────────────────────────────────────────────────────────┘
§Example
// Rust DSL
fn pagerank_phase1(ranks: &[f64], out_degree: &[u32], dangling_sum: &mut f64, n: u32) {
    let idx = global_thread_idx();
    if idx >= n { return; }
    let contrib = if out_degree[idx] == 0 { ranks[idx] } else { 0.0 };
    reduce_and_broadcast(contrib, dangling_sum); // All threads get sum
}
Structs§
- ReductionCodegenConfig - Configuration for reduction code generation.
Enums§
- ReductionIntrinsic - Reduction intrinsic types for the DSL.
- ReductionOp - Reduction operation types for code generation.
Functions§
- generate_inline_block_reduce - Generate inline reduction code (without helper function call).
- generate_inline_grid_reduce - Generate inline grid reduction with atomic accumulation.
- generate_inline_reduce_and_broadcast - Generate inline reduce-and-broadcast code.
- generate_reduction_helpers - Generate CUDA helper functions for reduction operations.
- transpile_reduction_call - Transpile a reduction intrinsic call to CUDA code.
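To give a feel for what generators like these might produce, here is a minimal sketch of a Rust function that emits a CUDA warp-level reduction built on `__shfl_down_sync`. `SketchOp` and `emit_warp_reduce` are hypothetical names invented for this illustration and are not this module's API; the CUDA text the real functions emit may well differ. The sketch assumes `double` values and a full 32-lane warp (mask `0xffffffffu`).

```rust
/// Hypothetical operator enum for this sketch (not this module's `ReductionOp`).
#[derive(Clone, Copy)]
enum SketchOp {
    Sum,
    Min,
    Max,
}

impl SketchOp {
    /// Suffix used in the emitted function name.
    fn name(self) -> &'static str {
        match self {
            SketchOp::Sum => "sum",
            SketchOp::Min => "min",
            SketchOp::Max => "max",
        }
    }

    /// CUDA expression that combines operands `a` and `b` (as `double`).
    fn cuda_expr(self, a: &str, b: &str) -> String {
        match self {
            SketchOp::Sum => format!("{} + {}", a, b),
            SketchOp::Min => format!("fmin({}, {})", a, b),
            SketchOp::Max => format!("fmax({}, {})", a, b),
        }
    }
}

/// Emit CUDA source for a device function that reduces a `double` across one
/// warp. Assumes all 32 lanes are active; lane 0 receives the full result.
fn emit_warp_reduce(op: SketchOp) -> String {
    let shuffled = "__shfl_down_sync(0xffffffffu, val, offset)";
    let lines = [
        format!("__device__ double warp_reduce_{}(double val) {{", op.name()),
        "    for (int offset = 16; offset > 0; offset >>= 1) {".to_string(),
        format!("        val = {};", op.cuda_expr("val", shuffled)),
        "    }".to_string(),
        "    return val;".to_string(),
        "}".to_string(),
    ];
    lines.join("\n")
}

fn main() {
    // Print the generated CUDA source for a sum reduction.
    println!("{}", emit_warp_reduce(SketchOp::Sum));
}
```

In a layout like the hierarchy diagram, a block-level helper would then combine the per-warp results through shared memory, and a grid-level step would accumulate the block results atomically.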