Module reduction_intrinsics

Reduction Intrinsics for CUDA Code Generation

This module provides DSL intrinsics for parallel reductions that transpile to optimized CUDA code. A reduction aggregates one value per thread into a single result using an operation such as sum, min, or max.

§Reduction Hierarchy

┌─────────────────────────────────────────────────────────────────┐
│                     Grid-Level Reduction                        │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                    Block 0                                │  │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐         │  │
│  │  │ Warp 0  │ │ Warp 1  │ │ Warp 2  │ │ Warp N  │         │  │
│  │  │ shuffle │ │ shuffle │ │ shuffle │ │ shuffle │         │  │
│  │  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘         │  │
│  │       └───────────┴──┬────────┴───────────┘              │  │
│  │                      │ shared memory                     │  │
│  │                      ▼                                   │  │
│  │              block_reduce_sum()                          │  │
│  │                      │                                   │  │
│  └──────────────────────┼───────────────────────────────────┘  │
│                         ▼ atomicAdd                            │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │              Global Accumulator (mapped memory)          │  │
│  └─────────────────────────────────────────────────────────┘  │
│                         │ grid.sync() / barrier                │
│                         ▼                                      │
│              All threads read final result                     │
└─────────────────────────────────────────────────────────────────┘
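The three levels in the diagram can be modeled on the CPU to make the data flow concrete. The following is a minimal sketch in plain Rust, not the code this module generates: `warp_reduce_sum`, `pad_to_warp`, `block_reduce_sum`, and `grid_reduce_sum` are illustrative names, the warp size of 32 and the 1024-thread block limit are assumptions, and a plain `+=` stands in for `atomicAdd`.

```rust
// CPU-side model of the reduction hierarchy. All names are illustrative.
const WARP_SIZE: usize = 32; // assumed warp width
const MAX_BLOCK: usize = WARP_SIZE * WARP_SIZE; // assumed 1024-thread block limit

/// Tree reduction over one warp, following a shuffle-down pattern:
/// at each step, lane i adds the value held by lane i + offset.
/// Only the lanes that feed lane 0 are updated in this model.
fn warp_reduce_sum(lanes: &[f64; WARP_SIZE]) -> f64 {
    let mut vals = *lanes;
    let mut offset = WARP_SIZE / 2;
    while offset > 0 {
        for lane in 0..offset {
            vals[lane] += vals[lane + offset];
        }
        offset /= 2;
    }
    vals[0]
}

/// Pads a partial warp with 0.0, the identity for sum.
fn pad_to_warp(vals: &[f64]) -> [f64; WARP_SIZE] {
    let mut lanes = [0.0; WARP_SIZE];
    lanes[..vals.len()].copy_from_slice(vals);
    lanes
}

/// Block level: each warp produces one partial (as if written to shared
/// memory), then a single warp reduces the partials.
fn block_reduce_sum(threads: &[f64]) -> f64 {
    assert!(threads.len() <= MAX_BLOCK);
    let partials: Vec<f64> = threads
        .chunks(WARP_SIZE)
        .map(|warp| warp_reduce_sum(&pad_to_warp(warp)))
        .collect();
    warp_reduce_sum(&pad_to_warp(&partials))
}

/// Grid level: every block adds its result to one accumulator.
/// On the GPU this would be one atomic add per block.
fn grid_reduce_sum(values: &[f64], block_size: usize) -> f64 {
    let mut accumulator = 0.0;
    for block in values.chunks(block_size) {
        accumulator += block_reduce_sum(block);
    }
    accumulator
}

fn main() {
    let values: Vec<f64> = (1..=1000u32).map(f64::from).collect();
    let total = grid_reduce_sum(&values, 256);
    println!("grid total = {total}");
}
```

The model runs the levels sequentially, so it needs no barrier. On the GPU, threads may read the accumulator only after every block has contributed, which is the role of the `grid.sync()` step in the diagram.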

§Example

// Rust DSL
fn pagerank_phase1(ranks: &[f64], out_degree: &[u32], dangling_sum: &mut f64, n: u32) {
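    let idx = global_thread_idx();

    // Out-of-range threads contribute 0.0 instead of returning early, so every
    // thread still reaches the collective reduction and its barrier.
    let contrib = if idx < n && out_degree[idx] == 0 { ranks[idx] } else { 0.0 };
    reduce_and_broadcast(contrib, dangling_sum);  // All threads get the sum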
}
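For checking the result of a kernel like this one, a sequential reference is handy. The sketch below is plain host-side Rust and is not part of the DSL; `dangling_sum_reference` is a hypothetical helper name. It computes the same quantity as the example: the total rank held by nodes whose out-degree is zero.

```rust
/// Sequential reference: sum of ranks over nodes with no outgoing edges.
/// Hypothetical helper for validation; not part of the DSL.
fn dangling_sum_reference(ranks: &[f64], out_degree: &[u32]) -> f64 {
    let mut sum = 0.0;
    for (rank, degree) in ranks.iter().zip(out_degree.iter()) {
        if *degree == 0 {
            sum += *rank;
        }
    }
    sum
}

fn main() {
    let ranks = [0.25, 0.25, 0.25, 0.25];
    let out_degree = [2, 0, 1, 0];
    let sum = dangling_sum_reference(&ranks, &out_degree);
    println!("dangling sum = {sum}");
}
```

A parallel reduction adds values in a different order than this loop, so for general floating-point inputs the two results should be compared with a tolerance rather than for exact equality.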

Structs§

ReductionCodegenConfig
Configuration for reduction code generation.

Enums§

ReductionIntrinsic
Reduction intrinsic types for the DSL.
ReductionOp
Reduction operation types for code generation.
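This page does not list the variants of `ReductionOp`, so the following sketch uses a hypothetical `SketchOp` enum covering only the three operations named in the module description (sum, min, max). It illustrates the two properties any reduction operation needs: an associative combine step and an identity value that inactive or padded lanes can contribute without changing the result.

```rust
/// Hypothetical stand-in for a reduction operation; not the crate's enum.
#[derive(Clone, Copy, Debug)]
enum SketchOp {
    Sum,
    Min,
    Max,
}

impl SketchOp {
    /// Value that leaves the result unchanged when combined with anything.
    fn identity(self) -> f64 {
        match self {
            SketchOp::Sum => 0.0,
            SketchOp::Min => f64::INFINITY,
            SketchOp::Max => f64::NEG_INFINITY,
        }
    }

    /// Associative combine step applied pairwise during the reduction.
    fn combine(self, a: f64, b: f64) -> f64 {
        match self {
            SketchOp::Sum => a + b,
            SketchOp::Min => a.min(b),
            SketchOp::Max => a.max(b),
        }
    }
}

/// Sequential fold, starting from the identity.
fn reduce(op: SketchOp, values: &[f64]) -> f64 {
    values
        .iter()
        .fold(op.identity(), |acc, &v| op.combine(acc, v))
}

fn main() {
    let values = [3.0, 1.0, 2.0];
    for op in [SketchOp::Sum, SketchOp::Min, SketchOp::Max] {
        println!("{:?} -> {}", op, reduce(op, &values));
    }
}
```

Because the combine step is associative, the values can be grouped by warp and block in any order, which is what allows the tree-shaped evaluation on the GPU.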

Functions§

generate_inline_block_reduce
Generate inline reduction code (without a helper function call).
generate_inline_grid_reduce
Generate inline grid reduction with atomic accumulation.
generate_inline_reduce_and_broadcast
Generate inline reduce-and-broadcast code.
generate_reduction_helpers
Generate CUDA helper functions for reduction operations.
transpile_reduction_call
Transpile a reduction intrinsic call to CUDA code.