Module reduction_intrinsics

Reduction Intrinsics for CUDA Code Generation

This module provides DSL intrinsics for efficient parallel reductions and transpiles them to optimized CUDA code. A reduction combines one value per thread into a single result using an operation such as sum, min, or max.

§Reduction Hierarchy

┌─────────────────────────────────────────────────────────────────┐
│                     Grid-Level Reduction                        │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                    Block 0                                │  │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐         │  │
│  │  │ Warp 0  │ │ Warp 1  │ │ Warp 2  │ │ Warp N  │         │  │
│  │  │ shuffle │ │ shuffle │ │ shuffle │ │ shuffle │         │  │
│  │  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘         │  │
│  │       └──────────────┴──────────┴──────────┘             │  │
│  │                      │ shared memory                     │  │
│  │                      ▼                                   │  │
│  │              block_reduce_sum()                          │  │
│  │                      │                                   │  │
│  └──────────────────────┼───────────────────────────────────┘  │
│                         ▼ atomicAdd                            │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │              Global Accumulator (mapped memory)          │  │
│  └─────────────────────────────────────────────────────────┘  │
│                         │ grid.sync() / barrier                │
│                         ▼                                      │
│              All threads read final result                     │
└─────────────────────────────────────────────────────────────────┘
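The three levels in the diagram can be modelled on the CPU to make the data flow concrete. The sketch below is an illustration only, not the CUDA code this module generates: the names `model_warp_sum`, `model_block_sum`, and `model_grid_sum` are hypothetical, the warp step emulates a shuffle-down tree with a plain array, a `Vec` stands in for shared memory, and a simple `+=` stands in for `atomicAdd`.

```rust
/// CUDA warp width; used here only to size the CPU model.
const WARP_SIZE: usize = 32;

/// Emulates a shuffle-down tree reduction over one warp's lanes.
/// After log2(WARP_SIZE) steps, lane 0 holds the warp total.
fn model_warp_sum(lanes: &mut [f64; WARP_SIZE]) -> f64 {
    let mut offset = WARP_SIZE / 2;
    while offset > 0 {
        // Each lane below `offset` adds the value held `offset` lanes above it.
        for lane in 0..offset {
            lanes[lane] += lanes[lane + offset];
        }
        offset /= 2;
    }
    lanes[0]
}

/// Block level: every warp reduces its own lanes, the per-warp partials are
/// staged in a buffer (standing in for shared memory), then combined.
fn model_block_sum(block: &[f64]) -> f64 {
    let mut partials = Vec::new();
    for warp in block.chunks(WARP_SIZE) {
        // Lanes without data contribute 0.0, the identity for addition.
        let mut lanes = [0.0; WARP_SIZE];
        lanes[..warp.len()].copy_from_slice(warp);
        partials.push(model_warp_sum(&mut lanes));
    }
    partials.iter().sum()
}

/// Grid level: each block adds its total to one accumulator, the role that
/// atomicAdd plays in the diagram. `block_size` must be greater than zero.
fn model_grid_sum(values: &[f64], block_size: usize) -> f64 {
    let mut accumulator = 0.0;
    for block in values.chunks(block_size) {
        accumulator += model_block_sum(block);
    }
    accumulator
}
```

Calling `model_grid_sum(&values, 64)` on the integers 1 through 100 (as `f64`) yields 5050.0, the same result as a sequential sum. On a GPU the grouping changes the order of floating-point additions, so results for non-integer inputs may differ from a sequential sum in the last bits.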

§Example

// Rust DSL
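fn pagerank_phase1(ranks: &[f64], out_degree: &[u32], dangling_sum: &mut f64, n: u32) {
    let idx = global_thread_idx();

    // Only dangling nodes (out-degree 0) contribute their rank. Out-of-range
    // threads contribute 0.0 instead of returning early, so that every thread
    // reaches the reduction and its barrier.
    let contrib = if idx < n && out_degree[idx] == 0 { ranks[idx] } else { 0.0 };
    reduce_and_broadcast(contrib, dangling_sum);  // Every thread receives the sum
}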
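For checking results, it can help to have a sequential reference for the value the kernel above reduces: the sum of the ranks of all dangling nodes (nodes whose out-degree is 0). The function below is a minimal host-side sketch; `dangling_sum_reference` is a hypothetical name and is not part of this module.

```rust
/// Sequential reference for the dangling-node sum.
/// Mirrors the per-thread contribution: a node's rank if its out-degree
/// is 0, otherwise 0.0. Extra elements in the longer slice are ignored.
fn dangling_sum_reference(ranks: &[f64], out_degree: &[u32]) -> f64 {
    ranks
        .iter()
        .zip(out_degree)
        .map(|(&rank, &deg)| if deg == 0 { rank } else { 0.0 })
        .sum()
}
```

With `ranks = [0.25, 0.5, 0.125, 0.125]` and `out_degree = [2, 0, 1, 0]`, nodes 1 and 3 are dangling, so the reference returns 0.625. A GPU result should agree with this up to floating-point rounding, since the parallel reduction adds values in a different order.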

Structs§

ReductionCodegenConfig: Configuration for reduction code generation.

Enums§

ReductionIntrinsic: Reduction intrinsic types for the DSL.
ReductionOp: Reduction operation types for code generation.

Functions§

generate_inline_block_reduce: Generate inline reduction code (without helper function call).
generate_inline_grid_reduce: Generate inline grid reduction with atomic accumulation.
generate_inline_reduce_and_broadcast: Generate inline reduce-and-broadcast code.
generate_reduction_helpers: Generate CUDA helper functions for reduction operations.
transpile_reduction_call: Transpile a reduction intrinsic call to CUDA code.
