# PTX Optimization Passes
This chapter documents the PTX optimization passes in `trueno-gpu`, aligned with NVIDIA's official CUDA Tile IR (CUDA Toolkit 13.1).
## Overview
The `trueno_gpu::ptx::optimize` module provides four optimization passes:
| **FMA Fusion** | `mul + add` → `fma` | Reduced latency, single rounding |
| **Loop Splitting** | Conditional loop splitting | Eliminates branch divergence |
| **Token-Based Ordering** | Memory dependency tracking | Barrier elimination |
| **Tile Validation** | Power-of-two constraints | Prevents register pressure |
## FMA Fusion Pass
The FMA (Fused Multiply-Add) fusion pass detects `mul` + `add` instruction patterns and fuses them into a single `fma` instruction.
### Benefits
- **Latency**: Single instruction instead of two
- **Precision**: Single rounding operation (IEEE 754 compliant)
- **Throughput**: Utilizes GPU FMA units efficiently
### Example
```rust
use trueno_gpu::ptx::optimize::fma_fusion;
use trueno_gpu::ptx::{Operand, PtxInstruction, PtxOp, PtxType, VirtualReg};
// Create mul + add pattern
let r0 = VirtualReg::new(0, PtxType::F32);
let r1 = VirtualReg::new(1, PtxType::F32);
let r2 = VirtualReg::new(2, PtxType::F32);
let r3 = VirtualReg::new(3, PtxType::F32);
let mul = PtxInstruction::new(PtxOp::Mul, PtxType::F32)
.dst(Operand::Reg(r2.clone()))
.src(Operand::Reg(r0.clone()))
.src(Operand::Reg(r1.clone()));
let add = PtxInstruction::new(PtxOp::Add, PtxType::F32)
.dst(Operand::Reg(r3))
.src(Operand::Reg(r2))
.src(Operand::ImmF32(1.0));
// Fuse to single FMA instruction
let fused = fma_fusion::pass(vec![mul, add]);
assert_eq!(fused.len(), 1); // mul + add → fma
```
### Academic Reference
Based on Click & Paleczny (1995) "A Simple Graph-Based Intermediate Representation" for SSA pattern matching.
## Loop Splitting Pass
The loop splitting pass analyzes conditional loops and identifies opportunities to split them at condition boundaries, eliminating branch divergence in GPU warps.
### Heavy Operations
The following operations trigger split profitability:
- `Ld` - Memory loads
- `St` - Memory stores
- `WmmaMma` - Tensor Core MMA
- `WmmaLoadA`, `WmmaLoadB`, `WmmaLoadC` - WMMA fragment loads
- `WmmaStoreD` - WMMA fragment stores
### Example
```rust
use trueno_gpu::ptx::optimize::loop_split;
use trueno_gpu::ptx::{PtxInstruction, PtxOp, PtxType, CmpOp};
// Check profitability
let heavy_op = PtxInstruction::new(PtxOp::Ld, PtxType::F32);
assert!(loop_split::is_split_profitable(&[heavy_op], 10));
let light_op = PtxInstruction::new(PtxOp::Add, PtxType::F32);
assert!(!loop_split::is_split_profitable(&[light_op], 10));
// Split point alignment for non-unit steps
assert_eq!(loop_split::align_split_point(5, 0, 4), 8);
assert_eq!(loop_split::align_split_point(8, 0, 4), 8);
// Loop predicate conversion
assert_eq!(
loop_split::LoopPredicate::from_cmp_op(CmpOp::Lt),
Some(loop_split::LoopPredicate::LessThan)
);
```
### NVIDIA Reference
Aligned with `LoopSplit.cpp` from NVIDIA CUDA Tile IR (CUDA Toolkit 13.1).
## Token-Based Ordering (TKO)
Token-Based Ordering provides explicit memory dependency tracking, enabling compiler-driven barrier elimination.
### Memory Ordering Semantics
| `Weak` | `.weak` | No ordering guarantees |
| `Relaxed` | `.relaxed` | Relaxed consistency |
| `Acquire` | `.acquire` | Acquire semantics |
| `Release` | `.release` | Release semantics |
### Memory Scopes
| `Thread` | `.cta` | Thread-local |
| `Block` | `.cta` | Block-local |
| `Cluster` | `.cluster` | Cluster-local |
| `Device` | `.gpu` | Device-wide |
| `System` | `.sys` | System-wide |
### Example
```rust
use trueno_gpu::ptx::optimize::tko;
// Create tokens for memory operations
let t1 = tko::Token::new();
let t2 = tko::Token::new();
let t3 = tko::Token::new();
// Join tokens at synchronization point
let joined = tko::join_tokens(&[t1, t2, t3]);
// Memory ordering
let ordering = tko::MemoryOrdering::Acquire;
assert_eq!(ordering.to_ptx_modifier(), ".acquire");
// Memory scope
let scope = tko::MemoryScope::Device;
assert_eq!(scope.to_ptx_scope(), ".gpu");
// Token graph with cycle detection
let mut graph = tko::TokenGraph::new();
let ta = tko::Token::new();
let tb = tko::Token::new();
let tc = tko::Token::new();
graph.create_token(ta);
graph.create_token(tb);
graph.create_token(tc);
graph.add_dependency(tb, ta);
graph.add_dependency(tc, tb);
assert!(!graph.has_cycle()); // No deadlock
graph.add_dependency(ta, tc);
assert!(graph.has_cycle()); // DEADLOCK!
```
### NVIDIA Reference
Aligned with `memory_consistency_ops.mlir` from NVIDIA CUDA Tile IR.
## Tile Validation
Tile validation enforces constraints to prevent register pressure issues and compilation hangs.
### Constraints
1. **Power-of-two dimensions**: Required for efficient GPU scheduling
2. **Maximum tile elements**: 16M elements to prevent register spills
3. **Maximum single dimension**: 4096 to prevent degenerate shapes
### WMMA Valid Shapes
| `M16N16K16` | Standard 16×16×16 |
| `M8N32K16` | Alternate 8×32×16 |
| `M32N8K16` | Alternate 32×8×16 |
### Example
```rust
use trueno_gpu::ptx::optimize::tile_validation;
use trueno_gpu::ptx::WmmaShape;
// Valid shapes
assert!(tile_validation::validate_shape(&[16, 16]).is_ok());
assert!(tile_validation::validate_shape(&[32, 32]).is_ok());
assert!(tile_validation::validate_shape(&[64, 64]).is_ok());
// Invalid shapes
assert!(tile_validation::validate_shape(&[17, 16]).is_err()); // Not power of two
assert!(tile_validation::validate_shape(&[100, 100]).is_err());
// WMMA shapes
let valid_wmma = WmmaShape::M16N16K16;
assert!(tile_validation::validate_wmma_shape(&valid_wmma).is_ok());
let invalid_wmma = WmmaShape { m: 24, n: 24, k: 16 };
assert!(tile_validation::validate_wmma_shape(&invalid_wmma).is_err());
```
### Academic Reference
Based on Volkov & Demmel (2008) "Benchmarking GPUs to Tune Dense Linear Algebra".
## Running the Example
```bash
cargo run --example ptx_optimize
```
Output:
```
╔══════════════════════════════════════════════════════════════╗
║ PTX Optimization Passes (NVIDIA CUDA Tile IR Aligned) ║
╚══════════════════════════════════════════════════════════════╝
1️⃣ FMA FUSION PASS
Input: 2 instructions (mul + add)
Output: 1 instruction (fma)
2️⃣ LOOP SPLITTING PASS
Heavy ops trigger split: true
Light ops trigger split: false
3️⃣ TOKEN-BASED ORDERING (TKO)
Tokens created with unique IDs
Cycle detection: working
4️⃣ TILE VALIDATION
Power-of-two shapes: OK
Invalid shapes: rejected
✅ All optimization demos completed successfully!
```
## Specification
Full specification: [cuda-tile-behavior.md](../../../docs/specifications/cuda-tile-behavior.md) (v1.4.0)
## Coverage
| `fma_fusion.rs` | 93.75% |
| `loop_split.rs` | 99.80% |
| `tko/mod.rs` | 94.29% |
| `tile_validation.rs` | 88.64% |
| **Total** | **94.28%** |