# Profiling
## The Real Profiling Mandate
Trueno enforces a strict **"Real Profiling"** mandate. All performance metrics reported by the ecosystem MUST be measured, not derived.
> **Forbidden**: Calculating per-brick time by taking total throughput and multiplying by a budget fraction.
> **Required**: Measuring start/end times for every operation, with full synchronization.
### Why?
Simulated or derived metrics mask bottlenecks. If you assume an operation takes 10% of the time, you will never discover when it actually takes 50% due to a regression.
## BrickProfiler v2 (PAR-200)
The `BrickProfiler` is the core profiling tool built into `trueno`. Version 2 (PAR-200) introduces O(1) hot-path profiling with deferred sync support.
### Key Features
| Feature | v1 | v2 (PAR-200) |
|---|---|---|
| Brick lookup | `HashMap<String>`, O(n) | `BrickId` enum, O(1) |
| GPU sync | Immediate (~200% overhead) | Deferred (~5% overhead) |
| Category aggregation | Manual | Automatic (Norm, Attention, FFN) |
### BrickId Enum
```rust
use trueno::{BrickProfiler, BrickId, BrickCategory, SyncMode};
let mut profiler = BrickProfiler::new();
profiler.enable();
// O(1) brick timing with enum-based lookup
let timer = profiler.start_brick(BrickId::QkvProjection);
// ... perform QKV projection ...
profiler.stop_brick(timer, num_elements);
// Category breakdown
let cats = profiler.category_stats();
println!("Attention: {:.1}%", cats[BrickCategory::Attention as usize].percentage(profiler.total_ns()));
```
### Deferred Sync Mode
For GPU workloads, immediate synchronization after every operation adds ~200% overhead. Deferred sync batches measurements:
```rust
profiler.set_sync_mode(SyncMode::Deferred);
profiler.reset_epoch();
// Record without sync (timestamps only)
let start = profiler.elapsed_ns();
// ... GPU kernel launch ...
profiler.record_deferred(BrickId::AttentionScore, start, elements);
// Single sync point at end of layer/batch
let end = profiler.elapsed_ns();
profiler.finalize(end); // Apply all pending measurements
```
**Sync Modes:**
- `Immediate`: Sync after every brick (~200% overhead, accurate per-brick)
- `PerLayer`: Sync once per transformer layer (~20% overhead)
- `Deferred`: Single sync at batch end (~5% overhead)
- `None`: No timing (0% overhead, for production)
### Running the Example
```bash
cargo run --example brick_profiler_v2
```
Output:
```
=== PAR-200: BrickProfiler v2 Demo ===
Per-Brick Timing:
Brick             Avg (µs)   Total (µs)   Count
----------------------------------------------------
RmsNorm              104.6        313.9       3
QkvProjection        253.8        761.3       3
...

Category Breakdown:
Category          Avg (µs)      Pct   Samples
--------------------------------------------
Norm                 104.6     8.3%         3
Attention            228.7    36.4%         6
FFN                  348.0    55.3%         6
```
### Integration with Realizar
The realizar inference engine integrates BrickProfiler v2:
```rust
// In CudaExecutor
executor.set_profiler_sync_mode(SyncMode::Deferred);
// During forward pass
let timer = executor.start_brick_id(BrickId::QkvProjection);
// ... kernel launch ...
executor.stop_brick_id(timer, hidden_dim as u64);
```
## Falsification Protocols (F101-F110)
To prove profiling is real, we apply Popperian Falsification:
1. **F101**: `BrickId::COUNT == 15` (all brick types defined)
2. **F102**: Category mapping correct for all BrickIds
3. **F103**: Deferred mode accumulates pending measurements
4. **F104**: `finalize()` clears pending queue
5. **F105**: Zero-overhead when disabled
6. **F106**: Array indexing is O(1)
7. **F107**: Thread-safe (Send + Sync)
8. **F108**: BrickIdTimer fits in 32 bytes
9. **F109**: `elapsed_ns()` monotonic
10. **F110**: Category stats sum correctly
## Execution Path Graph (PAR-201)
BrickProfiler v2 also supports execution path graphs, which record the full execution hierarchy of layers, bricks, and kernel launches:
```
Layer(0)
└─► Brick(QkvProjection) ─────► Kernel(batched_q4k_gemv, ptx_hash=0x7a3b...)
│ └─► PTX source lookup
└─► Brick(AttentionScore) ────► Kernel(incremental_attention, ptx_hash=0x9f1c...)
```
### Enabling Graph Recording
```rust
use trueno::{BrickProfiler, BrickId, ExecutionNode, PtxRegistry};
let mut profiler = BrickProfiler::new();
profiler.enable();
profiler.enable_graph(); // Enable execution graph tracking
// Record layer scope
profiler.graph_push_scope(ExecutionNode::Layer { index: 0 });
// Record brick
let timer = profiler.start_brick(BrickId::QkvProjection);
// ... work ...
profiler.stop_brick(timer, elements);
profiler.graph_record_brick(BrickId::QkvProjection, timing_ns, elements);
// Record kernel launch
profiler.graph_record_kernel(
"batched_q4k_gemv",
ptx_hash,
(32, 1, 1), // grid
(256, 1, 1), // block
4096, // shared_mem
);
profiler.graph_pop_scope();
```
### PTX Registry
Track PTX source code for debugging:
```rust
use std::path::Path;

let mut registry = PtxRegistry::new();
registry.register("kernel_name", ptx_source, Some(Path::new("src/kernel.ptx")));
// Lookup by hash
let hash = PtxRegistry::hash_ptx(ptx_source);
let source = registry.lookup(hash);
```
### Visualization
**Option 1: Headless ASCII Tree (CI/CD, Automation)**
Zero-dependency tree visualization for testing and automation:
```rust
let graph = profiler.execution_graph();
let tree = graph.to_ascii_tree();
println!("{}", tree);
// Output can be used for:
// - Snapshot tests (deterministic output)
// - CI/CD logs
// - File export
std::fs::write("execution_tree.txt", &tree)?;
```
Output:
```
Layer 0
├── RmsNorm 50.0µs (4096 elem)
│ └── rmsnorm_kernel <<<16,256,1>>> smem=1024B
├── QkvProjection 200.0µs (4096 elem)
│ └── batched_q4k_gemv <<<32,256,1>>> smem=4096B
```
**Option 2: Interactive TUI (presentar-terminal)**
Native TUI widget for interactive exploration (requires `presentar-tui` feature):
```rust
use trueno::ExecutionGraph;
use presentar_terminal::{Tree, TuiApp};
// Convert execution graph to tree widget
let tree_node = profiler.execution_graph().to_tree_node();
let tree = Tree::new().with_root(tree_node).expand_all();
// Use in TUI app or render headless via HeadlessCanvas
```
**Option 3: Graphviz DOT Export**
Export to Graphviz DOT format for SVG rendering:
```rust
// Export the recorded graph to Graphviz DOT format
let dot = profiler.graph_to_dot();
std::fs::write("graph.dot", dot)?;
```
```bash
# Render the DOT file to SVG
dot -Tsvg graph.dot -o graph.svg
```
### Running the Example
```bash
# Headless ASCII tree (default, no dependencies)
cargo run --example execution_graph
# With presentar-terminal TreeNode
cargo run --example execution_graph --features presentar-tui
```
## Backend-Specific Profiling (CPU/SIMD/GPU)
Different compute backends require different profiling approaches. See the full specification in `docs/specifications/ml-tuner-bricks.md` (Appendix E.8).
### Instrumentation Status
| Backend | Entry point | Instrumentation | Overhead |
|---|---|---|---|
| CUDA | `CudaExecutor::forward()` | Full | ~5% (deferred) |
| CPU | `forward()` | **None** | N/A |
| CPU | `forward_profiled()` | Full | ~10% |
| SIMD | trueno ops | Per-op | ~2% |
**Key Insight**: The legacy CPU `forward()` function lacks BrickProfiler instrumentation. For CPU profiling, use `forward_profiled()` or add instrumentation manually.
### SIMD Backend Profiling
Profile SIMD operations at the brick level:
```rust
use trueno::{BrickProfiler, BrickId};
let mut profiler = BrickProfiler::new();
profiler.enable();
// Profile SIMD operation
let timer = profiler.start_brick(BrickId::RmsNorm);
trueno::simd::rms_norm_avx2(&input, &mut output); // AVX2 backend
profiler.stop_brick(timer, input.len() as u64);
// Get throughput: elements/ns × 1000 = millions of elements per second
let stats = profiler.stats_for(BrickId::RmsNorm);
let throughput = stats.total_elements as f64 / stats.total_ns as f64 * 1000.0;
println!("RmsNorm: {:.2} Melem/s", throughput);
```
### Backend Comparison
Compare performance across backends:
```rust
use trueno::{BrickProfiler, BrickId, detect_backend, Backend};
let backend = detect_backend();
let mut profiler = BrickProfiler::new();
profiler.enable();
// Same brick, different backends
match backend {
Backend::Avx512 => { /* AVX-512 path */ }
Backend::Avx2 => { /* AVX2 path */ }
Backend::Neon => { /* ARM NEON path */ }
_ => { /* Scalar fallback */ }
}
// Report includes backend name
println!("Backend: {:?}", backend);
println!("{}", profiler.report());
```
### Backend-Specific Roofline
Different backends have different theoretical peaks:
| Backend | Peak compute (TFLOPS) | Peak bandwidth (GB/s) |
|---|---|---|
| RTX 4090 | 83.0 | 1008 |
| AVX-512 | ~2.0 | ~100 |
| AVX2 | ~0.5 | ~50 |
| ARM NEON | ~0.2 | ~40 |
| Scalar | ~0.1 | ~25 |
```rust
// Backend-aware roofline distance
let distance = match backend {
Backend::Cuda => graph.roofline_distance(83.0, 1008.0),
Backend::Avx512 => graph.roofline_distance(2.0, 100.0),
Backend::Avx2 => graph.roofline_distance(0.5, 50.0),
_ => graph.roofline_distance(0.1, 25.0),
};
```
### Critical Path Analysis (Phase 9)
Identify true bottlenecks vs parallelizable work:
```rust
let graph = profiler.execution_graph();
// Get critical path
let (critical_path, total_ns) = graph.critical_path();
println!("Critical path: {} nodes, {:.2}ms", critical_path.len(), total_ns as f64 / 1_000_000.0);
// Find parallelization opportunities
let slack = graph.compute_slack();
for (node_id, slack_ns) in &slack {
if *slack_ns > 0 {
println!("Node {} can be parallelized (slack: {}µs)", node_id.0, slack_ns / 1000);
}
}
// Formatted summary
println!("{}", graph.critical_path_summary());
```
## Tools
- **presentar-terminal Tree**: Native TUI tree widget for hierarchical execution graphs.
- **cbtop**: The primary visualization tool for ComputeBrick pipelines. Supports backend-specific profiling display.
- **perf / flamegraph**: For CPU-side overhead analysis.
- **nsight**: For deep GPU kernel inspection (external to the pure Rust stack).