//! TDD Tests for GPU-Resident Tensor Architecture (WAPR-PERF-004)
//!
//! These tests are written FIRST (Test-Driven Development) to define the API
//! before implementation exists. Each test documents a specific requirement.
//!
//! ## Problem Statement
//!
//! Current approach: ~150 host↔device transfers per encoder forward pass
//! Target: 2 transfers total (upload weights once, download output once)
//!
//! ## Root Cause (Five Whys)
//!
//! 1. Why is the GPU encoder slower than the CPU path? → Only 0.76x "speedup" (a net slowdown)
//! 2. Why doesn't CUDA gemm help? → ~150 host↔device transfers per forward pass
//! 3. Why so many transfers? → Data ping-pongs: CPU→GPU for gemm, GPU→CPU for softmax
//! 4. Why not keep data on GPU? → No GPU-resident tensor abstraction
//! 5. What's the fix? → Build GpuResidentTensor into trueno-gpu
//!
//! ## Citations
//!
//! - [Dao2022] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
//! - [Kwon2023] Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM)
//! - [Popper1934] The Logic of Scientific Discovery - Falsificationism
// ============================================================================
// Marker module for feature gate
// ============================================================================