//! Prefill (batch) GPU dispatch for OxiBonsai — CUDA backend.
//!
//! Mirrors [`metal_prefill`] for Linux/Windows. Handles batch processing of
//! multiple tokens during prompt prefill using GEMM kernels.
//!
//! # Architecture
//!
//! - [`CudaPrefillBuffers`]: Pre-allocated GPU buffers sized for `batch_size` tokens.
//! - [`CudaPrefillModules`]: Compiled CUDA functions for the 5 prefill kernels.
//! - `encode_prefill_ffn_phase`: Batched FFN pipeline (RMSNorm → gate+up+SwiGLU → down).
//! - `encode_prefill_layer`: One full prefill transformer layer.
//! - [`try_cuda_prefill`]: Public entry point mirroring `try_metal_full_forward_prefill`.
//!
//! # Module structure
//!
//! Phase 29 split the monolithic `cuda_prefill.rs` (1989 lines) into focused
//! sub-modules; all external `super::cuda_prefill::*` access paths are
//! preserved through the re-exports below.
//!
//! - [`state`]: types ([`CudaPrefillBuffers`], [`CudaPrefillModules`]),
//!   singleton state, [`init_prefill_modules`], and buffer-acquisition
//!   helpers.
//! - [`launchers`]: thin `unsafe fn` wrappers around `launch_builder()`
//!   for the 7 prefill kernels (4 Q1 + 3 TQ2).
//! - [`encode_q1`]: Q1 (1-bit) FFN + full-layer encoders.
//! - [`encode_ternary`]: TQ2 (ternary) FFN + full-layer encoders.
//! - [`try_apis`]: public [`try_cuda_prefill`] / [`try_cuda_prefill_ternary`]
//!   entry points.
//!
//! # Batch tensor layout
//!
//! All batched buffers use **column-major** layout: `buf[col * dim + element]`
//! where `col` is the batch/token index. This matches the Metal MSL kernels.
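//!
//! As a minimal sketch of the indexing rule above (the helper
//! `col_major_index` and the sizes used here are hypothetical and exist only
//! for illustration; they are not part of this module), each token's `dim`
//! values sit contiguously, one token after another:
//!
//! ```rust
//! // Hypothetical helper: flat index of `element` within token column `col`,
//! // for a buffer that stores `dim` values per token.
//! fn col_major_index(col: usize, dim: usize, element: usize) -> usize {
//!     debug_assert!(element < dim);
//!     col * dim + element
//! }
//!
//! // Assumed example sizes: 3 tokens, 4 values per token.
//! let (batch_size, dim) = (3, 4);
//! let mut buf = vec![0.0f32; batch_size * dim];
//!
//! // Write element 2 of token 1.
//! buf[col_major_index(1, dim, 2)] = 1.0;
//! assert_eq!(col_major_index(1, dim, 2), 6);
//!
//! // Token 1 occupies the contiguous range buf[4..8].
//! assert_eq!(buf[4..8], [0.0f32, 0.0, 1.0, 0.0]);
//! ```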
//!
//! # Attention in the prefill path
//!
//! We do not have a batched attention kernel; attention is processed sequentially
//! per token using the existing single-token attention kernels from `cuda_full_layer`.
pub use ;
pub use ;