//! Reduction op family — Phase 4 (Category E).
//!
//! Output shape differs from input: the reduced axis collapses to size
//! 1 (keepdim convention). Only single-axis reductions are implemented
//! today; full-tensor and multi-axis reductions are expressed as
//! repeated single-axis ops. The one exception is the broadcast-reverse
//! [`ReduceToPlan`], which collapses every broadcast dim in one launch.
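//!
//! A minimal CPU sketch of the keepdim convention described above, not
//! this crate's API: `keepdim_shape` and `sum_axis` are hypothetical
//! helper names, and a dense row-major `f32` buffer is assumed.
//!
//! ```rust
//! // Hypothetical helper: output shape of a single-axis reduction.
//! // The reduced axis is kept with size 1 rather than removed.
//! fn keepdim_shape(shape: &[usize], axis: usize) -> Vec<usize> {
//!     assert!(axis < shape.len(), "axis out of range");
//!     let mut out = shape.to_vec();
//!     out[axis] = 1;
//!     out
//! }
//!
//! // Hypothetical CPU reference: sum a row-major buffer along `axis`.
//! // The tensor is viewed as [outer, len, inner], with `len` reduced.
//! fn sum_axis(data: &[f32], shape: &[usize], axis: usize) -> (Vec<f32>, Vec<usize>) {
//!     let outer: usize = shape[..axis].iter().product();
//!     let len = shape[axis];
//!     let inner: usize = shape[axis + 1..].iter().product();
//!     assert_eq!(data.len(), outer * len * inner, "shape/data mismatch");
//!     let mut out = vec![0.0f32; outer * inner];
//!     for o in 0..outer {
//!         for i in 0..inner {
//!             let mut acc = 0.0f32;
//!             for k in 0..len {
//!                 acc += data[(o * len + k) * inner + i];
//!             }
//!             out[o * inner + i] = acc;
//!         }
//!     }
//!     (out, keepdim_shape(shape, axis))
//! }
//! ```
//!
//! For the row-major `[2, 3]` buffer `[1, 2, 3, 4, 5, 6]`, reducing
//! axis 1 yields `[6, 15]` with shape `[2, 1]`. Reducing axis 0 and then
//! axis 1 yields the full-tensor sum `[21]` with shape `[1, 1]`, which
//! is how a multi-axis reduction decomposes into single-axis ops.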
//!
//! Currently wired:
//!
//! - **[`ReducePlan`]** — `{Sum, Mean, Max, Min, Prod, Norm2, LogSumExp,
//! Var, Std} × {f32, f16, bf16, f64}` (36 cells). Var / Std use a
//! one-pass Welford kernel; LogSumExp uses a dedicated two-pass kernel
//! (max-then-sum-exp) for numerical stability. [`ReduceBackwardPlan`]
//! covers the same matrix.
//!
//! - **[`ArgReducePlan`]** — `{Argmax, Argmin}` returning `i64` indices.
//! Separate plan because the output dtype differs from the input
//! (index, not value); no BW (non-differentiable through `argmax`).
//!
//! - **[`BoolReducePlan`]** — `{Any, All}` returning `Bool`. Distinct
//! reduction algebra (logical-or / logical-and short-circuit). No BW.
//!
//! - **[`CountReducePlan`]** — `CountNonzero`: counts the nonzero values
//! along the reduced axis and returns the count as `i64`.
//!
//! - **[`TracePlan`]** — scalar `trace(M)` for rank-2 matrices (both
//! axes reduced along the diagonal).
//!
//! - **[`ReduceToPlan`]** — broadcast-reverse `{Sum, Max, Min, Prod} ×
//! {f32, f16, bf16, f64}` (16 cells). Collapses every broadcast dim
//! in one launch (the inverse of a forward `BroadcastTo` — autograd's
//! `ReduceSumTo` / `ReduceMaxTo`); Phase 74 facade over the Phase 31 /
//! 37 `reduce_*_to_*` FFI symbols.
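//!
//! The two numerically sensitive reductions named in the list can be
//! sketched on the CPU as follows. This is an illustration only, not
//! the kernels themselves: `welford` and `log_sum_exp` are hypothetical
//! names, `f64` is used for brevity, and the variance shown is the
//! population variance (divide by `n`), which is an assumption since
//! the correction used by `Var` / `Std` is not stated here.
//!
//! ```rust
//! // Welford's one-pass update: returns (mean, population variance).
//! // Each element updates the running mean and the sum of squared
//! // deviations `m2`, so no second pass over the data is needed.
//! fn welford(xs: &[f64]) -> (f64, f64) {
//!     let (mut n, mut mean, mut m2) = (0.0f64, 0.0f64, 0.0f64);
//!     for &x in xs {
//!         n += 1.0;
//!         let delta = x - mean;
//!         mean += delta / n;
//!         m2 += delta * (x - mean);
//!     }
//!     if n == 0.0 {
//!         (f64::NAN, f64::NAN)
//!     } else {
//!         (mean, m2 / n)
//!     }
//! }
//!
//! // Two-pass log-sum-exp: pass 1 finds the max, pass 2 sums
//! // exp(x - max). Shifting by the max keeps every exponent <= 0,
//! // so exp() cannot overflow.
//! fn log_sum_exp(xs: &[f64]) -> f64 {
//!     let max = xs.iter().copied().fold(f64::NEG_INFINITY, f64::max);
//!     if !max.is_finite() {
//!         // Empty input or all -inf gives -inf; a +inf input gives +inf.
//!         return max;
//!     }
//!     let sum: f64 = xs.iter().map(|&x| (x - max).exp()).sum();
//!     max + sum.ln()
//! }
//! ```
//!
//! With inputs such as `[1000.0, 1000.0]`, a naive `ln(sum(exp(x)))`
//! overflows to infinity, while the two-pass form returns
//! `1000 + ln 2`.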
//!
//! All reduction FW + BW kernels are deterministic and bit-stable on the
//! same hardware: they use no atomic-add and assign either one block or
//! one thread to each output cell, so the summation order is fixed.
//! f16 / bf16 inputs accumulate in f32 (the FP detour); f64 keeps
//! everything in double.
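//!
//! A small CPU analogy for both points, under stated assumptions: the
//! function names are hypothetical, and because stable Rust has no
//! `f16` type, the widening step is shown as `f32` inputs accumulated
//! in `f64` rather than `f16` / `bf16` accumulated in `f32`.
//!
//! ```rust
//! // Sum in a fixed left-to-right order, as a single worker that owns
//! // one output cell would. The same order gives the same bits on
//! // every run; a different order may not.
//! fn sum_in_order(xs: &[f64]) -> f64 {
//!     xs.iter().fold(0.0, |acc, &x| acc + x)
//! }
//!
//! // Accumulate in the input precision (no widening).
//! fn sum_narrow(xs: &[f32]) -> f32 {
//!     xs.iter().fold(0.0f32, |acc, &x| acc + x)
//! }
//!
//! // Accumulate in a wider type, then round once at the end.
//! fn sum_widened(xs: &[f32]) -> f32 {
//!     let acc: f64 = xs.iter().map(|&x| f64::from(x)).sum();
//!     acc as f32
//! }
//! ```
//!
//! Floating-point addition is not associative, so
//! `sum_in_order(&[1e17, 1.0, -1e17])` is `0.0` while
//! `sum_in_order(&[1e17, -1e17, 1.0])` is `1.0`. An atomic-add
//! reduction, whose order depends on scheduling, can therefore change
//! bits between runs. Likewise, `sum_narrow(&[16777216.0, 1.0, 1.0])`
//! stays at `16777216.0`, while `sum_widened` on the same input returns
//! `16777218.0`.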
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;