//! Pooling op family — Phase 7 Milestone 7.2 (Category Pooling / J).
//!
//! Wraps cuDNN's legacy descriptor-based pooling API. The trailblazer is
//! 2-D NCHW pooling — both max-pool ([`MaxPool2dPlan`]) and average-pool
//! ([`AvgPool2dPlan`]). 1-D / 3-D pooling, adaptive pooling, LP-pool,
//! and fractional-max-pool follow in fanout milestones.
//!
//! ## Plan layout
//!
//! - [`MaxPool2dPlan`] / [`AvgPool2dPlan`] — each owns one cuDNN handle
//! plus three lazy descriptors (`x_desc`, `y_desc`, `pool_desc`)
//! created on first `run_fw` and reused across launches. No workspace
//! caches — pooling is **workspace-free** in cuDNN's legacy API.
//!
//! ## Average-pool padding modes
//!
//! cuDNN exposes two avg-pool flavors:
//!
//! - `AVERAGE_COUNT_INCLUDE_PADDING`: divide by `window_h * window_w`
//!   (zero-padded cells count toward the denominator). Corresponds to
//!   `count_include_pad=True`, the default of PyTorch's `nn.AvgPool2d`.
//! - `AVERAGE_COUNT_EXCLUDE_PADDING`: divide only by the count of
//!   *valid* (non-padded) cells in each window. Corresponds to
//!   `count_include_pad=False` in PyTorch.
//!
//! [`AvgPool2dPlan`] dispatches on the [`PoolMode`] field of the
//! descriptor; the trailblazer defaults the convenience constructor to
//! `AvgExcludePad`. Callers who need parity with PyTorch's default
//! must select the include-padding mode explicitly.
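//!
//! The difference between the two denominators shows up only in windows
//! that overlap the padding. The following is a minimal CPU sketch, not
//! this crate's code: `avg_window` and its parameters are hypothetical
//! names chosen for illustration, and it handles a single row-major
//! plane.
//!
//! ```rust
//! /// Averages one pooling window over a row-major `h x w` plane.
//! /// `top` / `left` may be negative: those cells lie in the zero padding.
//! fn avg_window(
//!     x: &[f32], h: usize, w: usize,
//!     top: isize, left: isize,
//!     win_h: usize, win_w: usize,
//!     include_pad: bool,
//! ) -> f32 {
//!     let mut sum = 0.0f32;
//!     let mut valid = 0usize;
//!     for dy in 0..win_h as isize {
//!         for dx in 0..win_w as isize {
//!             let (r, c) = (top + dy, left + dx);
//!             // Padded cells contribute zero to the sum and are not "valid".
//!             if r >= 0 && c >= 0 && (r as usize) < h && (c as usize) < w {
//!                 sum += x[r as usize * w + c as usize];
//!                 valid += 1;
//!             }
//!         }
//!     }
//!     // Include-padding divides by the full window; exclude-padding by
//!     // the number of cells that actually fall inside the input.
//!     let denom = if include_pad { win_h * win_w } else { valid };
//!     if denom == 0 { 0.0 } else { sum / denom as f32 }
//! }
//! ```
//!
//! For a 2x2 plane `[1, 2, 3, 4]` with pad 1, the top-left 2x2 window
//! covers one real cell (value 1): include-padding yields `0.25`,
//! exclude-padding yields `1.0`.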
//!
//! ## Backward pass
//!
//! [`Pool2dBwArgs`] carries **both** `y` (saved FW output) and `x`
//! (saved FW input) because cuDNN's pooling-BW API requires both. For
//! max-pool it uses them to recover the per-window argmax (the legacy
//! API materializes no separate indices tensor). For avg-pool the
//! gradient depends only on `dy` and the window geometry, but cuDNN
//! still takes `x` and `y` for API uniformity. Callers must retain
//! `y` and `x` from the FW launch.
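//!
//! To illustrate why saved `x` and `y` are enough to route max-pool
//! gradients without an indices tensor, here is a hedged CPU reference
//! sketch. It is not cuDNN's implementation: `max_pool2d_bw_ref` is a
//! hypothetical name, it covers one plane with a square window and no
//! padding, and its "first match wins" tie rule is an assumption.
//!
//! ```rust
//! /// Max-pool backward on one row-major `h x w` plane (requires `h, w >= win`).
//! /// `y` and `dy` have shape `h_out x w_out`; the returned `dx` is `h x w`.
//! fn max_pool2d_bw_ref(
//!     x: &[f32], y: &[f32], dy: &[f32],
//!     h: usize, w: usize, win: usize, stride: usize,
//! ) -> Vec<f32> {
//!     let h_out = (h - win) / stride + 1;
//!     let w_out = (w - win) / stride + 1;
//!     let mut dx = vec![0.0f32; h * w];
//!     for oh in 0..h_out {
//!         for ow in 0..w_out {
//!             let o = oh * w_out + ow;
//!             // Recover the argmax by matching the saved output value
//!             // against the saved input values inside the window.
//!             'window: for kh in 0..win {
//!                 for kw in 0..win {
//!                     let i = (oh * stride + kh) * w + (ow * stride + kw);
//!                     if x[i] == y[o] {
//!                         dx[i] += dy[o];
//!                         break 'window; // assumed tie rule: first match
//!                     }
//!                 }
//!             }
//!         }
//!     }
//!     dx
//! }
//! ```
//!
//! With `x = [1, 3, 2, 4]` (2x2), `win = 2`, `stride = 2`, `y = [4]`
//! and `dy = [5]`, the whole gradient lands on the last cell.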
//!
//! ## Handle ownership
//!
//! Each plan lazily owns one `cudnnHandle_t` in a `Cell<>` (created on
//! the first launch and bound to the caller's stream on every launch,
//! so the plan is reusable across streams). cuDNN handles are **not**
//! thread-safe, and the plan is neither `Sync` nor `Send`: the `Cell<>`
//! rules out `Sync`, and the raw handle pointer inside it rules out
//! `Send`. The handle and all descriptors are released in `Drop`.
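//!
//! The lazy-ownership pattern can be sketched in plain Rust. This is an
//! illustration under stated assumptions, not the plan's real code:
//! `FakeHandle` and `Plan` are hypothetical stand-ins, and a boxed
//! struct replaces the cuDNN create / destroy calls. Because the field
//! is a `Cell<*mut _>`, the struct is automatically neither `Send` nor
//! `Sync`.
//!
//! ```rust
//! use std::cell::Cell;
//!
//! /// Stand-in for an opaque library handle (hypothetical; not cuDNN).
//! struct FakeHandle {
//!     id: u32,
//! }
//!
//! struct Plan {
//!     handle: Cell<*mut FakeHandle>,
//! }
//!
//! impl Plan {
//!     fn new() -> Self {
//!         Plan { handle: Cell::new(std::ptr::null_mut()) }
//!     }
//!
//!     /// Returns the handle, creating it on first use.
//!     fn handle(&self) -> *mut FakeHandle {
//!         if self.handle.get().is_null() {
//!             let raw = Box::into_raw(Box::new(FakeHandle { id: 7 }));
//!             self.handle.set(raw);
//!         }
//!         self.handle.get()
//!     }
//! }
//!
//! impl Drop for Plan {
//!     fn drop(&mut self) {
//!         let p = self.handle.get();
//!         if !p.is_null() {
//!             // SAFETY: `p` came from `Box::into_raw` in `handle` and is
//!             // released exactly once, here.
//!             unsafe { drop(Box::from_raw(p)) };
//!         }
//!     }
//! }
//! ```
//!
//! Repeated calls to `handle()` return the same pointer, and a plan that
//! never launched drops without touching the handle at all.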
//!
//! ## Workspace
//!
//! [`Workspace::None`] suffices: cuDNN's legacy pooling calls take no
//! workspace argument. The `run_*` methods accept any
//! `Workspace<'_>` for caller convenience but never read from it.
//!
//! ## Output spatial extents
//!
//! Computed by both plans as
//! `H_out = floor((H_in + 2·pad_h - window_h) / stride_h) + 1`,
//! and similarly for `W_out`. This matches PyTorch / cuDNN convention
//! (no `ceil_mode` knob in the trailblazer — that's a fanout extension).
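//!
//! The formula above can be checked with a small helper. This is a
//! sketch only: `pool_out_extent` is a hypothetical name, not part of
//! this module's API, and returning `None` for degenerate geometry is a
//! choice made for the example.
//!
//! ```rust
//! /// `floor((input + 2*pad - window) / stride) + 1`, or `None` when the
//! /// padded input is smaller than the window or a size is zero.
//! fn pool_out_extent(
//!     input: usize, pad: usize, window: usize, stride: usize,
//! ) -> Option<usize> {
//!     let padded = input + 2 * pad;
//!     if stride == 0 || window == 0 || padded < window {
//!         return None;
//!     }
//!     // Integer division on unsigned values is the floor.
//!     Some((padded - window) / stride + 1)
//! }
//! ```
//!
//! For example, `H_in = 7`, `pad_h = 1`, `window_h = 3`, `stride_h = 2`
//! gives `floor(6 / 2) + 1 = 4`.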
//!
//! ## Dtype coverage
//!
//! `f32`, `f64`, `f16`, `bf16` — the four cuDNN-supported FP types for
//! pooling. The cuDNN alpha/beta scalar dtype is `f32` for `f32` /
//! `f16` / `bf16` operands and `f64` for `f64` operands.
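//!
//! The operand-to-scalar mapping described above amounts to a single
//! match. The enums below are hypothetical illustrations, not this
//! crate's dtype types.
//!
//! ```rust
//! #[derive(Clone, Copy, Debug, PartialEq)]
//! enum Dtype { F32, F64, F16, Bf16 }
//!
//! #[derive(Clone, Copy, Debug, PartialEq)]
//! enum ScalarDtype { F32, F64 }
//!
//! /// Picks the host-side type used for the alpha / beta scalars.
//! fn scaling_dtype(d: Dtype) -> ScalarDtype {
//!     match d {
//!         Dtype::F64 => ScalarDtype::F64,
//!         Dtype::F32 | Dtype::F16 | Dtype::Bf16 => ScalarDtype::F32,
//!     }
//! }
//! ```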
// 2-D pooling (Phase 7 Milestone 7.2) — original trailblazers.
pub use AvgPool2dPlan;
pub use MaxPool2dPlan;
// Shared descriptor / args / mode types live in the max_pool2d module
// (which gets compiled first) and are re-exported here so callers can
// reach for `pool::Pool2dDescriptor` regardless of which plan they pick.
pub use ;
// 1-D pooling (Phase 11.8 / Fuel feedback #9).
pub use AvgPool1dPlan;
pub use ;
// 3-D pooling (Phase 11.8).
pub use AvgPool3dPlan;
pub use ;
// Adaptive pooling family (Phase 11.8, cuDNN approximation — see
// per-module rustdoc for the bit-exact-PyTorch caveat).
pub use ;
pub use ;
pub use ;
pub use AdaptiveMaxPool1dPlan;
pub use AdaptiveMaxPool2dPlan;
pub use AdaptiveMaxPool3dPlan;
// Fractional max-pool — bespoke kernel (Phase 16.3). FW + BW × 4 FP
// dtypes; caller supplies a `[N, C, num_axes]` f32 `random_samples`
// tensor and retains a saved-`indices` i64 tensor between FW and BW.
pub use ;
pub use ;
// LpPool — bespoke fused kernel (Phase 16.2). FW + BW × 4 FP dtypes.
// cuDNN has no native LpPool; the fused kernel does the full
// `y = (Σ |x|^p)^(1/p)` reduction in one launch (avoids the
// pow → avg_pool → pow stack + the missing parameterized Pow(p) plan).
pub use ;
pub use ;
pub use ;
pub use ;
use ;
/// Shared status-code mapper for the LpPool family. Mirrors the sort/
/// indexing family's `map_status` pattern.
pub