1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
//! Backward pass execution for the SDDP training loop.
//!
//! `run_backward_pass` sweeps stages in reverse order (`T-2` down to `0`),
//! evaluating the cost-to-go at each trial point **assigned to this rank**
//! during the forward pass. For each trial point, the backward pass iterates
//! over every opening from the fixed opening tree, extracts LP duals to form
//! Benders cut coefficients, and aggregates per-opening outcomes via
//! [`RiskMeasure::aggregate_cut`] to produce one cut per trial point per
//! stage. Each aggregated cut is inserted into the [`crate::FutureCostFunction`].
//!
//! Although [`ExchangeBuffers`] contains trial points from all ranks (after
//! `allgatherv`), each rank only processes its own forward pass assignments
//! to avoid generating duplicate cuts. Cut synchronization (`allgatherv`)
//! distributes the generated cuts to all ranks after the backward pass.
//!
//! ## Stage indexing convention
//!
//! The backward pass generates a cut **at stage `t`** by solving the LP
//! **at stage `t + 1`** (the successor) under each opening noise vector from
//! that successor stage. The opening tree provides noise at `t + 1`.
//!
//! ## Cut coefficient formula
//!
//! For a solve at stage `t + 1` with trial point state `x_hat`:
//!
//! ```text
//! pi[i] = reduced_cost[col_i] / col_scale[col_i] for i in 0..n_state
//! alpha = Q - sum_i(pi[i] * x_hat[i]) (intercept)
//! ```
//!
//! where `Q` is the LP objective, `col_i = state_to_lp_incoming_column(i)` is the
//! bound-pinned incoming-state column, and `reduced_cost[col_i]` is its `HiGHS`
//! reduced cost (the dual of the `lb == ub` pin — equal to the
//! fixing-row dual by KKT stationarity). State fixing uses column bounds, not
//! equality rows; the gradient is a reduced cost,
//! unscaled by `col_scale` (not `row_scale`). The convention is
//! `coefficients = reduced_cost` (raw, no sign flip at extraction). Negation is
//! applied later when building the LP cut row in
//! `cut::row::build_cut_row_batch_into`:
//! `-coeff * x + theta >= intercept`.
//! The intercept formula covers all `n_state` indices uniformly; no special handling
//! for the anticipated-state subrange.
//! See the project convention: "coefficients = dual (NOT -dual)".
//!
//! ### Anticipated-state cut gradient flow
//!
//! Anticipated-state slots are deterministic state variables that connect a decision
//! stage to its delivery stage via the fishing constraint at the delivery stage.
//! `state_to_lp_column` applies a shift-aware mapping for these slots: slot `K_p-1`
//! (the maturation slot for plant `p`) maps to the decision column, while slot
//! `i < K_p-1` maps to slot `i+1` so cuts apply against the correct LP variables
//! at the predecessor stage.
//!
//! The dual of the anticipated-state-fixing row at stage `t` reflects
//! `dQ_t/dx_ant[slot, plant]`. Under the always-active fishing predicate
//! (`StageIndexer::is_anticipated_fishing_active`), every
//! slot at every stage participates in the dual chain: the fishing constraint is
//! emitted at every stage unconditionally. The dual on the Cat 6
//! state-fixing row at slot `s` flows back to the predecessor's LP column via
//! `state_to_lp_column`'s branch decision (Less / Equal / Greater), which maps
//! slot `K_p-1` to the decision column and slot `i < K_p-1` to slot `i+1`.
//! See the `StageIndexer::state_to_lp_column` rustdoc for the slot-to-column
//! branch mapping that drives the dual chain.
//!
//! The backward pass does not call `shift_anticipated_state`. The trial point
//! `x_hat` is the forward-shifted state; cut extraction uses it as-is. This
//! mirrors the inflow-lag convention and is correct: re-shifting would offset
//! `x_hat` relative to the anticipated-state-fixing rows, breaking cut consistency.
//! Empirical verification:
//! `crates/cobre-sddp/tests/anticipated_backward_cut_k1.rs` (K=1) and
//! `crates/cobre-sddp/tests/anticipated_backward_cut_k2.rs` (K=2).
//!
//! ## Cut activity tracking
//!
//! After each backward solve, the duals of the appended cut rows are inspected
//! to determine which existing cuts at the successor stage are binding. The
//! metadata of binding cuts is updated in-place so that cut selection
//! strategies have accurate activity counts at the end of the iteration.
//!
//! ## Thread-level parallelism
//!
//! Within a rank, the outer per-stage loop remains sequential (stage `t`
//! depends on cuts generated at stage `t+1`). The inner trial-point loop is
//! parallelised across [`SolverWorkspace`] instances using rayon's
//! `par_iter_mut` with static scenario partitioning (matching the forward pass).
//! Each worker generates cuts into a thread-local `StagedCut` buffer, sorted
//! by `trial_point_idx` after the parallel region to ensure deterministic FCF
//! insertion regardless of thread completion order.
//!
//! ## Hot-path allocation discipline
//!
//! Allocations are limited to:
//! - One `Vec<f64>` for opening probabilities per stage (outside the trial
//! point loop).
//! - One `Vec<BackwardOutcome>` per worker thread, allocated once per stage
//! in the parallel region and reused via `clear()` per trial point.
//! - One `RowBatch` per stage built by `build_cut_row_batch` (outside the
//! trial point loop, before the parallel region).
//! - One `Vec<StagedCut>` per stage for the merge phase (bounded by
//! `local_work` entries, each holding one cut and its binding slot list).
//!
//! The `binding_slots` vector inside each `StagedCut` is allocated per
//! trial point — a flat buffer optimization is deferred to profiling.
//!
//! Note: `load_model` is called once per trial point (not per stage) to reset
//! the LP to the structural template before appending cuts. `HiGHS` performs
//! internal allocations during `load_model` that are not visible to this
//! module; these are a fixed cost per trial point and are not considered
//! hot-path allocations from Cobre's perspective.
use RowBatch;
use crate::;
// Re-export the crate-internal backward surface so `crate::backward::X`,
// `cobre_sddp::backward::X`, and `training::backward::X` resolve verbatim. The
// dense grouped consumer `use crate::{ backward::{BackwardResult,
// StageOpeningSolver, StageWorkerOpeningDelta, StagedCut, SuccessorSpec,
// process_trial_point_backward} }` in `training/backward_pass_state.rs` reaches
// `StageOpeningSolver` + `process_trial_point_backward` here (the four data types
// are defined directly in this module). `load_backward_lp` and
// `resolve_backward_basis` have no non-test crate consumer — `trial_point` reaches
// them through the `super::lp_setup::` path — so they are re-exported only under
// `#[cfg(test)]` for the in-module tests' `super::load_backward_lp` /
// `super::resolve_backward_basis` call sites.
pub use ;
pub use ;
/// Per-`(rank, worker_id, opening)` solver delta collected during a single
/// backward stage, as returned inside [`BackwardResult::stage_stats`].
///
/// Layout: `(rank, worker_id, opening_index, delta)`.
pub type StageWorkerOpeningDelta = ;
/// Result produced by the backward pass on a single rank.
///
/// The per-worker timing data carried inside `stage_stats` is keyed
/// by the `WORKER_TIMING_SLOT_*` constants exported from
/// `cobre-core`. New per-worker timing slots should be added to
/// that constant set (and the `WORKER_TIMING_SLOT_COUNT` updated)
/// rather than as standalone fields on this struct, so the parquet
/// timing schema picks them up automatically.
/// Per-thread staging buffer for one aggregated cut produced at a single trial
/// point during the parallel backward sweep.
///
/// Each worker thread populates one `StagedCut` per trial point instead of
/// writing directly into the `FutureCostFunction`. After the parallel region,
/// staged cuts are sorted by `trial_point_idx` and merged into the FCF in
/// deterministic order regardless of thread completion order.
pub
/// Per-successor data bundled for `process_stage_backward` and the trial-point helper.
///
/// Groups the successor-specific arguments — including the stage index `t`,
/// opening probabilities, pre-built cut batch, and cut activity metadata —
/// to keep per-function argument counts at or below seven.
pub