1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
// SPDX-FileCopyrightText: 2026 John Moxley
// SPDX-License-Identifier: MIT OR Apache-2.0
//! Division policy — the divisor-shape algorithm matcher.
//!
//! Canonical policy shape (see `docs/ARCHITECTURE.md` → "Policy file
//! structure"), with one twist: division is the **one policy with no
//! const-width axis**. Its operands have *independent* runtime lengths that
//! no single level const expresses — the decimal `/` divides a `2N`-limb
//! scaled numerator by an `N`-limb divisor, and the slice roots
//! (`isqrt_newton` / `icbrt_newton` / `newton_reciprocal`) divide bare
//! `&[u64]` of runtime length with no `N` in their types at all. So unlike
//! a unary (`select<N>`) or binary (`select<Nthis, Nother>`) policy,
//! [`select`] here is **non-generic**: it always returns [`Select::ByShape`],
//! delegating the whole choice to the runtime [`select_for_limbs`]. (Forcing
//! a `<N>` would make the slice roots manufacture a const they don't have —
//! the kind of caller-side specialisation the architecture forbids — and
//! the divide doesn't use `N` anyway, since its engine choice is runtime.)
//!
//! Two selectors: [`select`] (the const matcher — here a no-op `ByShape`)
//! and [`select_for_limbs`] (the runtime limb-shape decision it delegates
//! to). The engines stay pure — each takes an already-chosen algorithm.
//! This file owns the *choice*: the benched crossover threshold
//! ([`BZ_THRESHOLD`]) is policy DATA here, not a magic number in a kernel.
use cratediv_burnikel_ziegler_with_knuth;
use cratediv_knuth;
use cratediv_knuth_u128_limb;
use cratediv_rem;
use cratediv_rem_schoolbook;
// ── 1. the real division engines — NAMED, no `Default` ────────────────
/// The division engines the divisor-shape matcher chooses between.
/// Variants are the CamelCase of each engine fn's name minus the `div_`
/// function prefix (`div_knuth` → `Knuth`, …) — strict 1:1 with the
/// engine fns in [`crate::int::algos::div`].
///
/// `pub(crate)` together with [`select_for_limbs`] so a concrete-`N` caller
/// that needs **exact `ComputeLimbs` scratch** (the decimal `/` and `%` int
/// layers) can read the matcher's verdict and route to the chosen engine's
/// `*_into` variant with its own buffers — rather than hardcoding one engine
/// (the matcher-bypass defect). The slice [`dispatch`] is the build-max entry
/// for callers that don't size scratch. Exposing the verdict (not a
/// `dispatch_into`, which would have to thread `N` through this slice policy)
/// keeps the division policy width-free.
pub
// ── 2. the verdict ────────────────────────────────────────────────────
/// A settled engine, or "the (runtime) limb shape decides". For division
/// every `N` resolves to [`Select::ByShape`] — the engine choice is
/// determined by the operands' effective limb counts, known only at run
/// time — so the `ByShape` arm delegates to [`select_for_limbs`].
/// [`Select::ByAlgorithm`] is the canonical alternative (a width-keyed
/// fixed engine); it is unused by this policy today but kept so `select<N>`
/// could pin an engine for some `N` range without changing the verdict
/// type.
// ── policy data: the benched crossover threshold ──────────────────────
/// Burnikel–Ziegler engagement threshold, in u64 limbs. **Not a
/// routing threshold** — [`select_for_limbs`] does not return BZ (the
/// policy-map showed it losing across the whole supported surface;
/// the den_n ≥ 65 region routes to [`Algorithm::KnuthU128Limb`] instead).
/// This const is read ONLY by [`div_burnikel_ziegler_with_knuth`]'s own
/// engagement guard (keeping BZ reachable via its `#[cfg(test)]` differential
/// + the bench seam): a divisor of at least this many effective limbs whose
/// dividend is `≥ 2·n` runs the BZ chunking, otherwise it shorts to Knuth.
/// **Policy data** — the kernels take an already-chosen engine and never
/// see this number.
///
/// **Benched optimum (Knuth-vs-BZ crossover, `div_kernel_ab`).** The
/// shipped `_with_knuth` engine is *block* division (it splits the
/// dividend into `n`-limb chunks and runs a full Knuth divide of each
/// `chunk‖carry` by the divisor), not recursive fast-division, so it has
/// no asymptotic edge over a single Knuth pass at the supported widths.
/// An A/B of Knuth vs the (forced) chunking core on the `div`-shaped
/// `wide_num` operands (`2n`-limb dividend over an `n`-limb divisor)
/// shows **Knuth wins at every width with no crossover**:
///
/// | divisor (limbs / tier) | Knuth vs BZ (wide_num) |
/// |------------------------|------------------------|
/// | 3 / D57 | Knuth 1.55× faster |
/// | 4 / D76 | Knuth 1.85× faster |
/// | 6 / D115 | Knuth 1.54× faster |
/// | 8 / D153 | Knuth 1.43× faster |
/// | 12 / D230 | Knuth 1.38× faster |
/// | 16 / D307 | Knuth 1.27× faster |
/// | 24 / D462 | Knuth 1.18× faster |
/// | 32 / D616 | Knuth 1.12× faster |
/// | 48 / D924 | Knuth 1.08× faster |
/// | 64 / D1232 | Knuth 1.01–1.06× faster|
///
/// The margin narrows with width but never crosses (an exploratory 96-
/// and 128-limb probe still favours Knuth ~1.10×, and the curve has
/// plateaued — no crossover exists at any reachable width). The
/// `balanced` (square `rem`/`div_rem`) shape never meets the
/// `num_m ≥ 2·den_n` gate and favours Knuth ~1.4× throughout.
///
/// Therefore the optimum is to **never engage** the block engine within
/// the supported range: the widest storage tier is D1232 = 64 limbs (a
/// cross-scale dividend reaches 128 limbs), so a threshold of `65`
/// guarantees every supported divide takes the faster Knuth engine while
/// leaving the engine + gate intact for a future true recursive-BZ
/// kernel. (Lowering toward `8`/`16` would *regress* every
/// D307+ wide divide by engaging the slower block engine.)
pub const BZ_THRESHOLD: usize = 65;
/// u128-limb Knuth ([`Algorithm::KnuthU128Limb`]) engagement threshold, in
/// u64 divisor limbs. **Policy data.**
///
/// **Benched** (`div_kernel_ab`, u128 base-2¹²⁸ vs u64 base-2⁶⁴, wide `2n`/`n`
/// shape — the decimal `/` scaled-numerator shape; the limb-width win
/// materialises only on this shape). After the q̂-reciprocal was hoisted out
/// of the per-digit loop (~12–15% faster u128), the policy-map
/// bisected the even-`n` crossover cleanly between 22 and 24:
///
/// | den_n | wide `2n`/`n` |
/// |-------|----------------|
/// | 16 | u64 2.46× |
/// | 18 | u64 1.07× |
/// | 20 | tie (1.01×) |
/// | 22 | u64 1.02× |
/// | 24 | **u128 1.11×** |
/// | 26 | **u128 1.10×** |
/// | 30 | **u128 1.35×** |
/// | 32 | **u128 1.42×** |
///
/// Clean u128 win from **den_n = 24** upward (cross-checked: den_n=24 is u128
/// 1.10–1.11× across two independent runs). u128
/// is routed for an **even** divisor `≥ 24` limbs with a `≥ 2·n` dividend,
/// INCLUDING the den_n ≥ 65 working widths the decimal `÷10^w` rescale + wide
/// transcendentals reach (same map: u128 wins 1.68–1.78× over
/// Burnikel–Ziegler at den_n 96/128). The balanced shape (square `rem` /
/// the `Int<N>` `/` operator) and every narrow/odd divisor stay base-2⁶⁴
/// Knuth, where u128 loses ~1.5×. The engine itself falls back to `div_knuth`
/// for odd / `< 4`-limb divisors, so the matcher gate is the perf carve-out.
const U128_DIV_THRESHOLD: usize = 24;
// ── 3. the matcher: `select` (no const axis) → `select_for_limbs` ─────
/// The top-level matcher. Division has no const-width axis (its operands'
/// lengths are independent runtime values — see the module docs), so unlike
/// a `select<N>` unary policy this is **non-generic** and always defers the
/// choice to the runtime [`select_for_limbs`]. A future limb refinement
/// (e.g. routing an even, wide divisor to a u128-limb engine) is a **runtime
/// arm inside `select_for_limbs`** — gated on the runtime `den_n`, where the
/// width information actually is — NOT a const verdict here.
const
/// Select the division engine for an operand pair's **limb shape**. The
/// sibling of [`select`]: `select` keys on the const width, this keys on
/// the runtime effective limb counts, which it computes itself:
///
/// It works the counts out itself, and **only the ones a branch needs** —
/// passing raw slices (rather than pre-computed counts from [`dispatch`])
/// means the dividend is never walked on the paths that don't look at it:
///
/// * `den_n` — the **divisor's** effective limb count (Knuth's `n`):
/// `den.len()` with trailing zero limbs stripped. `den_n == 0` is a
/// divide-by-zero (asserted here). Always needed.
/// * the **dividend's** effective limb count (`num.len()` with top zero
/// limbs stripped) is computed **lazily**, only once the divisor reaches
/// the smaller [`U128_DIV_THRESHOLD`] gate (both wide engines want a
/// `≥ 2·n` dividend). The common cases (single-limb divisor → `Rem`; any
/// `2..U128_DIV_THRESHOLD`-limb divisor → `Knuth`) never strip the
/// dividend at all.
///
/// Routing: a single-limb divisor takes the hardware [`Algorithm::Rem`]
/// path (covers every `10^scale`, `scale ≤ 19`); a wide (`≥ 2·n` dividend)
/// EVEN divisor of at least [`U128_DIV_THRESHOLD`] limbs takes the u128-limb
/// Knuth engine; everything else takes base-2⁶⁴ Knuth.
pub
/// Effective limb count of a little-endian magnitude slice: its length with
/// trailing (most-significant) zero limbs stripped — `0` for an all-zero
/// slice.
// ── 4. the dispatcher: fold `select<N>`, run the selector, route ──────
/// Runtime divide dispatcher — the single entry every multi-limb divide
/// flows through. Folds the `select<N>` verdict (const per monomorphisation),
/// runs the runtime [`select_for_limbs`], and routes to the chosen engine;
/// `quot` / `rem` are written by that engine.
///
/// Slice-based (no `<N>`): the numerator and divisor have *independent*
/// runtime lengths that no single const width expresses — the decimal `/`
/// divides a `2N`-limb scaled numerator by an `N`-limb divisor, the slice
/// roots divide bare runtime-length slices. Every caller already holds its
/// operands as slices, so none has to manufacture a const to call this. The
/// build-max Knuth `u`/`v` scratch lives in the engine ([`div_knuth`] owns
/// it), not here — the matcher allocates nothing. A concrete-`N` caller that
/// can size scratch exactly (`Int<N>: ComputeLimbs`) sources its own buffer
/// family and calls the chosen engine's `*_into` variant.
pub