// SPDX-FileCopyrightText: 2026 John Moxley
// SPDX-License-Identifier: MIT OR Apache-2.0
//! Truncated-low multiply policy — the limb-width (`u64` / `u128`) matcher.
//!
//! [`BigInt::wrapping_mul_low_u128`] computes `(a · b) mod 2^(64·N)` — the
//! low `N` limbs of the product, the high half never formed — via the ONE
//! generic kernel [`mul_low_limb`]`<N, L: Limb>`. There is a single
//! algorithm (truncated-low schoolbook); what this policy owns is the
//! **second matcher axis** (`docs/ARCHITECTURE.md` → "Limb width — the
//! matcher's second axis"): the [`LimbSize`] the kernel runs in.
//!
//! `u128` limbs halve the limb count (≈¼ the partial products at the cost
//! of a wider 128×128 inner step), so they win on the **wide even** work
//! widths but lose to plain `u64` at narrow even widths (the pack/unpack
//! and wider-multiply overhead is not amortised). Which cells win is a
//! per-`N` property settled by microbench (`benches/micro/mul_low_u128_ab.rs`)
//! and recorded in [`limb_size`] as policy DATA — NOT a blanket rule and
//! NOT a kernel literal. `u128` is gated to **even `N`** by
//! [`LimbSize::for_packing`] (packing pairs two `u64` per `u128`; an odd
//! `N` would drop the top limb), so every entry stays even-`N`-correct.
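//!
//! A rough worked count behind the "≈¼" figure, assuming the kernel forms
//! only the partial products `a[i]·b[j]` with `i + j < n` (an assumption
//! about [`mul_low_limb`], not a measured or documented figure):
//!
//! ```text
//! products(n) = n·(n + 1) / 2          (limb pairs with i + j < n)
//!
//! N = 8  as u64 limbs:  8·9 / 2   = 36     as 4 u128 limbs: 4·5 / 2 = 10  (≈ 0.28×)
//! N = 16 as u64 limbs:  16·17 / 2 = 136    as 8 u128 limbs: 8·9 / 2 = 36  (≈ 0.26×)
//! ```
//!
//! The ratio approaches ¼ only as `N` grows, and each `u128` product is
//! itself wider, which is why the crossover is left to the microbench.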
//!
//! [`BigInt::wrapping_mul_low_u128`]: crate::int::types::traits::BigInt::wrapping_mul_low_u128
//! [`mul_low_limb`]: crate::int::algos::mul::mul_schoolbook::mul_low_limb
//! [`LimbSize`]: crate::int::types::compute_limbs::LimbSize
use crate::int::algos::mul::mul_schoolbook::mul_low_limb;
use crate::int::types::compute_limbs::LimbSize;
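// Illustrative sketch only, not the crate's kernel or API. Assuming
// little-endian `u64` limbs, the hypothetical `limb_width_sketch` module
// below models the truncated-low product at both limb widths and the
// even-`N` packing gate. Every name in it is invented for illustration; the
// real entry points are `mul_low_limb` and `LimbSize::for_packing`, whose
// signatures are not reproduced here.
#[allow(dead_code)]
mod limb_width_sketch {
    /// `(lo, hi)` halves of the full 256-bit product `x * y`, built from
    /// four 64×64 partial products.
    fn widening_mul(x: u128, y: u128) -> (u128, u128) {
        const MASK: u128 = u64::MAX as u128;
        let (x0, x1) = (x & MASK, x >> 64);
        let (y0, y1) = (y & MASK, y >> 64);
        let (p00, p01, p10, p11) = (x0 * y0, x0 * y1, x1 * y0, x1 * y1);
        // Middle column: three values below 2^64, so the sum fits in a u128.
        let mid = (p00 >> 64) + (p01 & MASK) + (p10 & MASK);
        let lo = (p00 & MASK) | (mid << 64);
        let hi = p11 + (p01 >> 64) + (p10 >> 64) + (mid >> 64);
        (lo, hi)
    }

    /// Pair two `u64` limbs per `u128` limb (low limb first).
    fn pack(x: &[u64]) -> Vec<u128> {
        x.chunks_exact(2)
            .map(|p| u128::from(p[0]) | (u128::from(p[1]) << 64))
            .collect()
    }

    /// Low `n` limbs of `a * b` in `u64` limbs. Only the products with
    /// `i + j < n` are formed; the carry out of the top limb is dropped,
    /// which is the reduction mod `2^(64·n)`.
    pub fn mul_low_u64(a: &[u64], b: &[u64]) -> Vec<u64> {
        assert_eq!(a.len(), b.len());
        let n = a.len();
        let mut out = vec![0u64; n];
        for i in 0..n {
            let mut carry = 0u64;
            for j in 0..n - i {
                // product + accumulator + carry never exceeds 2^128 - 1.
                let t = u128::from(a[i]) * u128::from(b[j])
                    + u128::from(out[i + j])
                    + u128::from(carry);
                out[i + j] = t as u64;
                carry = (t >> 64) as u64;
            }
        }
        out
    }

    /// The same product computed in `u128` limbs. Returns `None` for odd
    /// `n`, where packing two `u64` per `u128` would drop the top limb.
    pub fn mul_low_via_u128(a: &[u64], b: &[u64]) -> Option<Vec<u64>> {
        assert_eq!(a.len(), b.len());
        if a.len() % 2 != 0 {
            return None;
        }
        let (a, b) = (pack(a), pack(b));
        let m = a.len();
        let mut out = vec![0u128; m];
        for i in 0..m {
            let mut carry = 0u128;
            for j in 0..m - i {
                let (lo, hi) = widening_mul(a[i], b[j]);
                let (s, c1) = lo.overflowing_add(out[i + j]);
                let (s, c2) = s.overflowing_add(carry);
                out[i + j] = s;
                carry = hi + u128::from(c1) + u128::from(c2);
            }
        }
        // Unpack each u128 limb back into two u64 limbs, low half first.
        Some(out.iter().flat_map(|&l| [l as u64, (l >> 64) as u64]).collect())
    }
}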
// ── 1. the algorithm — singleton: truncated-low schoolbook ────────────
/// The truncated-low multiply algorithm. A singleton: there is one
/// algorithm (the truncated-low schoolbook, [`mul_low_limb`] — the variant
/// is the CamelCase of the kernel fn minus the `mul_` prefix).
///
/// The [`LimbSize`] axis is the algorithm's OWN second-stage choice
/// ([`Algorithm::limb_size`]), selected *after* the algorithm and *by* it —
/// the u64/u128 crossover is algorithm-dependent, so it is co-located with
/// the algorithm rather than with the verdict.
///
/// [`mul_low_limb`]: crate::int::algos::mul::mul_schoolbook::mul_low_limb
// ── 2. the verdict — the algorithm (limb width is the algorithm's own) ─
/// A settled algorithm. The canonical verdict shape: one algorithm at every
/// `N`, so it is always `ByAlgorithm` (matching the const `add`/`sub`/`cmp`
/// policies). The limb width is NOT carried here — it is the chosen
/// algorithm's own [`Algorithm::limb_size`], derived in [`dispatch`].
// ── 3. the matcher ────────────────────────────────────────────────────
/// Pick the algorithm for the truncated-low product. One algorithm at every
/// width, so this is width-independent; the chosen algorithm's own
/// [`Algorithm::limb_size`] carries the only `N`-dependent decision.
const
/// Resolve the full verdict: the algorithm plus its own limb width for this
/// `N`. A named `const fn` (rather than statements inline in `dispatch`'s
/// `const { … }` block) because under `generic_const_exprs` (the nightly
/// `cross-scale-ops` / `exact-scratch-nightly` builds) a generic anonymous
/// constant only admits expression trees — a single call like this folds;
/// a statement block does not.
const
// ── 4. the dispatcher: resolve the algorithm, then its limb width ─────
/// Truncated-low product `out = (a · b) mod 2^(64·N)` — the single site
/// [`BigInt::wrapping_mul_low_u128`] flows through. Two-stage verdict: the
/// algorithm is resolved first, then asked for its own benched limb width
/// ([`Algorithm::limb_size`]). Both are const here, so the `const { … }`
/// block folds them and this compiles to one direct `mul_low_limb::<N, _>`
/// call per monomorphisation, with the unchosen arm eliminated as dead code.
/// `out` is written in full (the kernel zeroes its own accumulator);
/// bit-identical to [`BigInt::wrapping_mul`] mod `2^(64·N)` at either width.
///
/// [`BigInt::wrapping_mul_low_u128`]: crate::int::types::traits::BigInt::wrapping_mul_low_u128
/// [`BigInt::wrapping_mul`]: crate::int::types::traits::BigInt::wrapping_mul
pub