decimal-scaled 0.3.0

Const-generic base-10 fixed-point decimals (D9/D18/D38/D76/D153/D307) with integer-only transcendentals correctly rounded to within 0.5 ULP — exact at the type's last representable place. Deterministic across every platform; no_std-friendly.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
# Algorithms used in `decimal-scaled`

Catalogue of the published algorithms the crate evaluates, with
academic citations and the source files where each is implemented.
This is engineering credit - it complements `LICENSE-THIRD-PARTY`
(which covers verbatim/adapted code from upstream repositories) by
giving the *idea* attributions. For the lines-of-code attributions
see `LICENSE-THIRD-PARTY`.

## Integer arithmetic

### Möller–Granlund magic-number division by an invariant

Used for the `÷ 10^SCALE` step in every `Mul` / `Div` operator and in
`rescale`. The divisor `10^SCALE` is known at compile time, so a
pre-computed magic constant and a single 128-bit multiplication plus
a one-step correction replace a generic divide instruction. The
crate ships a 39-entry table (`MG_EXP_MAGICS`, scales 0–38) and the
256/128-bit fast-2-word divide built around it.

> Möller, N. and Granlund, T. (2011). **"Improved Division by Invariant
> Integers."** *IEEE Transactions on Computers* **60(2)**, 165–175.
> DOI: [10.1109/TC.2010.143]https://doi.org/10.1109/TC.2010.143.

Earlier foundational reference for the magic-multiplier idea:

> Granlund, T. and Montgomery, P. L. (1994). **"Division by Invariant
> Integers using Multiplication."** *Proc. PLDI '94*. ACM, 61–72.
> DOI: [10.1145/178243.178249]https://doi.org/10.1145/178243.178249.

Implementation: `src/mg_divide.rs` (`mul2`, `div_exp_fast_2word`,
`div_exp_fast_2word_with_rem`, `MG_EXP_MAGICS`). The algorithm shape
was adapted from the
[`primitive_fixed_point_decimal`](https://github.com/WuBingzheng/primitive_fixed_point_decimal)
crate - see `LICENSE-THIRD-PARTY` for the verbatim attribution.

Further reading:

- Wikipedia - [Division algorithm § Division by a constant]https://en.wikipedia.org/wiki/Division_algorithm#Division_by_a_constant
- Wolfram MathWorld - [Division]https://mathworld.wolfram.com/Division.html
- Niels Möller's homepage: <https://www.lysator.liu.se/~nisse/>
- Torbjörn Granlund's homepage (GMP project): <https://gmplib.org/~tege/>

### Limb storage shape — `[u64; 2·N]`

The wide-integer types (the eight power-of-two widths `Int256`,
`Int512`, `Int1024`, `Int2048`, `Int4096`, `Int8192`, `Int12288`,
`Int16384` plus the half-width siblings `Int192`, `Int384`, `Int768`,
`Int1536`, `Int3072`, `Int6144`) are stored as little-endian
`[u64; L]` arrays where `L` is the bit-width divided by 64. The choice exposes native `u64 × u64 →
u128` and `u128 / u64` hardware instructions directly (Zen 4 / Intel
Golden Cove issue one widening 64×64 mul per cycle in steady state)
where the historical `[u128; N]` layout had to soft-emulate every
mul as four `u64 × u64` sub-products plus a nested carry chain. The
public `from_limbs_le([u128; N])` / `limbs_le() -> [u128; N]` API
preserves its u128-shaped signatures via a 4-line const-fn boundary
conversion, so the wire format and downstream pattern-matching stay
bit-stable.

Implementation: every `limbs_*_u64` routine in
`src/wide_int/mod.rs`; the `$U` / `$S` storage in
`src/wide_int/macros/mod.rs`.

### Base-2⁶⁴ schoolbook multiplication

Standard `O(n²)` schoolbook on u64 limbs, using the native
`u64 × u64 → u128` widening multiply for each sub-product.
Karatsuba was implemented (and tested) but lost to schoolbook at
every tier this crate emits because the per-cycle widening mul
throughput beats Karatsuba's `3·(n/2)² + add/sub` overhead until
`n > ~64` limbs (beyond our widest tier). The Karatsuba code is
retained in `src/wide_int/mod.rs` (u128-base) as a reference / future
SIMD baseline.
Implementation: `src/wide_int/mod.rs::limbs_mul_u64`.

Further reading:

- Wikipedia - [Multiplication algorithm]https://en.wikipedia.org/wiki/Multiplication_algorithm
- Wolfram MathWorld - [Multiplication]https://mathworld.wolfram.com/Multiplication.html

### Möller–Granlund 2-by-1 invariant reciprocal in Knuth's q̂ loop

Inside [`Knuth Algorithm D`](#knuth-algorithm-d-multi-limb-divide),
each quotient limb's q̂ estimate would otherwise cost a 64-iteration
bit-recovery loop. Precomputing the divisor's 2-by-1 reciprocal
(Möller-Granlund Algorithm 4) once per call collapses the q̂ step
to ~2 multiplies plus a constant fix-up. Setup cost is one hardware
`u128 / u64` divide; amortised across the `m + 1` quotient-limb
estimations the per-limb work drops by ~30×.

> Möller, N. and Granlund, T. (2011). **"Improved Division by
> Invariant Integers."** *IEEE Transactions on Computers* **60(2)**,
> 165–175.

Implementation: `src/wide_int/mod.rs::MG2by1U64`.

A 3-by-2 sibling (`MG3by2U64`, also implemented per MG Algorithm 5
with the paper's Algorithm 6 reciprocal refinement that accounts
for `d0`) is kept available for arbitrary-divisor use cases. It was
*not* faster than 2-by-1 + refinement loop on decimal divisors
because the refinement loop almost never fires for our
(well-conditioned) divisors, so the 3-by-2's per-call extra
multiply costs more than it saves.

### Knuth Algorithm D — multi-limb divide

Textbook Algorithm D (Knuth, TAOCP Vol. 2, §4.3.1) adapted to base
2⁶⁴. Normalise the divisor so its top limb has the high bit set,
then for each quotient limb estimate q̂ from the top two limbs of
the running dividend (via the MG 2-by-1 reciprocal above), refine
once against the second-from-top divisor limb, multiply-subtract,
and add-back on the rare miss. Complexity `O(m·n)` limb-ops vs the
shift-subtract fallback's `O((m+n)·n·64)`.

Implementation: `src/wide_int/mod.rs::limbs_divmod_knuth_u64`,
fronted by `limbs_divmod_dispatch_u64`.

### Base-2⁶⁴ schoolbook long division (u64-divisor fast path)

For divisors that fit a single u64 word, the crate uses one hardware
`u128 / u64` divide per dividend limb — every `10^scale` with
`scale ≤ 19` lands here. This is the standard schoolbook long
division, now riding the native hardware instruction directly
(previously this path had to split each u128 limb into 64-bit halves
and do two divides per limb).
Implementation: `src/wide_int/mod.rs::limbs_divmod_u64` (fast path B);
`src/mg_divide.rs::div_long_256_by_128` (256-bit specialisation for
D38 transcendentals, still u128-typed by design).

Further reading:

- Wikipedia - [Long division]https://en.wikipedia.org/wiki/Long_division
- Wolfram MathWorld - [Long division]https://mathworld.wolfram.com/LongDivision.html

### Binary shift-subtract long division (const fallback)

Last-resort divide for arbitrary multi-limb divisors *in const
context*. Runtime callers route through the
[`limbs_divmod_dispatch_u64`](#knuth-algorithm-d-multi-limb-divide)
dispatcher to Knuth instead; this path stays as the `const fn`
backstop for `wrapping_div` / `wrapping_rem` and as the setup path
for the MG reciprocal computation.
Implementation: `src/wide_int/mod.rs::limbs_divmod_u64` general path.

Further reading:

- Wikipedia - [Division algorithm § Restoring division]https://en.wikipedia.org/wiki/Division_algorithm#Restoring_division
- Wolfram MathWorld - [Division]https://mathworld.wolfram.com/Division.html

## Roots

### Newton iteration for integer square root (`isqrt`)

`x_{k+1} = (x_k + N / x_k) / 2`, started from a power-of-2
overestimate so the sequence decreases monotonically. Converges
quadratically.

At wide tiers the per-iteration divide routes through
[`limbs_divmod_dispatch_u64`](#knuth-algorithm-d-multi-limb-divide),
i.e. Knuth Algorithm D with the MG 2-by-1 q̂ reciprocal. Earlier
versions used the const-context `limbs_divmod_u64` shift-subtract
path here, which made wide-tier sqrt 24–92× slower than necessary;
swapping to the runtime dispatcher closes that gap completely. D38
keeps its hand-tuned `isqrt_256` because the 256-bit specialisation
out-paces the generic path at single-limb scales.

Implementation: `src/mg_divide.rs::isqrt_256`,
`src/wide_int/mod.rs::limbs_isqrt_u64`.

Further reading:

- Wikipedia - [Methods of computing square roots § Heron's method]https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Heron's_method
- Wikipedia - [Newton's method § Description]https://en.wikipedia.org/wiki/Newton%27s_method#Description (the parent recurrence)
- Wolfram MathWorld - [Square Root]https://mathworld.wolfram.com/SquareRoot.html, [Newton's Method]https://mathworld.wolfram.com/NewtonsMethod.html

### Newton iteration for integer cube root (`icbrt`)

`x_{k+1} = (2·x_k + N / x_k²) / 3`. Same monotone-decreasing setup
as `isqrt`. Implementation: `src/mg_divide.rs::icbrt_384`,
`src/macros/wide_roots.rs` (decl_wide_roots! emits a 384/512-bit
variant per wide tier).

Further reading:

- Wikipedia - [Cube root § Numerical methods]https://en.wikipedia.org/wiki/Cube_root#Numerical_methods
- Wolfram MathWorld - [Cube Root]https://mathworld.wolfram.com/CubeRoot.html

### Correctly-rounded sqrt / cbrt

After the integer root `q = floor(N^{1/k})`, the crate decides
"round up to `q+1`?" by an integer comparison of `N` against the
midpoint, which is an integer for sqrt (the midpoint test is
`N − q² > q`) and a multiple of `1/8` for cbrt (the test is
`8N ≥ (2q + 1)³`). For integer `N` the midpoint is never an integer
in either case, so the rounding decision is mode-independent -
every `RoundingMode` agrees with the half-to-nearest choice.
Implementation: `src/mg_divide.rs::sqrt_raw_correctly_rounded` /
`cbrt_raw_correctly_rounded`; the wide-tier counterparts in
`src/macros/wide_roots.rs`.

Further reading:

- Wikipedia - [IEEE 754 § Roundings to nearest]https://en.wikipedia.org/wiki/IEEE_754#Roundings_to_nearest (the "correctly rounded" contract the crate emulates at the storage scale)

## Transcendentals

### `ln` via multi-level sqrt argument reduction + Mercator's artanh

Range-reduce `x = 2^k · m` with `m ∈ [1, 2)`. Then apply `l` square
roots to shrink `m → m^(1/2^l)`, so the artanh-series argument
`t = (m' − 1)/(m' + 1)` is bounded by `0.35 · 2^-l`. The series then
converges in `~p / (2 + 2l)` pair-terms instead of `~p / 2`, traded
against `l` extra wide-isqrt calls. We pick `l ≈ √p_bits / 4`
empirically.

Reassembly: `ln(m) = 2^(l+1) · artanh(t)` (the extra `2^l` factor
folds into the bit-shift since `ln(m^(1/2^l)) = ln(m) / 2^l`). The
storage-scale identity `ln(x) = k·ln 2 + ln(m)` stays unchanged.

> Brent, R. P. (1976). **"Multiple-precision zero-finding methods
> and the complexity of elementary function evaluation."** In
> *Analytic Computational Complexity*, Academic Press.

Implementation: `src/macros/wide_transcendental.rs::ln_fixed`. The
fastnum crate uses the same sqrt-halving recursion as its `ln_`
inner — we cross-validated against it.

### `ln` via plain `artanh` series (legacy reference)

Range-reduce `x = 2^k · m` with `m ∈ [1, 2)`, then compute
`ln(m) = 2·artanh((m − 1) / (m + 1))`. The argument `t = (m − 1) /
(m + 1)` lies in `[0, 1/3]`, so the Mercator series
`artanh(t) = t + t³/3 + t⁵/5 + …` converges as roughly `3^(-n)` -
about 22 terms per decimal digit.

Mercator's logarithm series:

> Mercator, N. (1668). *Logarithmotechnia*. (Cited via Borwein &
> Borwein, "Pi and the AGM", 1987, Wiley.)

The artanh form is a textbook identity; combined with bit-by-bit
range reduction it's sometimes called the "Cody–Waite" approach
after the influential 1980 implementation:

> Cody, W. J. and Waite, W. (1980). **"Software Manual for the
> Elementary Functions."** Prentice-Hall.

Implementation: `src/log_exp_strict.rs::ln_fixed` (D38),
`src/macros/wide_transcendental.rs::ln_fixed` (every wide tier — D56 / D76 / D114 / D153 / D230 / D307 / D461 / D615 / D923 / D1231).

Further reading:

- Wikipedia - [Mercator series]https://en.wikipedia.org/wiki/Mercator_series (the `ln(1+x)` expansion at the top)
- Wikipedia - [Inverse hyperbolic functions § Series expansions]https://en.wikipedia.org/wiki/Inverse_hyperbolic_functions#Series_expansions (the `artanh` series the crate evaluates)
- Wolfram MathWorld - [Mercator Series]https://mathworld.wolfram.com/MercatorSeries.html, [Inverse Hyperbolic Tangent]https://mathworld.wolfram.com/InverseHyperbolicTangent.html

### `exp` via two-stage argument reduction + Taylor

Wide-tier exp uses Brent's two-stage argument reduction (dashu's
`exp_internal` pattern, traced to Brent 1976 §3):

1. **Stage 1 — modular:** `k = round(v/ln 2)`; `s = v − k·ln 2`,
   giving `|s| ≤ ln 2 / 2 ≈ 0.347`.
2. **Stage 2 — multiplicative:** `s ← s / 2^n` with
   `n ≈ √(precision_bits)`. After both reductions the Taylor
   argument satisfies `|r| < 2⁻ⁿ ≈ 2⁻√ᵖ`, so the series converges
   in `O(√p)` terms instead of `O(p)`. n is chosen via integer
   `sqrt(3w + 1)` (using `w·log₂(10) ≈ 3.32w` as the bit estimate).
3. **Taylor** on the reduced argument.
4. **Reassembly:** square `n` times to undo stage 2, then bit-shift
   by `k` to undo stage 1.

The squaring step replaces `≈ 60` Taylor mul+div pairs with
`n` plain wide multiplies, which is the dominant saving (a divide
is more expensive than a multiply at our widths even after the
u64 storage migration).

> Brent, R. P. (1976). **"Fast multiple-precision evaluation of
> elementary functions."** *Journal of the ACM* **23(2)**, 242–251.
> DOI: [10.1145/321941.321944]https://doi.org/10.1145/321941.321944.

Implementation: `src/macros/wide_transcendental.rs::exp_fixed`;
narrow tier in `src/log_exp_strict.rs`.

Further: `dashu-float::exp::Context::exp_internal` is the modern
reference implementation we cross-checked against.

### Cached `10^w` divisor in Taylor / AGM / Newton inner loops

Every `mul(a, b, w)` / `div(a, b, w)` in `wide_transcendental` does
a `round_div` against `10^w`. Computing `lit(10).pow(w)` from
scratch on each call costs ~10–50 µs at D307<150> (`w=180`, ~8 wide
squarings followed by ~180 cumulative multiplies). The cached
variants `mul_cached(a, b, pow10_w)` / `div_cached(...)` accept
a precomputed `10^w` and skip the recomputation; the inner Taylor
/ AGM / Newton loops hoist `let pow10_w = pow10(w);` once and pass
it down, saving ~20 % on every wide transcendental.

Implementation: `src/macros/wide_transcendental.rs::mul_cached`,
`div_cached`; consumed by `exp_fixed`, `ln_fixed`, `sin_taylor`,
`atan_taylor`.

### `exp` via plain Taylor (legacy reference)

Range-reduce `x = k · ln 2 + s` with `|s| ≤ ln 2 / 2`, then
`exp(x) = 2^k · exp(s)`. The Taylor series for `exp(s)` converges
absolutely on the reduced interval. The same Cody–Waite shape.
Implementation: `src/log_exp_strict.rs::exp_fixed`,
`src/macros/wide_transcendental.rs::exp_fixed`.

Further reading:

- Wikipedia - [Exponential function § Computation]https://en.wikipedia.org/wiki/Exponential_function#Computation (the Taylor series and the `2^k · exp(s)` reduction)
- Wikipedia - [Taylor series § Exponential function]https://en.wikipedia.org/wiki/Taylor_series#Exponential_function
- Wolfram MathWorld - [Exponential Function]https://mathworld.wolfram.com/ExponentialFunction.html, [Maclaurin Series]https://mathworld.wolfram.com/MaclaurinSeries.html

### `sin` via [0, π/4] reduction with sin / cos branching

Reduce `v` mod `τ` to `r ∈ [−π, π]`; fold to `|r| ∈ [0, π/2]` via
`sin(π − x) = sin(x)`; then route to **two sub-kernels** based on
whether the reduced argument lies in the lower or upper half of
`[0, π/2]`:

- `r ≤ π/4`: `sin_taylor(r)` — standard
  `r − r³/3! + r⁵/5! − …` series.
- `r > π/4`: `cos_taylor(π/2 − r)``1 − r²/2! + r⁴/4! − …` series
  on an argument in `[0, π/4]`. Cos's leading constant-`1` term
  means it converges marginally faster than sin at the same
  argument, and the [0, π/4] cap halves the Taylor argument range
  vs the historic [0, π/2].

Reference: Muller 2016 §11.4 attributes the "switch to cos at π/4"
trick to standard mid-precision practice.

Implementation: `src/macros/wide_transcendental.rs::sin_fixed`
(routes to `sin_taylor` / `cos_taylor`).

### `sin_cos` joint kernel

`sin_cos_strict(self) -> (sin, cos)`: shares the Taylor evaluation
between sin and cos, recovers cos via the Pythagorean identity
`|cos| = √(1 − sin²)`. Net cost ≈ one `sin_strict` + one wide
sqrt + one wide mul vs the historic two independent sin
evaluations (`cos = sin(x + π/2)`). 2–3× faster when both values
are needed.

> Pattern adapted from `fastnum::decimal::dec::math::sin_cos`.

Implementation: `src/macros/wide_transcendental.rs::sin_cos_fixed`
and `sin_cos_strict`.

### `sin` / `cos` via plain range-reduced Taylor (legacy reference)

Earlier implementation reduced to `[0, π/2]` and ran sin Taylor
directly without the cos branch; `cos(x) = sin(x + π/2)` was a
full second sin evaluation. Replaced by the variants above.
Implementation: `src/trig_strict.rs::sin_fixed`,
`src/macros/wide_transcendental.rs::sin_fixed` / `sin_taylor`.

Further reading:

- Wikipedia - [Taylor series § Trigonometric functions]https://en.wikipedia.org/wiki/Taylor_series#Trigonometric_functions (the `sin x = x − x³/3! + …` and `cos x = 1 − x²/2! + …` series)
- Wolfram MathWorld - [Sine]https://mathworld.wolfram.com/Sine.html, [Cosine]https://mathworld.wolfram.com/Cosine.html, [Maclaurin Series]https://mathworld.wolfram.com/MaclaurinSeries.html

### `atan` via per-width argument halvings + Taylor

The identity `atan(x) = 2·atan(x / (1 + √(1 + x²)))` halves the
argument; applying it `l` times reduces `|x|` by `~2^l`, then the
Taylor series for `atan` converges in `~p_bits / (2l)` terms.

The halving count is chosen per working scale `w`:

- `w < 60` → 5 halvings (D38 / D9 / D18 strict path)
- `60 ≤ w < 110` → 6 halvings (D56 / D76 / light D114)
- `w ≥ 110` → 7 halvings (D114 / D153 / D230 / D307 / D461 / D615 /
  D923 / D1231)

Break-even rationale: each halving costs ~one wide mul + one wide
sqrt + one wide div; each saved Taylor term saves ~one wide mul.
The trade-off depends on `p_bits` and sits in the 5–7 range for
our tiers.

Implementation: `src/trig_strict.rs::atan_fixed`,
`src/macros/wide_transcendental.rs::atan_fixed`.

### `atan2` with max-branch quotient selection

`atan2(y, x)` evaluates `atan(y/x)` plus a quadrant offset. The
historical implementation always fed `y/x` to `atan_fixed`, losing
`~log₂(|y/x|)` bits when `|y| ≫ |x|` because `atan_fixed`'s
argument-halving cascade had to consume them. The current
implementation max-branches: it feeds `atan_fixed` whichever of
`y/x` or `x/y` has `|·| ≤ 1` and applies the identity
`atan(t) = sign(t)·π/2 − atan(1/t)` for `|t| > 1` to recover the
quotient. Eliminates the asymptotic-edge precision loss; modest
speed win at any `|y/x|` significantly different from 1.

Implementation: `src/trig_strict.rs::atan2_strict` (D38),
`src/macros/wide_transcendental.rs::atan2_strict` (wide tier).

### `asin` / `acos` two-range kernel

Earlier path: `asin(x) = atan(x / √(1 − x²))`. At `|x| → 1` the
`1 − x²` subtraction cancelled all leading bits, losing ≈
`log(1/(1−|x|²))` digits of precision and risking 1-ULP error at
the asymptotic edge.

Two-range kernel preserves the 0-ULP contract at every
representable input:

- `|x| ≤ 0.5`: existing `atan(x / √(1 − x²))` path. At this range
  `1 − x² ∈ [0.75, 1]` — no cancellation, full precision.
- `|x| > 0.5`: half-angle identity
  `asin(|x|) = π/2 − 2·asin(√((1 − |x|)/2))`. Recurses once on
  `√((1 − |x|)/2) ∈ (0, 0.5]`, which hits the stable branch. The
  inner `(1 − |x|)/2` is exact (no cancellation: `|x| ≤ 1`
  guarantees `1 − |x| ≥ 0`), so the precision floor scales with
  the working scale instead of with the input's distance from 1.

`acos` shares the same kernel via `acos(x) = π/2 − asin(x)`.

Implementation: `src/trig_strict.rs::asin_strict` /
`acos_strict` (D38) and the wide-tier variants in
`src/macros/wide_transcendental.rs` (`asin_strict`,
`asin_strict_with`, `acos_strict`, `acos_strict_with`).

Further reading:

- Wikipedia - [Inverse trigonometric functions § Infinite series]https://en.wikipedia.org/wiki/Inverse_trigonometric_functions#Infinite_series (the `atan` Taylor series)
- Wikipedia - [Inverse trigonometric functions § Argument halving]https://en.wikipedia.org/wiki/Inverse_trigonometric_functions (the halving identity)
- Wolfram MathWorld - [Inverse Tangent]https://mathworld.wolfram.com/InverseTangent.html

### `π` via Machin's formula (wide tier only)

`π = 16·atan(1/5) − 4·atan(1/239)`. Each `atan` is evaluated via the
crate's Taylor implementation; with the small arguments the series
converges fast.

> Machin, J. (1706). Cited via Beckmann, P. (1971). *A History of π*.
> St. Martin's Press.

Implementation: `src/macros/wide_transcendental.rs::pi`. (D38
embeds `π` to 63 fractional digits as a literal - no series at run
time, since the constant fits the working width comfortably.)

Further reading:

- Wikipedia - [Machin-like formula]https://en.wikipedia.org/wiki/Machin-like_formula (the `π = 16 atan(1/5) − 4 atan(1/239)` equation at the top)
- Wolfram MathWorld - [Machin's Formula]https://mathworld.wolfram.com/MachinsFormula.html, [Pi Formulas]https://mathworld.wolfram.com/PiFormulas.html

### Hyperbolic functions

Composed from `exp`/`ln`:
- `sinh(x) = (eˣ − e⁻ˣ) / 2`
- `cosh(x) = (eˣ + e⁻ˣ) / 2`
- `tanh(x) = sinh(x) / cosh(x)`
- `asinh(x) = ln(x + √(x² + 1))` (with the `x ≥ 1` form factored as
  `ln(x) + ln(1 + √(1 + 1/x²))` to keep `` in the working width)
- `acosh(x) = ln(x + √(x² − 1))` (analogous factoring for `x ≥ 2`)
- `atanh(x) = ln((1 + x) / (1 − x)) / 2`

All textbook identities - no specific paper attribution.
Implementation: `src/trig_strict.rs`, `src/macros/wide_transcendental.rs`.

Further reading:

- Wikipedia - [Hyperbolic functions § Definitions in terms of the exponential function]https://en.wikipedia.org/wiki/Hyperbolic_functions#Definitions (the `sinh`/`cosh`/`tanh` identities)
- Wikipedia - [Inverse hyperbolic functions § Logarithmic forms]https://en.wikipedia.org/wiki/Inverse_hyperbolic_functions#Logarithmic_representation (the `asinh`/`acosh`/`atanh` log-forms)
- Wolfram MathWorld - [Hyperbolic Functions]https://mathworld.wolfram.com/HyperbolicFunctions.html, [Inverse Hyperbolic Functions]https://mathworld.wolfram.com/InverseHyperbolicFunctions.html

## Rounding

### Half-to-even (banker's rounding) and the IEEE-754 family

The crate's default rounding rule. Implementation in
`src/rounding.rs::should_bump`, dispatched per
[`RoundingMode`](src/rounding.rs) via a strategy hook.

> IEEE Std 754-2019. **"IEEE Standard for Floating-Point Arithmetic."**
> IEEE Standards Association.

Further reading:

- Wikipedia - [Rounding § Round half to even]https://en.wikipedia.org/wiki/Rounding#Round_half_to_even (the tie-breaking rule the crate uses by default)
- Wikipedia - [IEEE 754 § Roundings to nearest]https://en.wikipedia.org/wiki/IEEE_754#Roundings_to_nearest

## Constants

The mathematical constants in `src/consts.rs` (`pi`, `tau`,
`half_pi`, `quarter_pi`, `e`, `golden`) are stored as 37-digit raw
`i128` values at `SCALE_REF = 37` (the i128 maximum for the
largest of the six). Sources:

- `pi`, `tau`, `half_pi`, `quarter_pi`: ISO 80000-2.
- `e`: OEIS A001113.
- `golden`: OEIS A001622.

## Cross-over algorithms

- **Karatsuba multiplication.** Implemented in
  `wide_int::limbs_mul_karatsuba` and dispatched to by
  `wide_int::limbs_mul_fast` when both operands are equal-length and at
  least `KARATSUBA_MIN = 16` limbs. Below that threshold, schoolbook
  wins outright; at or above it, the recursive
  `(a₁b₁, (a₁+a₀)(b₁+b₀) − a₁b₁ − a₀b₀, a₀b₀)` decomposition reduces
  the asymptotic cost. In practice the threshold means storage tiers
  through Int1024 (8 limbs) use schoolbook; the 2048-/4096-bit work
  integers behind the wide-tier strict transcendentals (Int2048 = 16
  limbs, Int4096 = 32 limbs) hit the Karatsuba path. (Karatsuba, A. and
  Ofman, Yu. (1962). *Doklady Akad. Nauk SSSR* 145, 293–294.) Anatoly
  Karatsuba (1937–2008) and Yuri Ofman are both deceased; see the
  Wikipedia biography links below. Further reading:
  [Karatsuba algorithm]https://en.wikipedia.org/wiki/Karatsuba_algorithm,
  [Anatoly Karatsuba bio]https://en.wikipedia.org/wiki/Anatoly_Karatsuba,
  [Yuri Ofman bio]https://en.wikipedia.org/wiki/Yuri_Ofman,
  [MathWorld - Karatsuba Algorithm]https://mathworld.wolfram.com/KaratsubaAlgorithm.html.
- **AGM-based ln / exp (Brent–Salamin 1976).** `ln_strict_agm`
  (every wide tier) uses Brent's identity
  `ln(s) ≈ π / (2 · AGM(1, 4/s))` with range reduction
  `ln(x) = ln(x · 2^m) − m·ln 2`. `exp_strict_agm` uses Newton's
  iteration on `ln_strict_agm`. Both converge quadratically - `O(log
  p)` iterations vs the artanh path's `O(p)` series terms - so they
  win asymptotically as working scale grows. Currently exposed as
  the alternate path; the canonical `ln_strict` / `exp_strict` stays
  on the artanh / Taylor implementations until a bench at the
  relevant working scale shows AGM winning by the
  `OVERRIDE_POLICY.md` margin. *Caveat:* the present
  implementation runs the AGM iteration at the same working scale
  `w` as the artanh path; at storage scales beyond ~30 the early-
  phase `sqrt(a·b)` step's truncation error amplifies and the
  output drops to ~p/2 bits of precision. Brent §3 fixes this by
  raising intermediate AGM precision; recorded as a follow-up.
  (Brent, R. P. (1976). "Fast multiple-precision evaluation of
  elementary functions." *J. ACM* 23(2), 242–251.) Richard Brent
  is at ANU - homepage: <https://maths-people.anu.edu.au/~brent/>.
  Further reading:
  [Arithmetic–geometric mean]https://en.wikipedia.org/wiki/Arithmetic%E2%80%93geometric_mean#Definition
  (the `aₙ₊₁ = (aₙ+bₙ)/2`, `bₙ₊₁ = √(aₙ bₙ)` recurrence),
  [Gauss–Legendre algorithm]https://en.wikipedia.org/wiki/Gauss%E2%80%93Legendre_algorithm
  (the same AGM iteration applied to π),
  [MathWorld - Arithmetic-Geometric Mean]https://mathworld.wolfram.com/Arithmetic-GeometricMean.html.
- **Burnikel–Ziegler recursive division.** `limbs_divmod_bz` in
  `src/wide_int/mod.rs` is the recursive wrapper; its base case is
  the in-crate Knuth Algorithm D port (`limbs_divmod_knuth`,
  TAOCP §4.3.1 adapted to base 2^128). Both functions sit
  alongside the canonical const-fn binary `limbs_divmod` - the
  canonical path is unchanged, and `_knuth` / `_bz` are exposed for
  bench-driven promotion. Knuth's `O(m·n)` multi-limb shape beats
  the binary path's `O((m+n)·n·128)` for any multi-limb divisor;
  BZ's recursion adds value only past the threshold (currently
  `BZ_THRESHOLD = 8` limbs) and the full §3 two-by-one /
  three-by-two recursion is recorded as the next layer to add once
  a bench shows it winning at this crate's widths. (Burnikel, C.
  and Ziegler, J. (1998). "Fast recursive division." MPI-I-98-1-022;
  Knuth, D. E. (1981). *The Art of Computer Programming, Vol. 2:
  Seminumerical Algorithms*, §4.3.1.) Donald Knuth's homepage:
  <https://www-cs-faculty.stanford.edu/~knuth/>. Further reading:
  [Division algorithm § Newton–Raphson division and recursive
   division](https://en.wikipedia.org/wiki/Division_algorithm)
  (no dedicated BZ article, but the parent page lists the
   recursive-division family),
  [MathWorld - Long Division]https://mathworld.wolfram.com/LongDivision.html.
  The Burnikel–Ziegler tech report is the canonical algorithm
  reference: [MPI-I-98-1-022]https://pure.mpg.de/rest/items/item_1819444_4/component/file_2599480/content.
- **CORDIC.** Common in hardware floating-point; not competitive
  with Taylor + reduction in a software fixed-point context.
  Further reading: [CORDIC]https://en.wikipedia.org/wiki/CORDIC
  (the rotation-mode and vectoring-mode iterations are the central
  equations there), [MathWorld - CORDIC]https://mathworld.wolfram.com/CORDIC.html.

## Related external crates (benchmark baselines only)

- [`bnum`]https://github.com/isaacholt100/bnum - fixed-width
  big-integer crate, used as a wide-int baseline in
  `benches/wide_int_backends.rs`.
- [`ruint`]https://github.com/recmo/uint - Ethereum-flavoured
  wide-integer crate, used as a 256-bit baseline.
- [`rust_decimal`]https://github.com/paupino/rust-decimal -
  96-bit-mantissa decimal crate, used as a decimal baseline.
- [`fixed`]https://gitlab.com/tspiteri/fixed - binary fixed-point
  crate, used for the I64F64 baseline.

These crates are `dev-dependencies` only - they are never compiled
into a normal build.