# Evaluating Derivatives — Comprehensive Chapter Breakdown

**Source**: *Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd Edition*
by Andreas Griewank and Andrea Walther (SIAM, 2008)

**Purpose**: Technical reference for building the **echidna** Rust library — a comprehensive implementation of algorithmic differentiation (AD) techniques.

---

## How to Read This Document

Each chapter section contains:
- **Summary**: What the chapter covers and why it matters
- **Key Concepts**: The core mathematical and algorithmic ideas
- **Implementation Relevance**: What this means for echidna's design
- **Dependencies**: What must be built before this chapter's features

The chapters build on each other. The dependency graph is roughly:

```
Ch 1-3 (Foundations) → Ch 4 (Forward Mode) → Ch 5 (Reverse Mode)
                                ↓                     ↓
                          Ch 6 (Complexity)     Ch 7 (Taping)
                                ↓                     ↓
                          Ch 8 (2nd Order) ← ← ← ← ←┘
                    ┌───────────┼───────────────┐
                    ↓           ↓               ↓
              Ch 9 (Sparse) Ch 10 (Jacobian)  Ch 11 (Hessians)
                                              Ch 12 (Checkpointing)
                                              Ch 13 (Taylor/Higher-Order)
                                    ┌───────────────┼───────────┐
                                    ↓                           ↓
                              Ch 14 (Nonsmooth)          Ch 15 (Implicit)
```

---

## Part I: Basics (Chapters 1–4)

### Chapter 1 — Introduction (pp. 1–20)

**Summary**: Motivates AD by contrasting it with finite differences and symbolic differentiation. Introduces the fundamental idea: any numerical program computes a composition of elementary operations, and chain rule application to this composition yields exact derivatives at machine precision.

**Key Concepts**:
- **Truncation error** in finite differences: step size h incurs O(h^p) truncation error, but shrinking h amplifies roundoff, so accuracy is capped well short of machine precision
- **Expression swell** in symbolic differentiation: intermediate expressions grow exponentially
- **AD's niche**: exact derivatives (to machine precision) at bounded cost relative to function evaluation
- **Two modes**: forward (tangent) and reverse (adjoint), with fundamentally different cost profiles
- **Lighthouse example**: a running example used throughout the book — position of a light spot on a wall as a function of lighthouse rotation angle. Returns in Chapters 13, 14, and 15 with increasingly complex variants

**Implementation Relevance**: Establishes the core value proposition. The library must:
- Support both forward and reverse mode from the start
- Provide derivative accuracy at machine precision (not approximations)
- Scale to real-world numerical programs (not just toy expressions)

**Dependencies**: None — this is foundational motivation.

---

### Chapter 2 — A Framework for Evaluating Functions (pp. 21–48)

**Summary**: Defines the formal computational model that underpins all AD techniques. Introduces evaluation procedures, computational graphs, and the elemental function framework.

**Key Concepts**:
- **Evaluation Procedure**: A sequence of elementary operations transforming inputs x ∈ ℝ^n to outputs y ∈ ℝ^m via intermediates v_i
  - Three-part structure: independent variables (v_{1-n}...v_0), intermediate variables (v_1...v_l), dependent variables (y_1...y_m)
  - Each v_i = φ_i(v_j)_{j ≺ i} where φ_i is an elemental function
- **Assumption (ED)**: Elemental Differentiability — all elemental functions φ_i are at least once Lipschitz-continuously differentiable on open domains containing all argument values
- **Computational Graph**: DAG with vertices for variables and edges for direct dependencies
  - Edge labels c_{ij} = ∂v_i/∂v_j are the local partial derivatives
  - The graph encodes the complete data dependency structure
- **Extended Jacobian**: The (n+l) × (n+l) matrix L with entries c_{ij}, strictly lower triangular
- **Jacobian as Schur Complement**: F'(x) = R + T(I − L)^{−1}B where L captures internal dependencies, B input effects, T output effects, R direct input→output effects
- **Elemental functions**: unary (sin, cos, exp, log, sqrt, abs, ...) and binary (+, −, ×, /, pow, ...)
- **Table 2.2**: Key notation reference used throughout the book

**Implementation Relevance**: This chapter defines echidna's core data model:
- **`EvalProcedure`**: The central representation of a differentiable computation
- **`ComputationalGraph`**: DAG with typed nodes (input, intermediate, output) and weighted edges
- **Elemental operations**: Need a trait/enum for all supported elementary functions with their local derivatives
- The three-part variable structure (inputs, intermediates, outputs) shapes the API
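
To make this concrete, here is a minimal sketch of an evaluation procedure as a topologically ordered node list with a forward evaluation sweep. The names (`Op`, `Node`, `Graph`) and the tiny elemental set are illustrative assumptions, not settled echidna API:

```rust
/// Hypothetical sketch of the Chapter 2 data model: v_i = φ_i(v_j)_{j ≺ i}.
#[derive(Debug, Clone, Copy)]
enum Op {
    Input,              // independent variable v_{1-n}, ..., v_0
    Add(usize, usize),  // binary elemental: v_i = v_a + v_b
    Mul(usize, usize),  // binary elemental: v_i = v_a · v_b
    Sin(usize),         // unary elemental:  v_i = sin(v_a)
}

struct Node {
    op: Op,
    value: f64, // v_i, filled in by the forward evaluation sweep
}

/// Evaluation procedure: nodes stored in topological order (inputs first),
/// so a single left-to-right pass respects all data dependencies.
struct Graph {
    nodes: Vec<Node>,
    outputs: Vec<usize>, // indices of the dependent variables y_1..y_m
}

impl Graph {
    fn evaluate(&mut self, inputs: &[f64]) -> Vec<f64> {
        let mut next_input = 0;
        for i in 0..self.nodes.len() {
            let v = match self.nodes[i].op {
                Op::Input => { next_input += 1; inputs[next_input - 1] }
                Op::Add(a, b) => self.nodes[a].value + self.nodes[b].value,
                Op::Mul(a, b) => self.nodes[a].value * self.nodes[b].value,
                Op::Sin(a) => self.nodes[a].value.sin(),
            };
            self.nodes[i].value = v;
        }
        self.outputs.iter().map(|&i| self.nodes[i].value).collect()
    }
}

fn main() {
    // y = x_1 · sin(x_2)
    let mut g = Graph {
        nodes: vec![
            Node { op: Op::Input, value: 0.0 },     // v_{-1} = x_1
            Node { op: Op::Input, value: 0.0 },     // v_0   = x_2
            Node { op: Op::Sin(1), value: 0.0 },    // v_1   = sin(v_0)
            Node { op: Op::Mul(0, 2), value: 0.0 }, // v_2   = v_{-1} · v_1
        ],
        outputs: vec![3],
    };
    println!("{:?}", g.evaluate(&[2.0, 3.0])); // [2·sin(3)]
}
```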

**Dependencies**: None — this is the foundational data model.

---

### Chapter 3 — Fundamentals of Forward and Reverse (pp. 49–72)

**Summary**: Derives the forward and reverse mode formulas from chain rule applied to computational graphs. Establishes the fundamental complexity results.

**Key Concepts**:
- **Forward Mode (Tangent Propagation)**:
  - v̇_i = Σ_{j ≺ i} c_{ij} · v̇_j (propagate input perturbation forward)
  - Computes one column of Jacobian per seed direction ẋ
  - Cost: O(n) × TIME(F) for full n×m Jacobian when seeding with identity
  - Natural for n ≤ m (few inputs, many outputs)
- **Reverse Mode (Adjoint Propagation)**:
  - v̄_j += v̄_i · c_{ij} for all i ≻ j (propagate output sensitivity backward)
  - Computes one row of Jacobian per adjoint seed ȳ
  - **Cheap Gradient Principle**: ∇f costs ≤ ω × TIME(F) where ω ≤ 4 (typically 2–3)
  - Natural for m ≤ n (few outputs, many inputs) — the common case in optimization
- **Jacobian-Vector Products**: Forward mode computes F'(x)·ẋ; reverse mode computes ȳ^T·F'(x)
- **Vertex elimination interpretation**: Forward = eliminate from left; Reverse = eliminate from right
- **Table 3.2–3.4**: Forward and reverse propagation rules for all elemental operations
- **Incremental vs nonincremental**: Two implementation styles for local partial accumulation

**Implementation Relevance**: Core differentiation engine:
- **Forward mode**: Dual numbers with tangent component, seed vector propagation
- **Reverse mode**: Tape recording + backward sweep, adjoint accumulation
- Both modes need the same elemental derivative rules (Tables 3.2–3.4)
- The ω ≤ 4 bound for gradients is the key selling point for optimization users
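
The forward-mode half of such an engine fits in a few lines as a dual number with operator overloading; each operator bakes in its local partials c_ij from Tables 3.2–3.4. A minimal sketch, hard-coding `f64` and three elementals (not echidna's real API):

```rust
use std::ops::{Add, Mul};

/// Forward-mode scalar: a value paired with its tangent (v, v̇).
#[derive(Debug, Clone, Copy)]
struct Dual {
    val: f64,
    tan: f64, // v̇
}

impl Dual {
    fn variable(x: f64) -> Self { Dual { val: x, tan: 1.0 } } // seed ẋ = 1
    fn sin(self) -> Self {
        // local partial c = ∂sin(v)/∂v = cos(v), so v̇_out = c · v̇
        Dual { val: self.val.sin(), tan: self.val.cos() * self.tan }
    }
}

impl Add for Dual {
    type Output = Dual;
    fn add(self, rhs: Dual) -> Dual {
        Dual { val: self.val + rhs.val, tan: self.tan + rhs.tan }
    }
}

impl Mul for Dual {
    type Output = Dual;
    fn mul(self, rhs: Dual) -> Dual {
        // product rule: (uv)' = u̇·v + u·v̇
        Dual { val: self.val * rhs.val, tan: self.tan * rhs.val + self.val * rhs.tan }
    }
}

fn main() {
    // d/dx [x·sin(x) + x] at x = 2, seeded with ẋ = 1
    let x = Dual::variable(2.0);
    let y = x * x.sin() + x;
    println!("f = {}, f' = {}", y.val, y.tan); // f' = sin(2) + 2·cos(2) + 1
}
```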

**Dependencies**: Chapter 2 (evaluation procedure and graph model).

---

### Chapter 4 — Memory Issues and Complexity Bounds (pp. 73–100)

**Summary**: Analyzes the computational and memory complexity of AD in detail. Introduces the WORK vector model for tracking costs and addresses the key memory challenge of reverse mode.

**Key Concepts**:
- **WORK vector**: 4-component (MULT, NLMULT, ADD, NLADD) tracking multiplications and additions (linear vs nonlinear separately)
- **Assumption (TA)**: Temporal Additivity — total work is sum of elemental work
- **Assumption (EB)**: Elemental Task Boundedness — each elemental operation has bounded work
- **Temporal complexity**: Forward mode adds at most WORK(F) per direction; reverse mode at most 4× WORK(F) for gradient
- **Spatial complexity of reverse mode**: Must store all intermediate values (prevalues) for backward sweep
  - Naive: O(l) storage where l = number of intermediates
  - This motivates checkpointing (Chapter 12) and taping strategies (Chapter 7)
- **FLOP counting**: Precise operation counts for forward/reverse (Tables 4.1–4.5)
- **Data flow analysis**: Liveness analysis for reducing memory — only store values needed in the backward sweep
- **Homogeneity observation**: The temporal complexities of tangent and adjoint propagation are roughly equal per direction; the asymmetry is in how many directions are needed

**Implementation Relevance**:
- Memory management is critical — reverse mode's O(l) storage is the main engineering challenge
- Need efficient tape/trace data structures from the start
- WORK vector model useful for cost estimation and optimization heuristics
- Data flow / liveness analysis can dramatically reduce memory for reverse mode
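
A sketch of how the WORK vector might be carried alongside a recorded computation for cost estimation; the `Work` struct and its field names are hypothetical stand-ins for the four components:

```rust
/// Hypothetical 4-component WORK vector (MULT, NLMULT, ADD, NLADD),
/// tracking linear and nonlinear multiplications/additions separately.
#[derive(Debug, Default, Clone, Copy)]
struct Work {
    mult: u64,
    nl_mult: u64,
    add: u64,
    nl_add: u64,
}

impl Work {
    /// Assumption (TA): total work is the sum of elemental work, so cost
    /// estimation is a single fold over the recorded operations.
    fn accumulate(mut self, elemental: Work) -> Work {
        self.mult += elemental.mult;
        self.nl_mult += elemental.nl_mult;
        self.add += elemental.add;
        self.nl_add += elemental.nl_add;
        self
    }
}

fn main() {
    // Two recorded elementals with illustrative per-operation counts.
    let ops = [
        Work { mult: 1, ..Work::default() },
        Work { nl_mult: 1, ..Work::default() },
    ];
    let total = ops.iter().fold(Work::default(), |acc, &w| acc.accumulate(w));
    println!("{:?}", total);
}
```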

**Dependencies**: Chapters 2–3.

---

## Part II: Extensions and Strategies (Chapters 5–8)

### Chapter 5 — Forward and Reverse Revisited (pp. 101–124)

**Summary**: Extends forward and reverse mode with Jacobian matrix propagation, vectorized operations, and combined forward-reverse strategies.

**Key Concepts**:
- **Jacobian Propagation**: Propagate entire Jacobian matrices V̇ (n×p tangent matrix) or W̄ (q×n adjoint matrix) instead of individual vectors
  - Forward: computes F'(x) · S for seed matrix S
  - Reverse: computes W · F'(x) for weight matrix W
- **Strip-mining**: Process multiple tangent/adjoint directions simultaneously for cache efficiency
- **Combined forward-reverse**: For intermediate Jacobians, can mix modes
  - Forward first, then reverse on the accumulated result
  - Or reverse first, then forward
- **Jacobian-free methods**: Iterative solvers (CG, GMRES) that only need Jacobian-vector products — AD provides these without ever forming the full Jacobian
- **Compression**: When Jacobian has known sparsity structure, can recover it from fewer directional derivatives than full dimension (detailed in Chapter 9)
- **Hessian-vector products**: Brief introduction — tangent-of-adjoint gives ∇²f · ẋ (detailed in Chapter 8)

**Implementation Relevance**:
- API should support both single-vector and matrix propagation modes
- Strip-mining / SIMD-friendly data layouts for vectorized tangent propagation
- Integration point for iterative linear algebra (CG, GMRES) that consumes Jv products
- Foundation for the compressed Jacobian computation in Chapter 9
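
A sketch of vector-mode tangent propagation using a const-generic bundle of P directions; seeding the bundles with the columns of a seed matrix S yields F'(x)·S in a single sweep. `DualVec` is a hypothetical name:

```rust
/// Hypothetical vector-mode dual: one value plus P tangent components,
/// i.e., one row of the propagated tangent matrix V̇.
#[derive(Clone, Copy, Debug)]
struct DualVec<const P: usize> {
    val: f64,
    tan: [f64; P],
}

impl<const P: usize> DualVec<P> {
    fn mul(self, rhs: Self) -> Self {
        let mut tan = [0.0; P];
        // The same product rule applies to every direction; the fixed-length
        // loop is friendly to SIMD autovectorization (strip-mining in registers).
        for k in 0..P {
            tan[k] = self.tan[k] * rhs.val + self.val * rhs.tan[k];
        }
        DualVec { val: self.val * rhs.val, tan }
    }
}

fn main() {
    // f(x1, x2) = x1 · x2, seeded with the 2×2 identity: one sweep gives ∇f.
    let x1 = DualVec { val: 2.0, tan: [1.0, 0.0] };
    let x2 = DualVec { val: 3.0, tan: [0.0, 1.0] };
    let y = x1.mul(x2);
    println!("f = {}, ∇f = {:?}", y.val, y.tan); // ∇f = [3, 2]
}
```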

**Dependencies**: Chapters 2–4.

---

### Chapter 6 — Implementation and Software (pp. 125–148)

**Summary**: Covers practical implementation approaches: operator overloading vs source transformation, and the software engineering of AD tools.

**Key Concepts**:
- **Operator Overloading (OO)**:
  - Redefine arithmetic operators on a custom numeric type that records operations
  - Advantages: easy to implement, works with existing code via type substitution
  - Disadvantages: overhead per operation, limited compile-time optimization, control flow recorded at runtime
  - Rust analogy: implement `Add`, `Mul`, etc. traits on an AD type
- **Source Transformation (ST)**:
  - Generate new source code that computes both function and derivatives
  - Advantages: full program analysis, optimal code generation, no runtime overhead
  - Disadvantages: requires parser/compiler infrastructure, language-specific
- **Taping / Tracing** (OO approach):
  - Record a "tape" of operations during forward evaluation
  - Replay tape backward for reverse mode
  - Tape stores: operation codes, argument indices, prevalues, partial derivatives
- **Activity analysis**: Identify which variables actually depend on inputs (active) vs constants — avoid differentiating dead code
- **Checkpointing** preview: Store selected snapshots to enable re-evaluation of segments (Chapter 12)
- **AD tools landscape**: ADIFOR, ADIC, TAF, Tapenade, ADOL-C, CppAD, etc.
- **Figure 6.1**: Taxonomy of AD implementation approaches

**Implementation Relevance**: Defines echidna's implementation strategy:
- Rust's trait system maps naturally to operator overloading approach
- Need efficient tape data structure (operation log, value storage)
- Activity analysis as an optimization pass
- Consider both eager (OO) and lazy (expression graph) evaluation strategies
- ADOL-C's design (C++ OO with tape) is the closest precedent
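
To ground the taping idea, the sketch below records a tape through free functions and then runs the backward sweep; a real operator-overloading layer would wrap the same `push` calls in `Add`/`Mul` impls. The entry layout, thread-local tape, and all names are illustrative assumptions:

```rust
use std::cell::RefCell;

/// One recorded elemental: up to two arguments with their local partials c_ij.
#[derive(Clone, Copy)]
struct Entry {
    parents: [usize; 2],
    partials: [f64; 2],
}

thread_local! {
    static TAPE: RefCell<Vec<Entry>> = RefCell::new(Vec::new());
}

#[derive(Clone, Copy)]
struct Var { idx: usize, val: f64 }

fn push(parents: [usize; 2], partials: [f64; 2], val: f64) -> Var {
    TAPE.with(|t| {
        let mut tape = t.borrow_mut();
        tape.push(Entry { parents, partials });
        Var { idx: tape.len() - 1, val }
    })
}

// Inputs record a no-op entry (zero partials contribute nothing on reverse).
fn variable(x: f64) -> Var { push([0, 0], [0.0, 0.0], x) }
fn mul(a: Var, b: Var) -> Var { push([a.idx, b.idx], [b.val, a.val], a.val * b.val) }
fn sin(a: Var) -> Var { push([a.idx, a.idx], [a.val.cos(), 0.0], a.val.sin()) }

/// Reverse sweep: incremental adjoint v̄_j += v̄_i · c_ij, tape walked backward.
fn adjoints(output: Var) -> Vec<f64> {
    TAPE.with(|t| {
        let tape = t.borrow();
        let mut adj = vec![0.0; tape.len()];
        adj[output.idx] = 1.0; // seed ȳ = 1
        for i in (0..tape.len()).rev() {
            let e = tape[i];
            adj[e.parents[0]] += e.partials[0] * adj[i];
            adj[e.parents[1]] += e.partials[1] * adj[i];
        }
        adj
    })
}

fn main() {
    let x = variable(2.0);
    let y = variable(3.0);
    let f = mul(sin(x), y); // f = y·sin(x)
    let adj = adjoints(f);
    // ∂f/∂x = y·cos(x), ∂f/∂y = sin(x)
    println!("∂f/∂x = {}, ∂f/∂y = {}", adj[x.idx], adj[y.idx]);
}
```

Note that `Var` is a small `Copy` handle (an index into the tape plus a value), which is what keeps pervasive operator overloading cheap.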

**Dependencies**: Chapters 2–5.

---

### Chapter 7 — Adjoint Code for Explicit Evaluations (pp. 149–186)

**Summary**: Details the implementation of reverse mode, including taping strategies, incremental vs nonincremental approaches, and practical adjoint code generation.

**Key Concepts**:
- **Assumption (PC)**: Path Connectedness — at most two arguments per elemental function (simplifies to binary DAG)
- **Definition (MW)**: Matrix Width — max simultaneous live values during sweep; determines memory for Jacobian propagation
- **Incremental Adjoint**: v̄_j += v̄_i · ∂φ_i/∂v_j — accumulate contributions as encountered
- **Nonincremental Adjoint**: v̄_j = Σ_{i ≻ j} v̄_i · c_{ij} — compute all at once from successors
  - Nonincremental requires knowing the complete set of successors before computing
- **Taping detail**:
  - LIFO (stack) organization: values recorded during forward, consumed during reverse
  - Three tapes: real values, integer indices, operation codes
  - Tape compression: don't record linear operations' prevalues (they're reconstructable)
- **Preaccumulation**: Compute local Jacobians of straight-line code segments before recording, reducing tape size
- **Subroutine handling**: Record/replay at subroutine boundaries; option to preaccumulate entire subroutines
- **Statement-level vs subroutine-level**: Granularity of taping
- **Split vs joint reversal**: Split = reverse each subroutine call independently; Joint = one big reverse
  - Trade-off: joint saves memory but splits are more parallelizable

**Implementation Relevance**: Core reverse mode implementation:
- Tape design: three separate stacks (reals, indices, opcodes) for cache efficiency
- Preaccumulation optimization for straight-line code segments
- Support for both incremental and nonincremental adjoint styles
- Subroutine-level taping for composability
- Memory layout matters enormously for performance
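
A sketch of the three-stack layout with the compression rule applied (linear elementals record no prevalues); opcode encodings, arities, and field names are all assumptions:

```rust
/// Hypothetical structure-of-arrays tape: opcodes, integer argument indices,
/// and real prevalues live in separate contiguous arrays.
#[derive(Default)]
struct Tape {
    opcodes: Vec<u8>,  // operation codes, one per recorded elemental
    arg_idx: Vec<u32>, // argument indices, consumed per-opcode arity
    prevals: Vec<f64>, // prevalues needed only by the backward sweep
}

impl Tape {
    const OP_ADD: u8 = 0; // c_ij = (1, 1): nothing stored in `prevals`
    const OP_MUL: u8 = 1; // c_ij = (v_b, v_a): both prevalues stored

    fn record_add(&mut self, a: u32, b: u32) {
        self.opcodes.push(Self::OP_ADD);
        self.arg_idx.extend_from_slice(&[a, b]);
        // Tape compression: addition is linear, its partials are constants,
        // so no prevalue is recorded.
    }

    fn record_mul(&mut self, a: u32, b: u32, va: f64, vb: f64) {
        self.opcodes.push(Self::OP_MUL);
        self.arg_idx.extend_from_slice(&[a, b]);
        self.prevals.extend_from_slice(&[va, vb]);
    }
}

fn main() {
    let mut t = Tape::default();
    t.record_mul(0, 1, 2.0, 3.0);
    t.record_add(2, 0);
    println!("{} ops, {} indices, {} prevalues",
             t.opcodes.len(), t.arg_idx.len(), t.prevals.len()); // 2, 4, 2
}
```

Keeping the three arrays separate means the backward sweep streams each of them sequentially, which is exactly the cache behavior the three-stack design is after.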

**Dependencies**: Chapters 2–6.

---

### Chapter 8 — Second-Order Methods (pp. 187–218)

**Summary**: Computing second derivatives (Hessians) efficiently via combinations of forward and reverse mode.

**Key Concepts**:
- **Hessian-vector product** ∇²f(x) · ẋ via tangent-of-adjoint (forward-over-reverse):
  1. Forward sweep recording tape
  2. Reverse sweep computing gradient (adjoint)
  3. Apply forward mode to the reverse sweep itself
  - Cost: O(1) × TIME(F) per Hessian-vector product
- **Second-order adjoint**: The full formula for propagating second-order information
  - Involves v̈ (second tangent), v̄ (adjoint), and w̄ (second-order adjoint) variables
  - Equations 8.8–8.12: coupled system of first and second order
- **Hessian matrix computation**: Need n Hessian-vector products for full n×n Hessian, or exploit sparsity
- **Symmetry exploitation**: Hessian is symmetric, so only need upper/lower triangle
- **Living-on-the-edge**: Compute Hessian-vector products without forming intermediate Jacobians
- **Second-order forward mode**: Propagate (v, v̇, ẇ, v̈): the value, two first-order tangents, and one second-order coefficient per variable
  - Computes ẋ^T ∇²f ẇ for a direction pair (ẋ, ẇ); taking ẇ = ẋ gives the scalar second directional derivative ẋ^T ∇²f ẋ
  - Need n(n+1)/2 direction pairs for the full n×n Hessian
- **Tables 8.1–8.5**: Propagation rules for second-order tangent and adjoint for all elementals
- **Edge pushing**: Alternative Hessian computation by pushing second-order information along graph edges

**Implementation Relevance**:
- Forward-over-reverse (tangent-of-adjoint) is the primary Hessian-vector product method
- Need composable mode nesting: forward mode applied to reverse mode code
- Second-order elemental rules (Tables 8.1–8.5) extend the first-order tables
- Hessian-vector products are critical for Newton methods in optimization
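
Forward-over-reverse does not fit in a short snippet, but the univariate second-order forward variant does. The sketch below propagates degree-2 Taylor coefficients v(t) = v0 + v1·t + v2·t^2, so along x + ẋ·t the output carries f'(x)·ẋ in v1 and ẋᵀ∇²f ẋ in 2·v2; the `Taylor2` type and its two elementals are illustrative only:

```rust
/// Hypothetical second-order forward mode via degree-2 Taylor coefficients.
#[derive(Clone, Copy, Debug)]
struct Taylor2 { c: [f64; 3] }

impl Taylor2 {
    fn variable(x: f64, xdot: f64) -> Self { Taylor2 { c: [x, xdot, 0.0] } }

    fn mul(self, r: Self) -> Self {
        // truncated convolution (the Chapter 13 multiplication rule, degree 2)
        let (u, v) = (self.c, r.c);
        Taylor2 { c: [
            u[0] * v[0],
            u[0] * v[1] + u[1] * v[0],
            u[0] * v[2] + u[1] * v[1] + u[2] * v[0],
        ] }
    }

    fn exp(self) -> Self {
        // recurrence k·e_k = Σ_{j=1..k} j·u_j·e_{k-j}
        let u = self.c;
        let e0 = u[0].exp();
        let e1 = u[1] * e0;
        let e2 = (u[1] * e1 + 2.0 * u[2] * e0) / 2.0;
        Taylor2 { c: [e0, e1, e2] }
    }
}

fn main() {
    // f(x) = exp(x²) at x = 1 with ẋ = 1: f'' = (2 + 4x²)·exp(x²) = 6e
    let x = Taylor2::variable(1.0, 1.0);
    let f = x.mul(x).exp();
    println!("f = {}, f'·ẋ = {}, ẋᵀ∇²f ẋ = {}", f.c[0], f.c[1], 2.0 * f.c[2]);
}
```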

**Dependencies**: Chapters 3–7.

---

## Part III: Sparse and Structured Jacobians (Chapters 9–11)

### Chapter 9 — Sparse Forward and Reverse (pp. 219–250)

**Summary**: Exploiting sparsity in Jacobians by propagating only the nonzero structure, avoiding unnecessary computation.

**Key Concepts**:
- **Index domains and ranges**: Track which input indices affect each intermediate (domain) and which outputs each intermediate affects (range)
  - Forward: propagate index domains d_i ⊆ {1,...,n}
  - Reverse: propagate index ranges r_j ⊆ {1,...,m}
- **Sparse tangent propagation**: Only compute/store nonzero entries of tangent vectors
  - Use compressed storage (sparse vectors) instead of dense
  - Skip computation when tangent is structurally zero
- **Sparse adjoint propagation**: Analogous with adjoint vectors
- **Structural analysis**: Determine Jacobian sparsity pattern without computing values
  - Forward sweep of index sets gives column structure
  - Reverse sweep of index sets gives row structure
- **Definition (LP)**: Local Procedure — a subset of the evaluation procedure with its own local Jacobian
- **Definition (SC)**: Jacobian Dimension and Scarcity — when the Jacobian has s << n×m nonzeros
- **Matrix Compression** (CPR seeding, after Curtis–Powell–Reid):
  - **Column compression**: If Jacobian columns can be grouped such that no two columns in a group share a nonzero row, seed with p < n directions and recover the full Jacobian
  - **Row compression**: Analogous for rows using adjoint mode
  - **Combined**: two-sided compression, using forward and reverse sweeps together
  - Equivalent to **graph coloring**: column compression ↔ distance-2 coloring of the column intersection graph
  - NR (Newsam–Ramsdell) seeding: exploit a known sparsity pattern for optimal seeding
- **Figures 9.1–9.4**: Visualization of sparse structures and compression

**Implementation Relevance**:
- Sparse Jacobian computation is essential for large-scale problems
- Need sparse vector/matrix data structures with efficient merge operations
- Index set propagation for structural analysis (precursor to numerical computation)
- Graph coloring algorithms for optimal seed generation
- CPR/NR seeding strategies as built-in optimization
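
Structural analysis reduces to set propagation, as in this sketch of forward index-domain tracking; a production version would likely use compressed bitsets rather than `BTreeSet`, and the `Pattern` type is hypothetical:

```rust
use std::collections::BTreeSet;

/// Each value carries the set of input indices it structurally depends on.
#[derive(Clone)]
struct Pattern {
    deps: BTreeSet<usize>, // index domain d_i ⊆ {0, …, n-1}
}

impl Pattern {
    fn input(i: usize) -> Self {
        Pattern { deps: BTreeSet::from([i]) }
    }
    /// Any binary elemental: d_out = d_a ∪ d_b.
    fn combine(&self, other: &Pattern) -> Pattern {
        Pattern { deps: self.deps.union(&other.deps).cloned().collect() }
    }
}

fn main() {
    // y0 = x0 · x1, y1 = x2 + x0 → Jacobian rows {0,1} and {0,2}
    let (x0, x1, x2) = (Pattern::input(0), Pattern::input(1), Pattern::input(2));
    let y0 = x0.combine(&x1);
    let y1 = x2.combine(&x0);
    println!("row 0 nonzeros: {:?}", y0.deps);
    println!("row 1 nonzeros: {:?}", y1.deps);
}
```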

**Dependencies**: Chapters 2–7.

---

### Chapter 10 — Jacobian Accumulation and Elimination (pp. 251–274)

**Summary**: Cross-country elimination — computing Jacobians by eliminating vertices from the computational graph in non-standard orders (neither pure forward nor pure reverse).

**Key Concepts**:
- **Cross-country elimination**: Eliminate vertices from computational graph in any order, not just left-to-right (forward) or right-to-left (reverse)
  - Each elimination of vertex v_i creates fill-in edges: for each predecessor j and successor k of v_i, add edge (k,j) with weight c_{kj} += c_{ki} · c_{ij}
  - Total Jacobian = product of all local eliminations
- **Vertex elimination**: Remove a vertex, connecting all its predecessors to all successors with appropriate weights
- **Edge elimination**: More fine-grained — eliminate individual edges
- **Face elimination**: Most fine-grained — eliminate individual "faces" (predecessor-vertex-successor triples)
- **Definition (BP)**: Biclique Property — structural condition for equivalence of elimination approaches
- **NP-hardness**: Optimal elimination order (minimum total multiplications) is NP-hard
  - Even for vertex elimination alone
- **Markowitz heuristic**: Greedily eliminate vertex with minimum (in-degree × out-degree) product
  - Practical and effective, inspired by sparse LU factorization
- **Simulated annealing** and other metaheuristics for better elimination orders
- **Figures 10.1–10.8**: Detailed examples of vertex, edge, and face elimination on computational graphs
- **Jacobian accumulation as matrix product**: The accumulated Jacobian is equivalent to a specific sequence of sparse matrix multiplications

**Implementation Relevance**:
- Cross-country elimination as an optimization for Jacobian computation
- Markowitz heuristic implementation for elimination ordering
- Graph transformation operations (vertex/edge/face elimination with fill-in)
- Useful for small-to-medium computational graphs where optimal ordering matters
- Less critical than forward/reverse for initial implementation
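
Vertex elimination is straightforward to prototype on an edge-weight map. In the sketch below, `w[(j, i)]` holds c_ij, `eliminate` applies the fill-in rule quoted above, and `markowitz` scores a vertex; the names and the map-based representation are chosen for clarity, not performance:

```rust
use std::collections::HashMap;

/// Eliminate vertex v: fill in c_kj += c_kv · c_vj for every predecessor j
/// and successor k, then delete all edges touching v.
fn eliminate(w: &mut HashMap<(usize, usize), f64>, v: usize) {
    let ins: Vec<(usize, f64)> = w.iter()
        .filter(|(&(_, dst), _)| dst == v)
        .map(|(&(src, _), &c)| (src, c)).collect();
    let outs: Vec<(usize, f64)> = w.iter()
        .filter(|(&(src, _), _)| src == v)
        .map(|(&(_, dst), &c)| (dst, c)).collect();
    for &(j, c_vj) in &ins {
        for &(k, c_kv) in &outs {
            *w.entry((j, k)).or_insert(0.0) += c_kv * c_vj;
        }
    }
    w.retain(|&(src, dst), _| src != v && dst != v);
}

/// Markowitz degree: in-degree × out-degree, the greedy elimination cost.
fn markowitz(w: &HashMap<(usize, usize), f64>, v: usize) -> usize {
    let ind = w.keys().filter(|&&(_, dst)| dst == v).count();
    let outd = w.keys().filter(|&&(src, _)| src == v).count();
    ind * outd
}

fn main() {
    // y = x·sin(x): vertices 0: x, 1: sin(x), 2: y
    let x: f64 = 2.0;
    let mut w = HashMap::from([
        ((0, 1), x.cos()), // c_{1,0} = cos(x)
        ((0, 2), x.sin()), // c_{2,0} = sin(x) (direct dependence of y on x)
        ((1, 2), x),       // c_{2,1} = x
    ]);
    println!("Markowitz degree of vertex 1: {}", markowitz(&w, 1));
    eliminate(&mut w, 1); // the surviving edge (0, 2) is dy/dx
    println!("dy/dx = {}", w[&(0, 2)]); // sin(2) + 2·cos(2)
}
```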

**Dependencies**: Chapters 2–3, Chapter 9.

---

### Chapter 11 — Hessians, Higher Derivatives, and Structured Methods (pp. 275–300)

**Summary**: Efficient computation of Hessians exploiting structure: partial separability, symmetric computational graphs, and connections to Chapter 8's second-order methods.

**Key Concepts**:
- **Definition (SG)**: Symmetric Computational Graph — graph where forward and reverse sweeps have identical structure
  - Rule 16: Nonincremental reverse mode naturally produces a symmetric graph
  - This symmetry is key for efficient Hessian computation
- **Hessian Product Form**: ∇²f = Ź^T D Ź where Ź is the extended tangent matrix and D is diagonal
  - Computed by one forward sweep (getting Ź and D) + matrix multiply
  - D contains second derivatives of elemental functions
- **Definition (PS)**: Partial Separability — f(x) = Σ f_i(U_i x) where each f_i depends on few variables (U_i selects a subset)
  - **Value separability**: Direct sum decomposition of function value
  - **Argument separability**: Each component depends on few input variables
  - Hessian of partially separable function is sum of low-rank terms
- **Rule 17**: Cheap gradient principle does NOT extend to cheap Jacobians
  - A dense m×n Jacobian can require on the order of min(m,n) sweeps; no constant-factor bound relative to TIME(F) exists
- **Rule 18**: Reverse beats forward for Jacobians only when nonlinear matrix width >> nonlinear height
- **Coloring for Hessians**: Exploit symmetry to reduce number of Hessian-vector products
  - Star coloring, acyclic coloring for direct/indirect recovery
  - Fewer colors needed than for general Jacobians due to symmetry
- **Tables 11.1–11.3**: Complexity comparisons for various Hessian computation strategies

**Implementation Relevance**:
- Partial separability detection and exploitation — common in optimization (each constraint involves few variables)
- Hessian product form (Ź^T D Ź) as an efficient representation
- Star/acyclic coloring for compressed Hessian computation
- Symmetric graph construction for Hessian-specific optimizations
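
One possible shape for partial-separability exploitation: assemble the Hessian from small dense element blocks U_iᵀ∇²f_i U_i. The `Element` interface below (a variable list plus a local-Hessian callback) is a hypothetical sketch:

```rust
/// For partially separable f(x) = Σ_i f_i(U_i x), the Hessian is a sum of
/// small dense blocks scattered into the full matrix.
struct Element {
    vars: Vec<usize>,             // which inputs this f_i touches
    hess: fn(&[f64]) -> Vec<f64>, // dense local Hessian, row-major, vars×vars
}

fn assemble_hessian(n: usize, elements: &[Element], x: &[f64]) -> Vec<f64> {
    let mut h = vec![0.0; n * n];
    for el in elements {
        let xs: Vec<f64> = el.vars.iter().map(|&j| x[j]).collect();
        let local = (el.hess)(&xs);
        let k = el.vars.len();
        // scatter-add the k×k block into the n×n Hessian
        for a in 0..k {
            for b in 0..k {
                h[el.vars[a] * n + el.vars[b]] += local[a * k + b];
            }
        }
    }
    h
}

fn main() {
    // f(x) = x0·x1 + x2²: two elements touching {0, 1} and {2}
    let elements = [
        Element { vars: vec![0, 1], hess: |_| vec![0.0, 1.0, 1.0, 0.0] },
        Element { vars: vec![2],    hess: |_| vec![2.0] },
    ];
    let h = assemble_hessian(3, &elements, &[1.0, 2.0, 3.0]);
    println!("{:?}", h); // [0,1,0, 1,0,0, 0,0,2]
}
```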

**Dependencies**: Chapters 2–3, 8–9.

---

## Part IV: Memory Management and Advanced Topics (Chapters 12–15)

### Chapter 12 — Reversal Schedules and Checkpointing

**Summary**: The fundamental memory management challenge of reverse mode and its solution via checkpointing. One of the most practically important chapters.

**Key Concepts**:
- **The problem**: Reverse mode needs all intermediate values, requiring O(l) memory for l operations. For long computations (e.g., time-stepping), this is prohibitive.
- **Definition (RS)**: Reversal Schedules — four atomic motions:
  1. **Advancing**: Evaluate forward without recording (move forward in time)
  2. **Recording**: Evaluate forward while recording to tape (expensive in memory)
  3. **Returning**: Move read head back to a checkpoint (no computation)
  4. **Reversing**: Execute one step of the adjoint (consumes recorded data)
- **Checkpointing strategy**: Store snapshots at selected time steps; re-evaluate segments between checkpoints as needed during the reverse sweep
- **Binomial checkpointing** (Griewank's Algorithm):
  - With c checkpoints and r reversals needed, optimal reach is β(c,r) = C(c+r, c) (binomial coefficient)
  - This gives logarithmic growth: l steps need only O(log l) checkpoints for O(l log l) total operations
  - Optimal in the class of offline schedules
  - Rule 21: Relative reversal cost grows like l^(1/c) — polynomial in l for fixed c
- **Online checkpointing**: When total number of steps l is not known in advance
  - Revolve algorithm and its online variants
  - Slightly suboptimal but practical for adaptive computations
- **Rule 19**: Joint reversals save memory; split reversals save operations
- **Rule 20**: All taping during call tree reversals occurs in LIFO fashion
- **Parallel checkpointing**: Distribute checkpoint storage and recomputation across processors
- **Multi-level checkpointing**: Different storage tiers (registers, RAM, disk) with different costs
- **Figures 12.1–12.8**: Diagrams of various checkpointing schedules
- **Tables 12.1–12.3**: Optimal checkpoint placement and operation counts
- **Call tree reversal**: Checkpointing applied to recursive/hierarchical program structure, not just flat time-stepping

**Implementation Relevance**: Critical for practical reverse mode:
- **Checkpoint manager**: Allocate, store, and retrieve checkpoints
- **Revolve algorithm**: Implement the binomial checkpointing schedule
- **Online variant**: Support unknown-length computations
- **Multi-level storage**: Abstract over memory tiers
- **Integration with tape**: Checkpoints interact with the taping system (Chapter 7)
- This is where the "engineering" of reverse mode really lives
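
Before implementing Revolve it helps to internalize the binomial reach. The sketch below evaluates β(c, r) = C(c+r, c) in exact integer arithmetic (the helper name is hypothetical) and shows how quickly modest checkpoint budgets cover very long step counts:

```rust
/// β(c, r) = C(c + r, c): the maximal number of steps reversible with c
/// checkpoints and r repeated forward sweeps (binomial checkpointing).
fn beta(c: u64, r: u64) -> u64 {
    let (n, k) = (c + r, c.min(r));
    // multiplicative binomial formula; each division is exact
    (1..=k).fold(1u64, |acc, i| acc * (n - k + i) / i)
}

fn main() {
    println!("β(10, 10) = {}", beta(10, 10)); // C(20, 10) = 184756 steps
    for c in [2u64, 4, 8, 16] {
        println!("c = r = {:2} → reach {}", c, beta(c, c));
    }
}
```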

**Dependencies**: Chapters 2–7.

---

### Chapter 13 — Taylor and Tensor Coefficients (pp. 301–340)

**Summary**: Higher-order derivative computation via Taylor arithmetic — a unified framework that extends first-order AD to arbitrary derivative order.

**Key Concepts**:
- **Taylor Coefficient Functions (Definition TC)**: For v(t) = Σ v_k t^k, the coefficients v_k encode k-th derivative information: v_k = (1/k!) d^k v/dt^k |_{t=0}
- **Definition (AI)**: Approximants of Intermediates — truncated Taylor series to degree d
- **Taylor arithmetic**: Extend each elemental operation to operate on truncated Taylor series (polynomials of degree d)
  - Addition: coefficient-wise
  - Multiplication: convolution (v·w)_k = Σ_{j=0}^{k} v_j · w_{k-j}
  - Division, transcendentals: recurrence relations derived from defining equations
  - Tables 13.1–13.4: Complete rules for all elementals in Taylor arithmetic
- **Rule 22**: Forward differentiation on scalars is encapsulated by extension to truncated Taylor series (Taylor arithmetic)
- **Univariate Taylor Propagation (UTP)**: The forward sweep in Taylor arithmetic
  - Cost: O(d²) × TIME(F) for degree-d Taylor coefficients of all intermediates
- **Higher-order Jacobians**: Proposition 13.3 — ∂y_j/∂x_i follows a shifted structure (A_{j-i})
  - Corollary 13.1: Extension operator E_d and differentiation operator D commute
- **Higher-Order Adjoints**: All adjoint vectors F̄_j for j < d obtained from one reverse sweep in Taylor arithmetic
  - Cost: (1+d)² × TIME(F)
  - Rule 23: What works for first derivatives yields higher derivatives by extension to Taylor arithmetic
- **ODE Integration via Taylor Series**:
  - For ẋ(t) = F(x(t)), coefficients satisfy x_{k+1} = F_k(x_0,...,x_k)/(k+1)
  - Recursive computation: cost O(d²) for d coefficients, O(d³/3) for d coefficients plus their Jacobian
  - **Sensitivity equation**: X'(t) = F'(x(t))X(t), with X_k recursion (eq 13.23)
- **Coefficient Doubling** (Table 13.7): Use Newton's method to compute d Taylor coefficients in O(log₂(d+2)) sweeps instead of d sweeps
  - Based on Corollary 13.2: Linearity in Higher Coefficients — y_k is linear in x_j for j > k/2
- **DAE Support**: Differential-algebraic equations 0 = F_k(z_0,...,z_k,(k+1)z_{k+1})
  - Block Hessenberg Jacobian structure (eq 13.27)
  - Requires consistent initialization

**Implementation Relevance**:
- **Taylor number type**: Polynomial coefficient arrays with arithmetic operations
- **UTP engine**: Forward propagation of Taylor coefficients through all elementals
- **Higher-order adjoint**: Reverse sweep on Taylor arithmetic for gradients of higher-order coefficients
- **ODE solver integration**: Taylor series method for ODE integration (arbitrarily high order)
- **Coefficient doubling**: Newton-based acceleration for long Taylor expansions
- This is a major feature differentiator — most AD libraries only do first order
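
The core of UTP is the truncated convolution plus per-elemental recurrences. The sketch below applies the ODE recurrence x_{k+1} = F_k(x_0,…,x_k)/(k+1) to ẋ = x² with x(0) = 1, whose solution 1/(1−t) has all Taylor coefficients equal to 1; `taylor_mul` is a hypothetical helper:

```rust
/// Truncated Taylor multiplication: (u·v)_k = Σ_{j=0..k} u_j · v_{k-j}.
fn taylor_mul(u: &[f64], v: &[f64]) -> Vec<f64> {
    let d = u.len().min(v.len());
    let mut w = vec![0.0; d];
    for k in 0..d {
        for j in 0..=k {
            w[k] += u[j] * v[k - j];
        }
    }
    w
}

fn main() {
    // ẋ = F(x) = x², x(0) = 1: x(t) = 1/(1−t), so x_k = 1 for all k.
    let d = 8;
    let mut x = vec![0.0; d + 1];
    x[0] = 1.0;
    for k in 0..d {
        // F_k depends only on x_0..x_k; recomputed from scratch for clarity
        let f = taylor_mul(&x[..=k], &x[..=k]);
        x[k + 1] = f[k] / (k as f64 + 1.0);
    }
    println!("{:?}", x); // ≈ [1.0, 1.0, …, 1.0]
}
```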

**Dependencies**: Chapters 2–7 (for the higher-order adjoint part, also Chapter 8).

---

### Chapter 14 — Differentiation without Differentiability (pp. 341–366)

**Summary**: Handling programs with branches, absolute values, min/max, and other nonsmooth operations. Extends AD to piecewise differentiable functions.

**Key Concepts**:
- **Definition (PD)**: Piecewise Differentiability — function is smooth on each piece of a finite partition of the domain, with pieces defined by "selection functions" (branch conditions)
- **Proposition 14.1** (Fischer's Result): On the interior of each piece's domain, derivatives are well-defined and computable by standard AD
- **Rule 24**: When program branches apply on open subdomains (generic case), AD yields correct derivatives
- **Four categories of nonsmooth elementals**:
  1. **Kinks**: |v|, min(u,v), max(u,v) — continuous but not differentiable at isolated points
  2. **Roots**: √u, arcsin(u), arccos(u) — bounded values but infinite derivatives at boundary
  3. **Steps**: sign(u), Heaviside function — discontinuous; encode branch selection
  4. **Poles**: 1/u, log(u), tan(u) — infinite values at boundary
- **Convention**: ±∞ × 0 = 0 = 0 × NaN (zero annihilates exceptional values)
- **Assumption (IC)**: Isolated Criticalities — critical points (where nonsmoothness occurs) are isolated on any line through the domain
- **Definition (SD)**: Stable Domain — subset where the same selection functions apply in a neighborhood
- **Proposition 14.2** / Rule 25: Functions given by evaluation procedures are almost everywhere (full-measure) either real analytic or stably undefined. The complement (where derivatives don't exist) has measure zero.
- **Laurent Numbers (Definition LN)**: Extended number system for tracking behavior near critical points
  - Leading exponent e_v, significant degree d_v, coefficient array
  - Special values: NaL (not-a-Laurent = NaN analogue), zero, log singularities
  - Tables 14.6–14.8: Arithmetic on Laurent numbers (addition, univariates, conditionals)
  - Proposition 14.3: Laurent arithmetic consistently propagates the Laurent model
- **Definition (RA)**: Regular Arc — parameterized curve along which function has well-defined Laurent expansion with finite determinacy degree d_*
- **Definition (FS)**: Finite Slopes — no roots or poles among critical elementals
- **Proposition 14.5**: Under finite slopes, the limiting Jacobian along a regular arc exists and belongs to Clarke's generalized gradient
- **Generalized Jacobians**: Clarke's generalized gradient ∂f(x) = conv{ lim ∇f(x_k) : x_k → x, with f differentiable at each x_k }

**Implementation Relevance**:
- **Abs/min/max support**: Must handle kink functions correctly, propagating derivatives from the active branch
- **NaN/Inf propagation**: Follow the convention ±∞ × 0 = 0
- **Branch tracking**: Record which branch was taken during evaluation; derivatives valid on the interior of each branch region
- **Laurent arithmetic** (optional advanced feature): For analyzing behavior near critical points
- **Generalized gradients**: Support for nonsmooth optimization (e.g., L1 regularization, ReLU in neural networks)
- Practically: most users hit this via abs, min, max, ReLU, and conditional branches
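
A sketch of kink handling in forward mode, reusing the dual-number shape from Chapter 3. At the kink itself the code returns the one-sided directional derivative along the tangent, which is one reasonable convention among several:

```rust
#[derive(Clone, Copy, Debug)]
struct Dual { val: f64, tan: f64 }

/// abs: away from 0 propagate the active branch; at 0 take the directional
/// derivative along ẋ, i.e. |ẋ| (a convention, not the only choice).
fn d_abs(u: Dual) -> Dual {
    let s = if u.val > 0.0 {
        1.0
    } else if u.val < 0.0 {
        -1.0
    } else {
        u.tan.signum()
    };
    Dual { val: u.val.abs(), tan: s * u.tan }
}

/// max: propagate from the active branch; ties broken toward u (a convention).
fn d_max(u: Dual, v: Dual) -> Dual {
    if u.val >= v.val { u } else { v }
}

fn main() {
    let x = Dual { val: 0.0, tan: 1.0 };
    // ReLU(x) = max(x, 0): at the kink, the directional derivative along +1 is 1
    println!("{:?}", d_max(x, Dual { val: 0.0, tan: 0.0 }));
    println!("{:?}", d_abs(x)); // |x| at 0 along ẋ = 1 has slope 1
}
```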

**Dependencies**: Chapters 2–3, conceptually Chapter 13 for Laurent numbers.

---

### Chapter 15 — Implicit and Iterative Differentiation (pp. 367–398)

**Summary**: Computing derivatives through implicit function definitions and iterative solvers (Newton's method, fixed-point iterations) without differentiating every iteration.

**Key Concepts**:
- **Implicit Function Theorem application**: Given F(z,x) = 0 defining z = z(x) implicitly:
  - **Implicit tangent**: ż_* = −F_z(z_*,x)^{−1} · F_x(z_*,x) · ẋ (eq 15.8)
  - **Implicit adjoint**: w̄_* = F_z(z_*,x)^{−T} · f_z(z_*,x)^T · ȳ (eq 15.9)
  - Avoids differentiating the solver iterations — only needs converged solution and Jacobian of the residual
- **Assumption (JR)**: Jacobian Regularity — F_z(z_*,x) is nonsingular at the solution
- **Reduced Jacobian**: dy/dx expressed as a generalized Schur complement (eq 15.5)
  - Two bracketing options: forward (eq 15.6) vs reverse (eq 15.7)
- **Direct/Adjoint Sensitivity Equations** (eqs 15.10–15.11):
  - Direct: F_z · Ż + F_x = 0 → solve for Ż
  - Adjoint: F_z^T · W̄ + ... = 0 → solve for W̄
- **Derivative Quality Criteria** (Lemma 15.1): Forward and reverse consistency check — if both forward and reverse derivatives agree, solution accuracy is confirmed
- **Corrected Function Estimate** (Corollary 15.1): Using adjoint information to correct function values, doubling convergence order
- **Two-Phase Approach**: First converge the iteration, then differentiate:
  - Phase 1: Run iteration to convergence
  - Phase 2: Apply implicit differentiation at converged point
- **Piggyback Approach**: Propagate derivatives alongside iteration:
  - Delayed piggyback: start derivative propagation partway through iteration
  - Coupled iteration of z and ż (eq 15.25)
  - Proposition 15.1: Derivative convergence factor ≤ ρ_* (same as iteration)
- **Assumption (GC)**: Global Contractivity — ||G'_k|| ≤ ρ < 1 (contraction mapping)
- **Newton's method specifics**: G_k(z,x) = z − F_z(z,x)^{−1} F(z,x)
  - Quadratic convergence of function → linear convergence of derivatives via piggyback
- **Preconditioner Deactivation**: Simplified derivative recurrence when preconditioner is frozen (eq 15.29)
  - Tables 15.1–15.2: Direct and adjoint fixed point iteration algorithms
- **Rule 26**: Fixed point iterations can and should be adjoined by recording only single steps (not the entire iteration history)
- **Second-Order Adjoint Fixed Point** (Table 15.3): Simultaneous convergence of z, ż, w̄, and ŵ (second-order adjoint)
- **Extended Computational Graph** (Figure 15.7): Visualization of the coupled iteration structure
- **Convergence rates**: All derivative orders converge at ρ_*^k with polynomial lag factor k^d for d-th order

**Implementation Relevance**:
- **Implicit differentiation**: Core feature for differentiating through linear/nonlinear solvers
  - `implicit_tangent(F, z_star, x, x_dot)` and `implicit_adjoint(F, z_star, x, y_bar)` operations
- **Fixed-point iteration differentiation**: Piggyback approach for iterative methods
  - Record single iteration step, not entire history (Rule 26)
- **Integration with linear algebra**: Needs linear solves (F_z^{-1} · rhs) for implicit diff
- **Newton solver wrapper**: Differentiate through Newton's method transparently
- Critical for: PDE-constrained optimization, training equilibrium models, differentiating physics simulations with implicit time stepping
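
The two-phase approach in its scalar form, as a sketch: Newton converges z in phase 1, and eq 15.8 collapses to a scalar division in phase 2. Function names are hypothetical, and a real implementation would replace the division by F_z with a linear solve:

```rust
/// Phase 1: solve F(z, x) = 0 for z by Newton's method.
fn newton_solve(f: impl Fn(f64, f64) -> f64, fz: impl Fn(f64, f64) -> f64,
                x: f64, mut z: f64) -> f64 {
    for _ in 0..50 {
        let step = f(z, x) / fz(z, x);
        z -= step;
        if step.abs() < 1e-14 { break; }
    }
    z
}

fn main() {
    // F(z, x) = z² − x defines z(x) = √x implicitly.
    let f  = |z: f64, x: f64| z * z - x;
    let fz = |z: f64, _x: f64| 2.0 * z; // ∂F/∂z
    let fx = |_z: f64, _x: f64| -1.0;   // ∂F/∂x

    let x = 2.0;
    let z = newton_solve(f, fz, x, 1.0);
    // Phase 2: implicit tangent ż = −F_z⁻¹ · F_x · ẋ (eq 15.8), with ẋ = 1
    let z_dot = -fx(z, x) / fz(z, x);
    println!("z = {z}, ż = {z_dot} (exact: {})", 0.5 / x.sqrt());
}
```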

**Dependencies**: Chapters 2–7, Chapter 8 for second-order.

---

## Appendix: Cross-Cutting Concerns

### Complete Rule Index

| Rule | Statement | Chapter |
|------|-----------|---------|
| 1 | Problem functions should be evaluated as compositions of elementals | 2 |
| 2 | Evaluate derivatives along with function values | 2 |
| 3 | To propagate tangents, accumulate the chain rule from right to left (forward) | 3 |
| 4 | To propagate adjoints, accumulate the chain rule from left to right (reverse) | 3 |
| 5 | Gradient by reverse mode costs ≤ 4× function evaluation | 3 |
| 6 | Separate linear and nonlinear operations in cost analysis | 4 |
| 7 | If function can be evaluated, so can its Jacobian–vector product (forward) | 5 |
| 8 | If function can be evaluated, so can its transpose-Jacobian–vector product (reverse) | 5 |
| 9 | Exploit structure (sparsity, symmetry) in Jacobian computation | 5 |
| 10 | Operator overloading is simplest AD implementation approach | 6 |
| 11 | Source transformation gives best performance | 6 |
| 12 | Taping with compression reduces memory overhead | 7 |
| 13 | Preaccumulation of straight-line segments reduces tape size | 7 |
| 14 | Split reversals for parallelism, joint reversals for memory | 7 |
| 15 | Hessian-vector products via forward-over-reverse | 8 |
| 16 | Nonincremental reverse makes graph symmetric for Hessians | 11 |
| 17 | Cheap gradient does NOT yield cheap Jacobians | 11 |
| 18 | Reverse beats forward only when nonlinear width >> nonlinear height | 11 |
| 19 | Joint reversals save memory, split reversals save operations | 12 |
| 20 | All taping during call tree reversals occurs in LIFO fashion | 12 |
| 21 | Relative reversal cost grows like l^(1/c) | 12 |
| 22 | Forward differentiation encapsulated in Taylor arithmetic | 13 |
| 23 | First-derivative techniques extend to higher derivatives via Taylor arithmetic | 13 |
| 24 | Program branches on open subdomains → AD yields useful derivatives | 14 |
| 25 | Functions from eval procedures are a.e. real analytic or stably undefined | 14 |
| 26 | Fixed point iterations adjoined by recording only single steps | 15 |

### Complete Formal Definitions and Assumptions

**Assumptions:**
| Code | Name | Page | Chapter |
|------|------|------|---------|
| ED | Elemental Differentiability | 23 | 2 |
| TA | Temporal Additivity of Task | 74 | 4 |
| EB | Elemental Task Boundedness | 78 | 4 |
| PC | Path Connectedness | 149 | 7 |
| IC | Isolated Criticalities | 348 | 14 |
| JR | Jacobian Regularity | 371 | 15 |
| GC | Global Contractivity | 378 | 15 |

**Definitions:**
| Code | Name | Page | Chapter |
|------|------|------|---------|
| MW | Matrix Width, Maximal Domain, Range Size | 149 | 7 |
| BP | Biclique Property | 206 | 10 |
| LP | Local Procedure and Jacobian | 222 | 9 |
| SC | Jacobian Dimension and Scarcity | 227 | 9 |
| SG | Symmetric Computational Graph | 238 | 11 |
| PS | Partial Separability | 252 | 11 |
| RS | Reversal Schedules | 280 | 12 |
| TC | Taylor Coefficient Functions | 304 | 13 |
| AI | Approximants of Intermediates | 304 | 13 |
| PD | Piecewise Differentiability | 342 | 14 |
| SD | Stable Domain | 348 | 14 |
| LN | Laurent Number | 351 | 14 |
| RA | Regular Arc and Determinacy Degree | 360 | 14 |
| FS | Finite Slopes | 361 | 14 |

---

## Suggested Implementation Phases

Based on the dependency structure and practical utility, here is a suggested phasing for echidna development:

### Phase 1: Core Foundation
**Chapters 2–3** — Evaluation procedure model, computational graph, forward mode (dual numbers), reverse mode (basic tape)

Deliverables:
- Dual number type with operator overloading
- Basic tape-based reverse mode
- Elemental function library (all operations from Tables 3.2–3.4)
- Jacobian-vector and vector-Jacobian products

### Phase 2: Production Reverse Mode
**Chapters 4, 7, 12** — Memory management, efficient taping, checkpointing

Deliverables:
- Optimized tape data structures (three-stack design)
- Preaccumulation for straight-line segments
- Binomial checkpointing (Revolve algorithm)
- Online checkpointing for unknown-length computations
- Memory-bounded reverse mode for large-scale problems

### Phase 3: Second-Order Derivatives
**Chapters 5, 8** — Jacobian propagation, Hessian-vector products, forward-over-reverse

Deliverables:
- Matrix-mode tangent and adjoint propagation
- Forward-over-reverse Hessian-vector products
- Second-order elemental rules
- Composable mode nesting

### Phase 4: Sparse Computation
**Chapters 9, 11** — Sparse Jacobians, compressed computation, graph coloring

Deliverables:
- Sparsity pattern detection (index domain/range propagation)
- Graph coloring for seed generation (distance-2, star, acyclic)
- Compressed Jacobian and Hessian computation
- Partial separability detection and exploitation
- CPR/NR seeding strategies

### Phase 5: Advanced Jacobian Accumulation
**Chapter 10** — Cross-country elimination, Markowitz heuristic

Deliverables:
- Vertex/edge/face elimination on computational graphs
- Markowitz-based elimination ordering
- Optimal Jacobian accumulation for small graphs

### Phase 6: Higher-Order Derivatives
**Chapter 13** — Taylor arithmetic, univariate Taylor propagation, higher-order adjoints

Deliverables:
- Taylor number type (truncated polynomial arithmetic)
- UTP engine for all elementals
- Higher-order adjoint computation
- ODE integration via Taylor series
- Coefficient doubling (Newton acceleration)

### Phase 7: Nonsmooth Extensions
**Chapter 14** — Piecewise differentiable functions, abs/min/max, generalized gradients

Deliverables:
- Correct handling of abs, min, max, ReLU, conditional branches
- Branch tracking and active set identification
- Generalized gradient computation (Clarke subdifferential)
- Optional: Laurent number arithmetic for singularity analysis

### Phase 8: Implicit and Iterative Differentiation
**Chapter 15** — Implicit function theorem, fixed-point iteration differentiation

Deliverables:
- Implicit tangent and adjoint computation
- Piggyback differentiation for iterative solvers
- Newton solver with transparent differentiation
- Second-order implicit derivatives
- Fixed-point iteration adjoint (single-step recording per Rule 26)

---

## Key Design Decisions for echidna

These cross-cutting decisions are resolved in [design-principles.md](design-principles.md). Summary:

1. **Numeric type**: Generic over `Float` trait (`f32`/`f64`)
2. **Tape representation**: SoA bytecode tape (opcodes, arg_indices, values, adjoints as separate contiguous arrays)
3. **Implementation approach**: Operator overloading via Rust traits (eager mode), with an optional deferred graph mode for advanced optimizations
4. **Memory allocation**: `bumpalo` arena for tapes, const-generic stack allocation for small tangent vectors
5. **Parallelism**: SIMD (intra-core via `wide`) + `rayon` (inter-core), coarse-grained parallelism only
6. **GPU**: `wgpu` (cross-platform) primary, `cudarc` (NVIDIA) optional, all behind feature flags with CPU fallback
7. **Linear algebra**: `nalgebra` for both fixed and dynamic sizes
8. **Error handling**: NaN propagation on hot paths (per Ch 14), `Result` on cold paths
9. **API**: Closure-based for common cases, builder pattern for advanced config, trait-based for library integration
10. **Composability**: Mode nesting via generic type composition (forward-over-reverse = `Dual<Tape<T>>`)
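
Decision 10 has the deepest type-system implications, so here is a self-contained sketch of nesting by generic composition. It demonstrates forward-over-forward (`Dual<Dual<f64>>`) because that fits in a few lines; `Dual<Tape<T>>` would rely on exactly the same trait-bound pattern, and everything here is illustrative rather than echidna's real types:

```rust
use std::ops::{Add, Mul};

/// Generic dual number over any scalar that itself supports + and ×.
#[derive(Clone, Copy, Debug)]
struct Dual<T> { val: T, tan: T }

impl<T: Add<Output = T> + Copy> Add for Dual<T> {
    type Output = Dual<T>;
    fn add(self, r: Self) -> Self {
        Dual { val: self.val + r.val, tan: self.tan + r.tan }
    }
}

impl<T: Add<Output = T> + Mul<Output = T> + Copy> Mul for Dual<T> {
    type Output = Dual<T>;
    fn mul(self, r: Self) -> Self {
        // product rule, generic in T: nesting falls out of the type system
        Dual { val: self.val * r.val,
               tan: self.tan * r.val + self.val * r.tan }
    }
}

fn main() {
    // f(x) = x³ at x = 2: inner level carries f', outer level carries f''.
    let x: Dual<Dual<f64>> = Dual {
        val: Dual { val: 2.0, tan: 1.0 }, // inner seed
        tan: Dual { val: 1.0, tan: 0.0 }, // outer seed
    };
    let y = x * x * x;
    println!("f = {}", y.val.val);   // 8
    println!("f' = {}", y.val.tan);  // 12
    println!("f'' = {}", y.tan.tan); // 12
}
```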