# Evaluating Derivatives — Comprehensive Chapter Breakdown
**Source**: *Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd Edition*
by Andreas Griewank and Andrea Walther (SIAM, 2008)
**Purpose**: Technical reference for building the **echidna** Rust library — a comprehensive implementation of algorithmic differentiation (AD) techniques.
---
## How to Read This Document
Each chapter section contains:
- **Summary**: What the chapter covers and why it matters
- **Key Concepts**: The core mathematical and algorithmic ideas
- **Implementation Relevance**: What this means for echidna's design
- **Dependencies**: What must be built before this chapter's features
The chapters build on each other. The dependency graph is roughly:
```
Ch 1-3 (Foundations) → Ch 4 (Forward Mode) → Ch 5 (Reverse Mode)
↓ ↓
Ch 6 (Complexity) Ch 7 (Taping)
↓ ↓
Ch 8 (2nd Order) ← ← ← ← ←┘
↓
┌───────────┼───────────────┐
↓ ↓ ↓
Ch 9 (Sparse) Ch 10 (Jacobian) Ch 11 (Hessians)
↓
Ch 12 (Checkpointing)
↓
Ch 13 (Taylor/Higher-Order)
↓
┌───────────────┼───────────┐
↓ ↓
Ch 14 (Nonsmooth) Ch 15 (Implicit)
```
---
## Part I: Basics (Chapters 1–4)
### Chapter 1 — Introduction (pp. 1–20)
**Summary**: Motivates AD by contrasting it with finite differences and symbolic differentiation. Introduces the fundamental idea: any numerical program computes a composition of elementary operations, and chain rule application to this composition yields exact derivatives at machine precision.
**Key Concepts**:
- **Truncation error** in finite differences: step size h creates O(h^p) error, but reducing h increases roundoff — no sweet spot
- **Expression swell** in symbolic differentiation: intermediate expressions grow exponentially
- **AD's niche**: exact derivatives (to machine precision) at bounded cost relative to function evaluation
- **Two modes**: forward (tangent) and reverse (adjoint), with fundamentally different cost profiles
- **Lighthouse example**: a running example used throughout the book — position of a light spot on a wall as a function of lighthouse rotation angle. Returns in Chapters 13, 14, and 15 with increasingly complex variants
**Implementation Relevance**: Establishes the core value proposition. The library must:
- Support both forward and reverse mode from the start
- Provide derivative accuracy at machine precision (not approximations)
- Scale to real-world numerical programs (not just toy expressions)
**Dependencies**: None — this is foundational motivation.
---
### Chapter 2 — A Framework for Evaluating Functions (pp. 21–48)
**Summary**: Defines the formal computational model that underpins all AD techniques. Introduces evaluation procedures, computational graphs, and the elemental function framework.
**Key Concepts**:
- **Evaluation Procedure**: A sequence of elementary operations transforming inputs x ∈ ℝ^n to outputs y ∈ ℝ^m via intermediates v_i
- Three-part structure: independent variables (v_{1-n}...v_0), intermediate variables (v_1...v_l), dependent variables (y_1...y_m)
- Each v_i = φ_i(v_j)_{j ≺ i} where φ_i is an elemental function
- **Assumption (ED)**: Elemental Differentiability — all elemental functions φ_i are at least once Lipschitz-continuously differentiable on open domains containing all argument values
- **Computational Graph**: DAG with vertices for variables and edges for direct dependencies
- Edge labels c_{ij} = ∂v_i/∂v_j are the local partial derivatives
- The graph encodes the complete data dependency structure
- **Extended Jacobian**: The (n+l) × (n+l) matrix L with entries c_{ij}, strictly lower triangular
- **Jacobian as Schur Complement**: F'(x) = R + T(I − L)^{−1}B where L captures internal dependencies, B input effects, T output effects, R direct input→output effects
- **Elemental functions**: unary (sin, cos, exp, log, sqrt, abs, ...) and binary (+, −, ×, /, pow, ...)
- **Table 2.2**: Key notation reference used throughout the book
**Implementation Relevance**: This chapter defines echidna's core data model:
- **`EvalProcedure`**: The central representation of a differentiable computation
- **`ComputationalGraph`**: DAG with typed nodes (input, intermediate, output) and weighted edges
- **Elemental operations**: Need a trait/enum for all supported elementary functions with their local derivatives
- The three-part variable structure (inputs, intermediates, outputs) shapes the API
**Dependencies**: None — this is the foundational data model.
---
### Chapter 3 — Fundamentals of Forward and Reverse (pp. 49–72)
**Summary**: Derives the forward and reverse mode formulas from chain rule applied to computational graphs. Establishes the fundamental complexity results.
**Key Concepts**:
- **Forward Mode (Tangent Propagation)**:
- v̇_i = Σ_{j ≺ i} c_{ij} · v̇_j (propagate input perturbation forward)
- Computes one column of Jacobian per seed direction ẋ
- Cost: O(n) × TIME(F) for full n×m Jacobian when seeding with identity
- Natural for n ≤ m (few inputs, many outputs)
- **Reverse Mode (Adjoint Propagation)**:
- v̄_j += v̄_i · c_{ij} for all i ≻ j (propagate output sensitivity backward)
- Computes one row of Jacobian per adjoint seed ȳ
- **Cheap Gradient Principle**: ∇f costs ≤ ω × TIME(F) where ω ≤ 4 (typically 2–3)
- Natural for m ≤ n (few outputs, many inputs) — the common case in optimization
- **Jacobian-Vector Products**: Forward mode computes F'(x)·ẋ; reverse mode computes ȳ^T·F'(x)
- **Vertex elimination interpretation**: Forward = eliminate from left; Reverse = eliminate from right
- **Table 3.2–3.4**: Forward and reverse propagation rules for all elemental operations
- **Incremental vs nonincremental**: Two implementation styles for local partial accumulation
**Implementation Relevance**: Core differentiation engine:
- **Forward mode**: Dual numbers with tangent component, seed vector propagation
- **Reverse mode**: Tape recording + backward sweep, adjoint accumulation
- Both modes need the same elemental derivative rules (Tables 3.2–3.4)
- The ω ≤ 4 bound for gradients is the key selling point for optimization users
**Dependencies**: Chapter 2 (evaluation procedure and graph model).
---
### Chapter 4 — Memory Issues and Complexity Bounds (pp. 73–100)
**Summary**: Analyzes the computational and memory complexity of AD in detail. Introduces the WORK vector model for tracking costs and addresses the key memory challenge of reverse mode.
**Key Concepts**:
- **WORK vector**: 4-component (MULT, NLMULT, ADD, NLADD) tracking multiplications and additions (linear vs nonlinear separately)
- **Assumption (TA)**: Temporal Additivity — total work is sum of elemental work
- **Assumption (EB)**: Elemental Task Boundedness — each elemental operation has bounded work
- **Temporal complexity**: Forward mode adds at most WORK(F) per direction; reverse mode at most 4× WORK(F) for gradient
- **Spatial complexity of reverse mode**: Must store all intermediate values (prevalues) for backward sweep
- Naive: O(l) storage where l = number of intermediates
- This motivates checkpointing (Chapter 12) and taping strategies (Chapter 7)
- **FLOP counting**: Precise operation counts for forward/reverse (Tables 4.1–4.5)
- **Data flow analysis**: Liveness analysis for reducing memory — only store values needed in the backward sweep
- **Homogeneity observation**: Temporal complexity of tangent and adjoint propagation are roughly equal per direction; the asymmetry is in how many directions are needed
**Implementation Relevance**:
- Memory management is critical — reverse mode's O(l) storage is the main engineering challenge
- Need efficient tape/trace data structures from the start
- WORK vector model useful for cost estimation and optimization heuristics
- Data flow / liveness analysis can dramatically reduce memory for reverse mode
**Dependencies**: Chapters 2–3.
---
## Part II: Extensions and Strategies (Chapters 5–8)
### Chapter 5 — Forward and Reverse Revisited (pp. 101–124)
**Summary**: Extends forward and reverse mode with Jacobian matrix propagation, vectorized operations, and combined forward-reverse strategies.
**Key Concepts**:
- **Jacobian Propagation**: Propagate entire Jacobian matrices V̇ (n×p tangent matrix) or W̄ (q×n adjoint matrix) instead of individual vectors
- Forward: computes F'(x) · S for seed matrix S
- Reverse: computes W · F'(x) for weight matrix W
- **Strip-mining**: Process multiple tangent/adjoint directions simultaneously for cache efficiency
- **Combined forward-reverse**: For intermediate Jacobians, can mix modes
- Forward first, then reverse on the accumulated result
- Or reverse first, then forward
- **Jacobian-free methods**: Iterative solvers (CG, GMRES) that only need Jacobian-vector products — AD provides these without ever forming the full Jacobian
- **Compression**: When Jacobian has known sparsity structure, can recover it from fewer directional derivatives than full dimension (detailed in Chapter 9)
- **Hessian-vector products**: Brief introduction — tangent-of-adjoint gives ∇²f · ẋ (detailed in Chapter 8)
**Implementation Relevance**:
- API should support both single-vector and matrix propagation modes
- Strip-mining / SIMD-friendly data layouts for vectorized tangent propagation
- Integration point for iterative linear algebra (CG, GMRES) that consumes Jv products
- Foundation for the compressed Jacobian computation in Chapter 9
**Dependencies**: Chapters 2–4.
---
### Chapter 6 — Implementation and Software (pp. 125–148)
**Summary**: Covers practical implementation approaches: operator overloading vs source transformation, and the software engineering of AD tools.
**Key Concepts**:
- **Operator Overloading (OO)**:
- Redefine arithmetic operators on a custom numeric type that records operations
- Advantages: easy to implement, works with existing code via type substitution
- Disadvantages: overhead per operation, limited compile-time optimization, control flow recorded at runtime
- Rust analogy: implement `Add`, `Mul`, etc. traits on an AD type
- **Source Transformation (ST)**:
- Generate new source code that computes both function and derivatives
- Advantages: full program analysis, optimal code generation, no runtime overhead
- Disadvantages: requires parser/compiler infrastructure, language-specific
- **Taping / Tracing** (OO approach):
- Record a "tape" of operations during forward evaluation
- Replay tape backward for reverse mode
- Tape stores: operation codes, argument indices, prevalues, partial derivatives
- **Activity analysis**: Identify which variables actually depend on inputs (active) vs constants — avoid differentiating dead code
- **Checkpointing** preview: Store selected snapshots to enable re-evaluation of segments (Chapter 12)
- **AD tools landscape**: ADIFOR, ADIC, TAF, Tapenade, ADOL-C, CppAD, etc.
- **Figure 6.1**: Taxonomy of AD implementation approaches
**Implementation Relevance**: Defines echidna's implementation strategy:
- Rust's trait system maps naturally to operator overloading approach
- Need efficient tape data structure (operation log, value storage)
- Activity analysis as an optimization pass
- Consider both eager (OO) and lazy (expression graph) evaluation strategies
- ADOL-C's design (C++ OO with tape) is the closest precedent
**Dependencies**: Chapters 2–5.
---
### Chapter 7 — Adjoint Code for Explicit Evaluations (pp. 149–186)
**Summary**: Details the implementation of reverse mode, including taping strategies, incremental vs nonincremental approaches, and practical adjoint code generation.
**Key Concepts**:
- **Assumption (PC)**: Path Connectedness — at most two arguments per elemental function (simplifies to binary DAG)
- **Definition (MW)**: Matrix Width — max simultaneous live values during sweep; determines memory for Jacobian propagation
- **Incremental Adjoint**: v̄_j += v̄_i · ∂φ_i/∂v_j — accumulate contributions as encountered
- **Nonincremental Adjoint**: v̄_j = Σ_{i ≻ j} v̄_i · c_{ij} — compute all at once from successors
- Nonincremental requires knowing the complete set of successors before computing
- **Taping detail**:
- LIFO (stack) organization: values recorded during forward, consumed during reverse
- Three tapes: real values, integer indices, operation codes
- Tape compression: don't record linear operations' prevalues (they're reconstructable)
- **Preaccumulation**: Compute local Jacobians of straight-line code segments before recording, reducing tape size
- **Subroutine handling**: Record/replay at subroutine boundaries; option to preaccumulate entire subroutines
- **Statement-level vs subroutine-level**: Granularity of taping
- **Split vs joint reversal**: Split = reverse each subroutine call independently; Joint = one big reverse
- Trade-off: joint saves memory but splits are more parallelizable
**Implementation Relevance**: Core reverse mode implementation:
- Tape design: three separate stacks (reals, indices, opcodes) for cache efficiency
- Preaccumulation optimization for straight-line code segments
- Support for both incremental and nonincremental adjoint styles
- Subroutine-level taping for composability
- Memory layout matters enormously for performance
**Dependencies**: Chapters 2–6.
---
### Chapter 8 — Second-Order Methods (pp. 187–218)
**Summary**: Computing second derivatives (Hessians) efficiently via combinations of forward and reverse mode.
**Key Concepts**:
- **Hessian-vector product** ∇²f(x) · ẋ via tangent-of-adjoint (forward-over-reverse):
1. Forward sweep recording tape
2. Reverse sweep computing gradient (adjoint)
3. Apply forward mode to the reverse sweep itself
- Cost: O(1) × TIME(F) per Hessian-vector product
- **Second-order adjoint**: The full formula for propagating second-order information
- Involves v̈ (second tangent), v̄ (adjoint), and w̄ (second-order adjoint) variables
- Equations 8.8–8.12: coupled system of first and second order
- **Hessian matrix computation**: Need n Hessian-vector products for full n×n Hessian, or exploit sparsity
- **Symmetry exploitation**: Hessian is symmetric, so only need upper/lower triangle
- **Living-on-the-edge**: Compute Hessian-vector products without forming intermediate Jacobians
- **Second-order forward mode**: Propagate (v, v̇, v̈, v̈̇) — four components per variable
- Computes ẋ^T ∇²f ẋ (scalar second directional derivative)
- Need p(p+1)/2 evaluations for full Hessian via distinct direction pairs
- **Tables 8.1–8.5**: Propagation rules for second-order tangent and adjoint for all elementals
- **Edge pushing**: Alternative Hessian computation by pushing second-order information along graph edges
**Implementation Relevance**:
- Forward-over-reverse (tangent-of-adjoint) is the primary Hessian-vector product method
- Need composable mode nesting: forward mode applied to reverse mode code
- Second-order elemental rules (Tables 8.1–8.5) extend the first-order tables
- Hessian-vector products are critical for Newton methods in optimization
**Dependencies**: Chapters 3–7.
---
## Part III: Sparse and Structured Jacobians (Chapters 9–11)
### Chapter 9 — Sparse Forward and Reverse (pp. 219–250)
**Summary**: Exploiting sparsity in Jacobians by propagating only the nonzero structure, avoiding unnecessary computation.
**Key Concepts**:
- **Index domains and ranges**: Track which input indices affect each intermediate (domain) and which outputs each intermediate affects (range)
- Forward: propagate index domains d_i ⊆ {1,...,n}
- Reverse: propagate index ranges r_j ⊆ {1,...,m}
- **Sparse tangent propagation**: Only compute/store nonzero entries of tangent vectors
- Use compressed storage (sparse vectors) instead of dense
- Skip computation when tangent is structurally zero
- **Sparse adjoint propagation**: Analogous with adjoint vectors
- **Structural analysis**: Determine Jacobian sparsity pattern without computing values
- Forward sweep of index sets gives column structure
- Reverse sweep of index sets gives row structure
- **Definition (LP)**: Local Procedure — a subset of the evaluation procedure with its own local Jacobian
- **Definition (SC)**: Jacobian Dimension and Scarcity — when the Jacobian has s << n×m nonzeros
- **Matrix Compression** (CPR seeding):
- **Column compression**: If Jacobian columns can be grouped such that no two columns in a group share a nonzero row, seed with p < n directions and recover full Jacobian
- **Row compression**: Analogous for rows using adjoint mode
- **Combined**: CPR (Column-Partial-Row) — use both forward and reverse with compression
- Equivalent to **graph coloring**: column compression ↔ distance-2 coloring of column intersection graph
- NR (Newsam-Ramsdell) seeding: exploit known sparsity pattern for optimal seeding
- **Figures 9.1–9.4**: Visualization of sparse structures and compression
**Implementation Relevance**:
- Sparse Jacobian computation is essential for large-scale problems
- Need sparse vector/matrix data structures with efficient merge operations
- Index set propagation for structural analysis (precursor to numerical computation)
- Graph coloring algorithms for optimal seed generation
- CPR/NR seeding strategies as built-in optimization
**Dependencies**: Chapters 2–7.
---
### Chapter 10 — Jacobian Accumulation and Elimination (pp. 251–274)
**Summary**: Cross-country elimination — computing Jacobians by eliminating vertices from the computational graph in non-standard orders (neither pure forward nor pure reverse).
**Key Concepts**:
- **Cross-country elimination**: Eliminate vertices from computational graph in any order, not just left-to-right (forward) or right-to-left (reverse)
- Each elimination of vertex v_i creates fill-in edges: for each predecessor j and successor k of v_i, add edge (k,j) with weight c_{kj} += c_{ki} · c_{ij}
- Total Jacobian = product of all local eliminations
- **Vertex elimination**: Remove a vertex, connecting all its predecessors to all successors with appropriate weights
- **Edge elimination**: More fine-grained — eliminate individual edges
- **Face elimination**: Most fine-grained — eliminate individual "faces" (predecessor-vertex-successor triples)
- **Definition (BP)**: Biclique Property — structural condition for equivalence of elimination approaches
- **NP-hardness**: Optimal elimination order (minimum total multiplications) is NP-hard
- Even for vertex elimination alone
- **Markowitz heuristic**: Greedily eliminate vertex with minimum (in-degree × out-degree) product
- Practical and effective, inspired by sparse LU factorization
- **Simulated annealing** and other metaheuristics for better elimination orders
- **Figures 10.1–10.8**: Detailed examples of vertex, edge, and face elimination on computational graphs
- **Jacobian accumulation as matrix product**: The accumulated Jacobian is equivalent to a specific sequence of sparse matrix multiplications
**Implementation Relevance**:
- Cross-country elimination as an optimization for Jacobian computation
- Markowitz heuristic implementation for elimination ordering
- Graph transformation operations (vertex/edge/face elimination with fill-in)
- Useful for small-to-medium computational graphs where optimal ordering matters
- Less critical than forward/reverse for initial implementation
**Dependencies**: Chapters 2–3, Chapter 9.
---
### Chapter 11 — Hessians, Higher Derivatives, and Structured Methods (pp. 275–300)
**Summary**: Efficient computation of Hessians exploiting structure: partial separability, symmetric computational graphs, and connections to Chapter 8's second-order methods.
**Key Concepts**:
- **Definition (SG)**: Symmetric Computational Graph — graph where forward and reverse sweeps have identical structure
- Rule 16: Nonincremental reverse mode naturally produces a symmetric graph
- This symmetry is key for efficient Hessian computation
- **Hessian Product Form**: ∇²f = Ź^T D Ź where Ź is the extended tangent matrix and D is diagonal
- Computed by one forward sweep (getting Ź and D) + matrix multiply
- D contains second derivatives of elemental functions
- **Definition (PS)**: Partial Separability — f(x) = Σ f_i(U_i x) where each f_i depends on few variables (U_i selects a subset)
- **Value separability**: Direct sum decomposition of function value
- **Argument separability**: Each component depends on few input variables
- Hessian of partially separable function is sum of low-rank terms
- **Rule 17**: Cheap gradient principle does NOT extend to cheap Jacobians
- Full m×n Jacobian always requires min(m,n) sweeps, no constant-factor bound
- **Rule 18**: Reverse beats forward for Jacobians only when nonlinear matrix width >> nonlinear height
- **Coloring for Hessians**: Exploit symmetry to reduce number of Hessian-vector products
- Star coloring, acyclic coloring for direct/indirect recovery
- Fewer colors needed than for general Jacobians due to symmetry
- **Tables 11.1–11.3**: Complexity comparisons for various Hessian computation strategies
**Implementation Relevance**:
- Partial separability detection and exploitation — common in optimization (each constraint involves few variables)
- Hessian product form (Ź^T D Ź) as an efficient representation
- Star/acyclic coloring for compressed Hessian computation
- Symmetric graph construction for Hessian-specific optimizations
**Dependencies**: Chapters 2–3, 8–9.
---
## Part IV: Memory Management and Advanced Topics (Chapters 12–15)
### Chapter 12 — Reversal Schedules and Checkpointing (pp. 301–340)
**Summary**: The fundamental memory management challenge of reverse mode and its solution via checkpointing. One of the most practically important chapters.
**Key Concepts**:
- **The problem**: Reverse mode needs all intermediate values, requiring O(l) memory for l operations. For long computations (e.g., time-stepping), this is prohibitive.
- **Definition (RS)**: Reversal Schedules — four atomic motions:
1. **Advancing**: Evaluate forward without recording (move forward in time)
2. **Recording**: Evaluate forward while recording to tape (expensive in memory)
3. **Returning**: Move read head back to a checkpoint (no computation)
4. **Reversing**: Execute one step of the adjoint (consumes recorded data)
- **Checkpointing strategy**: Store snapshots at selected time steps; re-evaluate segments between checkpoints as needed during the reverse sweep
- **Binomial checkpointing** (Griewank's Algorithm):
- With c checkpoints and r reversals needed, optimal reach is β(c,r) = C(c+r, c) (binomial coefficient)
- This gives logarithmic growth: l steps need only O(log l) checkpoints for O(l log l) total operations
- Optimal in the class of offline schedules
- Rule 21: Relative reversal cost grows like l^(1/c) — polynomial in l for fixed c
- **Online checkpointing**: When total number of steps l is not known in advance
- Revolve algorithm and its online variants
- Slightly suboptimal but practical for adaptive computations
- **Rule 19**: Joint reversals save memory; split reversals save operations
- **Rule 20**: All taping during call tree reversals occurs in LIFO fashion
- **Parallel checkpointing**: Distribute checkpoint storage and recomputation across processors
- **Multi-level checkpointing**: Different storage tiers (registers, RAM, disk) with different costs
- **Figures 12.1–12.8**: Diagrams of various checkpointing schedules
- **Tables 12.1–12.3**: Optimal checkpoint placement and operation counts
- **Call tree reversal**: Checkpointing applied to recursive/hierarchical program structure, not just flat time-stepping
**Implementation Relevance**: Critical for practical reverse mode:
- **Checkpoint manager**: Allocate, store, and retrieve checkpoints
- **Revolve algorithm**: Implement the binomial checkpointing schedule
- **Online variant**: Support unknown-length computations
- **Multi-level storage**: Abstract over memory tiers
- **Integration with tape**: Checkpoints interact with the taping system (Chapter 7)
- This is where the "engineering" of reverse mode really lives
**Dependencies**: Chapters 2–7.
---
### Chapter 13 — Taylor and Tensor Coefficients (pp. 301–340)
**Summary**: Higher-order derivative computation via Taylor arithmetic — a unified framework that extends first-order AD to arbitrary derivative order.
**Key Concepts**:
- **Taylor Coefficient Functions (Definition TC)**: For v(t) = Σ v_k t^k, the coefficients v_k encode k-th derivative information: v_k = (1/k!) d^k v/dt^k |_{t=0}
- **Definition (AI)**: Approximants of Intermediates — truncated Taylor series to degree d
- **Taylor arithmetic**: Extend each elemental operation to operate on truncated Taylor series (polynomials of degree d)
- Addition: coefficient-wise
- Multiplication: convolution (v·w)_k = Σ_{j=0}^{k} v_j · w_{k-j}
- Division, transcendentals: recurrence relations derived from defining equations
- Tables 13.1–13.4: Complete rules for all elementals in Taylor arithmetic
- **Rule 22**: Forward differentiation on scalars is encapsulated by extension to truncated Taylor series (Taylor arithmetic)
- **Univariate Taylor Propagation (UTP)**: The forward sweep in Taylor arithmetic
- Cost: O(d²) × TIME(F) for degree-d Taylor coefficients of all intermediates
- **Higher-order Jacobians**: Proposition 13.3 — ∂y_j/∂x_i follows a shifted structure (A_{j-i})
- Corollary 13.1: Extension operator E_d and differentiation operator D commute
- **Higher-Order Adjoints**: All adjoint vectors F̄_j for j < d obtained from one reverse sweep in Taylor arithmetic
- Cost: (1+d)² × TIME(F)
- Rule 23: What works for first derivatives yields higher derivatives by extension to Taylor arithmetic
- **ODE Integration via Taylor Series**:
- For ẋ(t) = F(x(t)), coefficients satisfy x_{k+1} = F_k(x_0,...,x_k)/(k+1)
- Recursive computation: cost O(d²) for d coefficients, O(d³/3) for d coefficients plus their Jacobian
- **Sensitivity equation**: X'(t) = F'(x(t))X(t), with X_k recursion (eq 13.23)
- **Coefficient Doubling** (Table 13.7): Use Newton's method to compute d Taylor coefficients in O(log₂(d+2)) sweeps instead of d sweeps
- Based on Corollary 13.2: Linearity in Higher Coefficients — y_k is linear in x_j for j > k/2
- **DAE Support**: Differential-algebraic equations 0 = F_k(z_0,...,z_k,(k+1)z_{k+1})
- Block Hessenberg Jacobian structure (eq 13.27)
- Requires consistent initialization
**Implementation Relevance**:
- **Taylor number type**: Polynomial coefficient arrays with arithmetic operations
- **UTP engine**: Forward propagation of Taylor coefficients through all elementals
- **Higher-order adjoint**: Reverse sweep on Taylor arithmetic for gradients of higher-order coefficients
- **ODE solver integration**: Taylor series method for ODE integration (high-order, A-stable)
- **Coefficient doubling**: Newton-based acceleration for long Taylor expansions
- This is a major feature differentiator — most AD libraries only do first order
**Dependencies**: Chapters 2–7 (for the higher-order adjoint part, also Chapter 8).
---
### Chapter 14 — Differentiation without Differentiability (pp. 341–366)
**Summary**: Handling programs with branches, absolute values, min/max, and other nonsmooth operations. Extends AD to piecewise differentiable functions.
**Key Concepts**:
- **Definition (PD)**: Piecewise Differentiability — function is smooth on each piece of a finite partition of the domain, with pieces defined by "selection functions" (branch conditions)
- **Proposition 14.1** (Fischer's Result): On the interior of each piece's domain, derivatives are well-defined and computable by standard AD
- **Rule 24**: When program branches apply on open subdomains (generic case), AD yields correct derivatives
- **Four categories of nonsmooth elementals**:
1. **Kinks**: |v|, min(u,v), max(u,v) — continuous but not differentiable at isolated points
2. **Roots**: √u, arcsin(u), arccos(u) — bounded values but infinite derivatives at boundary
3. **Steps**: sign(u), Heaviside function — discontinuous; encode branch selection
4. **Poles**: 1/u, log(u), tan(u) — infinite values at boundary
- **Convention**: ±∞ × 0 = 0 = 0 × NaN (zero annihilates exceptional values)
- **Assumption (IC)**: Isolated Criticalities — critical points (where nonsmoothness occurs) are isolated on any line through the domain
- **Definition (SD)**: Stable Domain — subset where the same selection functions apply in a neighborhood
- **Proposition 14.2** / Rule 25: Functions given by evaluation procedures are almost everywhere (full-measure) either real analytic or stably undefined. The complement (where derivatives don't exist) has measure zero.
- **Laurent Numbers (Definition LN)**: Extended number system for tracking behavior near critical points
- Leading exponent e_v, significant degree d_v, coefficient array
- Special values: NaL (not-a-Laurent = NaN analogue), zero, log singularities
- Tables 14.6–14.8: Arithmetic on Laurent numbers (addition, univariates, conditionals)
- Proposition 14.3: Laurent arithmetic consistently propagates the Laurent model
- **Definition (RA)**: Regular Arc — parameterized curve along which function has well-defined Laurent expansion with finite determinacy degree d_*
- **Definition (FS)**: Finite Slopes — no roots or poles among critical elementals
- **Proposition 14.5**: Under finite slopes, the limiting Jacobian along a regular arc exists and belongs to Clarke's generalized gradient
- **Generalized Jacobians**: Clarke's generalized gradient ∂f = conv{g | g = lim ∇f(x_k), x_k → x}
**Implementation Relevance**:
- **Abs/min/max support**: Must handle kink functions correctly, propagating derivatives from the active branch
- **NaN/Inf propagation**: Follow the convention ±∞ × 0 = 0
- **Branch tracking**: Record which branch was taken during evaluation; derivatives valid on the interior of each branch region
- **Laurent arithmetic** (optional advanced feature): For analyzing behavior near critical points
- **Generalized gradients**: Support for nonsmooth optimization (e.g., L1 regularization, ReLU in neural networks)
- Practically: most users hit this via abs, min, max, ReLU, and conditional branches
**Dependencies**: Chapters 2–3, conceptually Chapter 13 for Laurent numbers.
---
### Chapter 15 — Implicit and Iterative Differentiation (pp. 367–398)
**Summary**: Computing derivatives through implicit function definitions and iterative solvers (Newton's method, fixed-point iterations) without differentiating every iteration.
**Key Concepts**:
- **Implicit Function Theorem application**: Given F(z,x) = 0 defining z = z(x) implicitly:
- **Implicit tangent**: ż_* = −F_z(z_*,x)^{−1} · F_x(z_*,x) · ẋ (eq 15.8)
- **Implicit adjoint**: w̄_* = F_z(z_*,x)^{−T} · f_z(z_*,x)^T · ȳ (eq 15.9)
- Avoids differentiating the solver iterations — only needs converged solution and Jacobian of the residual
- **Assumption (JR)**: Jacobian Regularity — F_z(z_*,x) is nonsingular at the solution
- **Reduced Jacobian**: dy/dx expressed as a generalized Schur complement (eq 15.5)
- Two bracketing options: forward (eq 15.6) vs reverse (eq 15.7)
- **Direct/Adjoint Sensitivity Equations** (eqs 15.10–15.11):
- Direct: F_z · Ż + F_x = 0 → solve for Ż
- Adjoint: F_z^T · W̄ + ... = 0 → solve for W̄
- **Derivative Quality Criteria** (Lemma 15.1): Forward and reverse consistency check — if both forward and reverse derivatives agree, solution accuracy is confirmed
- **Corrected Function Estimate** (Corollary 15.1): Using adjoint information to correct function values, doubling convergence order
- **Two-Phase Approach**: First converge the iteration, then differentiate:
- Phase 1: Run iteration to convergence
- Phase 2: Apply implicit differentiation at converged point
- **Piggyback Approach**: Propagate derivatives alongside iteration:
- Delayed piggyback: start derivative propagation partway through iteration
- Coupled iteration of z and ż (eq 15.25)
- Proposition 15.1: Derivative convergence factor ≤ ρ_* (same as iteration)
- **Assumption (GC)**: Global Contractivity — ||G'_k|| ≤ ρ < 1 (contraction mapping)
- **Newton's method specifics**: G_k(z,x) = z − F_z(z,x)^{−1} F(z,x)
- Quadratic convergence of function → linear convergence of derivatives via piggyback
- **Preconditioner Deactivation**: Simplified derivative recurrence when preconditioner is frozen (eq 15.29)
- Tables 15.1–15.2: Direct and adjoint fixed point iteration algorithms
- **Rule 26**: Fixed point iterations can and should be adjoined by recording only single steps (not the entire iteration history)
- **Second-Order Adjoint Fixed Point** (Table 15.3): Simultaneous convergence of z, ż, w̄, and ŵ (second-order adjoint)
- **Extended Computational Graph** (Figure 15.7): Visualization of the coupled iteration structure
- **Convergence rates**: All derivative orders converge at ρ_*^k with polynomial lag factor k^d for d-th order
**Implementation Relevance**:
- **Implicit differentiation**: Core feature for differentiating through linear/nonlinear solvers
- `implicit_tangent(F, z_star, x, x_dot)` and `implicit_adjoint(F, z_star, x, y_bar)` operations
- **Fixed-point iteration differentiation**: Piggyback approach for iterative methods
- Record single iteration step, not entire history (Rule 26)
- **Integration with linear algebra**: Needs linear solves (F_z^{-1} · rhs) for implicit diff
- **Newton solver wrapper**: Differentiate through Newton's method transparently
- Critical for: PDE-constrained optimization, training equilibrium models, differentiating physics simulations with implicit time stepping
**Dependencies**: Chapters 2–7, Chapter 8 for second-order.
---
## Appendix: Cross-Cutting Concerns
### Complete Rule Index
| 1 | Problem functions should be evaluated as compositions of elementals | 2 |
| 2 | Evaluate derivatives along with function values | 2 |
| 3 | To propagate tangents, accumulate the chain rule from right to left (forward) | 3 |
| 4 | To propagate adjoints, accumulate the chain rule from left to right (reverse) | 3 |
| 5 | Gradient by reverse mode costs ≤ 4× function evaluation | 3 |
| 6 | Separate linear and nonlinear operations in cost analysis | 4 |
| 7 | If function can be evaluated, so can its Jacobian–vector product (forward) | 5 |
| 8 | If function can be evaluated, so can its transpose-Jacobian–vector product (reverse) | 5 |
| 9 | Exploit structure (sparsity, symmetry) in Jacobian computation | 5 |
| 10 | Operator overloading is simplest AD implementation approach | 6 |
| 11 | Source transformation gives best performance | 6 |
| 12 | Taping with compression reduces memory overhead | 7 |
| 13 | Preaccumulation of straight-line segments reduces tape size | 7 |
| 14 | Split reversals for parallelism, joint reversals for memory | 7 |
| 15 | Hessian-vector products via forward-over-reverse | 8 |
| 16 | Nonincremental reverse makes graph symmetric for Hessians | 11 |
| 17 | Cheap gradient does NOT yield cheap Jacobians | 11 |
| 18 | Reverse beats forward only when nonlinear width >> nonlinear height | 11 |
| 19 | Joint reversals save memory, split reversals save operations | 12 |
| 20 | All taping during call tree reversals occurs in LIFO fashion | 12 |
| 21 | Relative reversal cost grows like l^(1/c) | 12 |
| 22 | Forward differentiation encapsulated in Taylor arithmetic | 13 |
| 23 | First-derivative techniques extend to higher derivatives via Taylor arithmetic | 13 |
| 24 | Program branches on open subdomains → AD yields useful derivatives | 14 |
| 25 | Functions from eval procedures are a.e. real analytic or stably undefined | 14 |
| 26 | Fixed point iterations adjoined by recording only single steps | 15 |
### Complete Formal Definitions and Assumptions
**Assumptions:**
| ED | Elemental Differentiability | 23 | 2 |
| TA | Temporal Additivity of Task | 74 | 4 |
| EB | Elemental Task Boundedness | 78 | 4 |
| PC | Path Connectedness | 149 | 7 |
| IC | Isolated Criticalities | 348 | 14 |
| JR | Jacobian Regularity | 371 | 15 |
| GC | Global Contractivity | 378 | 15 |
**Definitions:**
| MW | Matrix Width, Maximal Domain, Range Size | 149 | 7 |
| BP | Biclique Property | 206 | 10 |
| LP | Local Procedure and Jacobian | 222 | 9 |
| SC | Jacobian Dimension and Scarcity | 227 | 9 |
| SG | Symmetric Computational Graph | 238 | 11 |
| PS | Partial Separability | 252 | 11 |
| RS | Reversal Schedules | 280 | 12 |
| TC | Taylor Coefficient Functions | 304 | 13 |
| AI | Approximants of Intermediates | 304 | 13 |
| PD | Piecewise Differentiability | 342 | 14 |
| SD | Stable Domain | 348 | 14 |
| LN | Laurent Number | 351 | 14 |
| RA | Regular Arc and Determinacy Degree | 360 | 14 |
| FS | Finite Slopes | 361 | 14 |
---
## Suggested Implementation Phases
Based on the dependency structure and practical utility, here is a suggested phasing for echidna development:
### Phase 1: Core Foundation
**Chapters 2–3** — Evaluation procedure model, computational graph, forward mode (dual numbers), reverse mode (basic tape)
Deliverables:
- Dual number type with operator overloading
- Basic tape-based reverse mode
- Elemental function library (all operations from Tables 3.2–3.4)
- Jacobian-vector and vector-Jacobian products
### Phase 2: Production Reverse Mode
**Chapters 4, 7, 12** — Memory management, efficient taping, checkpointing
Deliverables:
- Optimized tape data structures (three-stack design)
- Preaccumulation for straight-line segments
- Binomial checkpointing (Revolve algorithm)
- Online checkpointing for unknown-length computations
- Memory-bounded reverse mode for large-scale problems
### Phase 3: Second-Order Derivatives
**Chapters 5, 8** — Jacobian propagation, Hessian-vector products, forward-over-reverse
Deliverables:
- Matrix-mode tangent and adjoint propagation
- Forward-over-reverse Hessian-vector products
- Second-order elemental rules
- Composable mode nesting
### Phase 4: Sparse Computation
**Chapters 9, 11** — Sparse Jacobians, compressed computation, graph coloring
Deliverables:
- Sparsity pattern detection (index domain/range propagation)
- Graph coloring for seed generation (distance-2, star, acyclic)
- Compressed Jacobian and Hessian computation
- Partial separability detection and exploitation
- CPR/NR seeding strategies
### Phase 5: Advanced Jacobian Accumulation
**Chapter 10** — Cross-country elimination, Markowitz heuristic
Deliverables:
- Vertex/edge/face elimination on computational graphs
- Markowitz-based elimination ordering
- Optimal Jacobian accumulation for small graphs
### Phase 6: Higher-Order Derivatives
**Chapter 13** — Taylor arithmetic, univariate Taylor propagation, higher-order adjoints
Deliverables:
- Taylor number type (truncated polynomial arithmetic)
- UTP engine for all elementals
- Higher-order adjoint computation
- ODE integration via Taylor series
- Coefficient doubling (Newton acceleration)
### Phase 7: Nonsmooth Extensions
**Chapter 14** — Piecewise differentiable functions, abs/min/max, generalized gradients
Deliverables:
- Correct handling of abs, min, max, ReLU, conditional branches
- Branch tracking and active set identification
- Generalized gradient computation (Clarke subdifferential)
- Optional: Laurent number arithmetic for singularity analysis
### Phase 8: Implicit and Iterative Differentiation
**Chapter 15** — Implicit function theorem, fixed-point iteration differentiation
Deliverables:
- Implicit tangent and adjoint computation
- Piggyback differentiation for iterative solvers
- Newton solver with transparent differentiation
- Second-order implicit derivatives
- Fixed-point iteration adjoint (single-step recording per Rule 26)
---
## Key Design Decisions for echidna
These cross-cutting decisions are resolved in [design-principles.md](design-principles.md). Summary:
1. **Numeric type**: Generic over `Float` trait (`f32`/`f64`)
2. **Tape representation**: SoA bytecode tape (opcodes, arg_indices, values, adjoints as separate contiguous arrays)
3. **Implementation approach**: Operator overloading via Rust traits (eager mode), with optional deferred graph mode for advanced optimisations
4. **Memory allocation**: `bumpalo` arena for tapes, const-generic stack allocation for small tangent vectors
5. **Parallelism**: SIMD (intra-core via `wide`) + `rayon` (inter-core), coarse-grained parallelism only
6. **GPU**: `wgpu` (cross-platform) primary, `cudarc` (NVIDIA) optional, all behind feature flags with CPU fallback
7. **Linear algebra**: `nalgebra` for both fixed and dynamic sizes
8. **Error handling**: NaN propagation on hot paths (per Ch 14), `Result` on cold paths
9. **API**: Closure-based for common cases, builder pattern for advanced config, trait-based for library integration
10. **Composability**: Mode nesting via generic type composition (forward-over-reverse = `Dual<Tape<T>>`)