computable 0.1.0

# Experiment Results: Refinement Coordinator Optimization

## Branches

| Experiment | Branch | Key idea |
|---|---|---|
| Round-based + early exit + exhaustion | `experiment/demand-driven-refinement` | Skip exhausted refiners, check precision after each update |
| Demand propagation | `experiment/demand-propagation` | Also skip refiners whose bounds are already narrow enough |

## Summary

Extended the lock-step refinement coordinator with two improvements:
1. **Early exit**: check precision after each individual update (not after all refiners complete)
2. **Per-refiner exhaustion**: track and skip exhausted refiners instead of failing

This preserves the demand pacing that prevents precision overshoot (the fatal
flaw of the event-driven experiment) while addressing the lock-step model's
limitations.

## Architecture

### Before (lock-step)

```
loop {
    check precision → return if met
    check iteration cap → error if exceeded
    send Step to ALL refiners
    wait for ALL N responses
    apply ALL updates
}
```

Limitations:
- No early exit within rounds: waits for ALL refiners before checking precision
- Any refiner error kills the entire refinement

### After (round-based with early exit + exhaustion tracking)

```
loop {
    check precision → return if met
    count active refiners → if 0, return exhaustion error
    check iteration cap → error if exceeded
    send Step to ACTIVE refiners only
    for each response:
        apply update
        check precision → return if met (EARLY EXIT)
        if Exhausted → mark inactive, track reason
        if Error → return error
}
```

### Key types added

```rust
enum ExhaustionReason {
    Converged,      // refine_step() returned Ok(false)
    StateUnchanged, // refine_step() returned Err(StateUnchanged)
}

enum RefinerMessage {
    Update(NodeUpdate),
    Exhausted { update: NodeUpdate, reason: ExhaustionReason },
    Error(ComputableError),
}
```

### Exhaustion error selection

When all refiners exhaust before precision is met:
- If ALL were `StateUnchanged` → return `StateUnchanged`
- Otherwise → return `MaxRefinementIterations`

### BaseOp convergence fix

`BaseOp::refine_step` now checks if bounds are a point after refine succeeds:
```rust
fn refine_step(&self) -> Result<bool, ComputableError> {
    self.base.refine()?;
    let bounds = BoundsAccess::get_bounds(self.base.as_ref())?;
    if bounds.small() == &bounds.large() {
        return Ok(false);  // converged
    }
    Ok(true)
}
```

This is needed because `TypedBaseNode::refine()` returns `Ok(())` (not error)
for constants and already-converged values, causing the old code to loop forever
on converged base nodes.

### Changes made

- `src/refinement.rs`: New enums, rewritten coordinator + refiner_loop, updated docs
- `src/ops/base.rs`: Convergence detection in refine_step

## Benchmark Results

Back-to-back comparison on the same machine, release mode (`cargo run -p computable-benchmarks --release`).
Baseline = merge-base commit `f6f97e7` (lock-step, before any changes).

### Main benchmarks

| Benchmark | Baseline (lock-step) | This branch | Ratio |
|-----------|---------------------|-------------|-------|
| Complex expression (5000 samples) | 78.3ms | 79.6ms | 1.02x |
| Summation (200k samples) | 319.2ms | 309.4ms | 0.97x |
| Integer roots (1000 samples) | 438.2ms | 356.8ms | **0.81x (19% faster)** |
| Inverse (100 samples, 256-bit) | 12.2ms | 9.2ms | **0.75x (25% faster)** |
| Sine (100 samples, 128-bit) | 2,196ms | 408.5ms | **0.19x (5.4x faster)** |

### Pi refinement (refine_to)

| Precision | Baseline | This branch | Ratio |
|-----------|----------|-------------|-------|
| 32 bits | 21.9µs | 23.8µs | 1.09x |
| 64 bits | 92.7µs | 92.7µs | 1.00x |
| 128 bits | 191.0µs | 196.5µs | 1.03x |
| 256 bits | 388.4µs | 413.9µs | 1.07x |
| 512 bits | 866.3µs | 904.6µs | 1.04x |
| 1024 bits | 2,140µs | 2,217µs | 1.04x |

### Pi bounds (direct computation)

| Precision | Baseline | This branch | Ratio |
|-----------|----------|-------------|-------|
| 32 bits | 20.0µs | 21.6µs | 1.08x |
| 64 bits | 37.3µs | 40.0µs | 1.07x |
| 128 bits | 72.9µs | 82.2µs | 1.13x |
| 256 bits | 159.4µs | 171.7µs | 1.08x |
| 512 bits | 372.3µs | 390.3µs | 1.05x |
| 1024 bits | 1,031µs | 977.8µs | 0.95x |

### Pi arithmetic

| Operation | Baseline | This branch | Ratio |
|-----------|----------|-------------|-------|
| 2π | 132.0µs | 122.7µs | 0.93x |
| π/2 | 127.5µs | 122.1µs | 0.96x |
| π² | 201.2µs | 147.3µs | 0.73x |
| 1/π | 336.3µs | 219.5µs | 0.65x |

### High-precision pi

| Precision | Baseline | This branch | Ratio |
|-----------|----------|-------------|-------|
| 2048 bits | 6.53ms | 6.60ms | 1.01x |
| 4096 bits | 27.9ms | 26.5ms | 0.95x |
| 8192 bits | 223.9ms | 137.2ms | **0.61x (39% faster)** |

### sin(kπ) benchmarks

| Expression | Baseline | This branch | Ratio |
|-----------|----------|-------------|-------|
| sin(π) | 4.37ms | 3.08ms | 0.70x |
| sin(2π) | 3.23ms | 2.62ms | 0.81x |
| sin(10π) | 7.71ms | 7.23ms | 0.94x |
| sin(100π) | 7.36ms | 7.15ms | 0.97x |

## Comparison with Event-Driven Experiment

The event-driven experiment (`experiment/event-driven-refinement`) showed 2x-507x
regressions because refiners raced to extreme precision before the coordinator
could stop them. InvOp refiners reached 4M-bit precision when only 512 was needed.

| Benchmark | Event-driven | This branch | Improvement |
|-----------|-------------|-------------|-------------|
| Integer roots | 2.1x slower | 0.81x (faster) | Fixed |
| Inverse (1/x) | 507x slower | 0.75x (faster) | Fixed |
| Sine | 2.4x slower | 0.19x (faster) | Fixed |
| Pi refinement (high prec) | 1.9x slower | 1.04x (same) | Fixed |

**All event-driven regressions are eliminated.** The demand-pacing model
(refiners only advance when the coordinator sends Step) prevents precision
overshoot entirely.

## Analysis

### Why early exit helps

The sine benchmark shows the most dramatic improvement (5.4x faster). This is
because the sin computation has many refiners (pi + input + sin series terms),
and in the old lock-step model, the coordinator had to wait for ALL refiners to
complete each round before checking if precision was already sufficient. With
early exit, the coordinator can return as soon as any single update pushes the
root bounds past the target.

### Why exhaustion tracking helps

Integer roots and inverse benchmarks improve because:
1. `BaseOp::refine_step` now correctly detects convergence (bounds are a point)
2. Converged refiners are skipped in future rounds, avoiding unnecessary work
3. Constants converge immediately, so in expressions like `1/x`, the constant
   base node stops after 1 round while the InvOp refiner continues

### Why pi benchmarks show slight overhead

Pi refinement has a single refiner chain (no parallelism to exploit). The early
exit check after each update adds a small overhead (~4%) compared to the simpler
lock-step loop. This is within noise for practical use.

### Precision overshoot: resolved

The fundamental problem from the event-driven experiment was that refiners
advanced without coordinator permission, causing exponential precision overshoot.
The round-based model preserves demand pacing: refiners only call `refine_step()`
when the coordinator sends `Step`, and the coordinator only sends `Step` when it
needs more precision. This makes precision overshoot impossible by construction.

## Correctness

- All 198 tests pass (including all 17 refinement tests)
- 20/20 flakiness runs clean
- `cargo clippy` clean
- `cargo fmt --check` clean
- No test assertion changes required

---

# Experiment: Demand Propagation — Skip Refiners Already Precise Enough

## Branch: `experiment/demand-propagation`

## Summary

Added width-based demand skipping to the round-based coordinator. Before each
round, the coordinator computes a demand budget `ε / 2^⌈log₂(N)⌉` (where N =
number of active refiners). Refiners whose bounds width is already below this
budget are skipped for that round. A safety valve steps all active refiners if
all were skipped but root precision isn't yet met.

This eliminates wasted computation on fast-converging refiners (like PiOp) that
reach extreme precision while slow-converging refiners (like NthRootOp bisection)
are still catching up.

## Architecture

### Before (round-based, no demand skipping)

```
loop {
    check precision → return if met
    count active refiners → if 0, return exhaustion error
    check iteration cap → error if exceeded
    send Step to ALL ACTIVE refiners
    for each response:
        apply update, check precision (early exit)
        if Exhausted → mark inactive
}
```

Problem: In `sqrt(2) + pi()`, PiOp doubles its terms each step (exponential
convergence). After ~2 rounds, PiOp's width is ~2^-130 but the coordinator
keeps stepping it. Each PiOp step becomes increasingly expensive as it
processes more terms, dominating the total runtime.

### After (demand-based skipping)

```
loop {
    check precision → return if met
    count active refiners → if 0, return exhaustion error
    check iteration cap → error if exceeded
    compute demand budget = ε / 2^⌈log₂(N)⌉
    for each active refiner:
        if bounds width ≤ budget → SKIP
        else → send Step
    if all were skipped → send Step to ALL (safety valve)
    for each response:
        apply update, check precision (early exit)
        if Exhausted → mark inactive
}
```

### Why the budget works

For AddOp, output width ≤ sum of child widths. If each of N refiners has
width ≤ ε/N, the root width ≤ ε. Using `ε / 2^⌈log₂(N)⌉` is conservative
(always ≤ ε/N) and cheap (just a bit shift on the exponent).

### Changes made

- `src/refinement.rs`: Added `compute_demand_budget` helper, modified round
  dispatch to check refiner bounds width against budget, added safety valve

No changes to `node.rs`, ops, or any other files.

## Benchmark Results

Back-to-back comparison on the same machine, release mode.
"Before" = same branch without demand propagation (git stash).
"After" = with demand propagation.

### Asymmetric convergence benchmark (target metric)

| Expression | ε | Before | After | Speedup |
|---|---|---|---|---|
| sqrt(2) + pi() | 2^-4 | 7.1ms | 134µs | **53x** |
| sqrt(2) + pi() | 2^-6 | 148ms | 172µs | **860x** |
| sqrt(2) * pi() | 2^-4 | 27.8ms | 134µs | **207x** |
| sqrt(2) * pi() | 2^-6 | 931ms | 150µs | **6,200x** |
| sqrt(2) + constant(3) | 2^-6 | 201µs | 113µs | 1.8x |
| pi() + constant(1) | 2^-6 | 24µs | 26µs | ~1x |
| sqrt(2) + cbrt(3) | 2^-6 | 342µs | 196µs | 1.7x |

The speedup grows with epsilon because each additional PiOp step doubles
its term count, making the wasted work exponentially more expensive.

### sin(kπ) benchmarks

sin(kπ) expressions also benefit because PiOp converges exponentially while
SinOp does bisection.

| Expression | Before | After | Speedup |
|---|---|---|---|
| sin(π) | 3.33ms | 699µs | **4.8x** |
| sin(2π) | 2.93ms | 572µs | **5.1x** |
| sin(10π) | 7.85ms | 684µs | **11.5x** |
| sin(100π) | 7.97ms | 738µs | **10.8x** |

### Main benchmarks (no regression)

| Benchmark | Before | After | Ratio |
|---|---|---|---|
| Complex expression (5000) | 79.6ms | 85.5ms | 1.07x |
| Summation (200k) | 309.4ms | 326.3ms | 1.05x |
| Integer roots (1000) | 356.8ms | 356.7ms | 1.00x |
| Inverse (100, 256-bit) | 9.2ms | 15.7ms | 1.71x |
| Sine (100, 128-bit) | 408.5ms | 393.8ms | 0.96x |

Note: The inverse benchmark shows some variance between runs (9-17ms range
on both configurations). This is likely thread scheduling noise given the
small sample count and short per-call time.

### Pi benchmarks (unchanged)

Single-refiner chains are unaffected — demand budget with N=1 gives
budget = ε/2, so the refiner is never skipped (its width must exceed ε,
which exceeds ε/2).

| Precision | Before | After | Ratio |
|---|---|---|---|
| 32 bits | 23.8µs | 33.3µs | 1.40x |
| 64 bits | 92.7µs | 103.5µs | 1.12x |
| 128 bits | 196.5µs | 215.0µs | 1.09x |
| 256 bits | 413.9µs | 462.8µs | 1.12x |
| 512 bits | 904.6µs | 1,009µs | 1.12x |
| 1024 bits | 2,217µs | 2,821µs | 1.27x |
| 2048 bits | 6.60ms | 7.33ms | 1.11x |
| 4096 bits | 26.5ms | 28.6ms | 1.08x |
| 8192 bits | 137.2ms | 145.5ms | 1.06x |

The ~10% overhead is from calling `cached_bounds()` + `bounds_width_leq()`
per refiner per round. For single-refiner chains this is pure overhead
(the refiner is never skipped), but the cost is small in absolute terms.

### Pi arithmetic

| Operation | Before | After | Ratio |
|---|---|---|---|
| 2π | 122.7µs | 138.4µs | 1.13x |
| π/2 | 122.1µs | 117.7µs | 0.96x |
| π² | 147.3µs | 144.8µs | 0.98x |
| 1/π | 219.5µs | 116.7µs | **0.53x (1.9x faster)** |

The 1/π improvement is because demand propagation skips the PiOp base
refiner once its width is below the budget, letting InvOp finish faster.

## Four-Way Comparison

| Benchmark | Lock-step | Event-driven | Round-based | + Demand prop |
|---|---|---|---|---|
| Complex (5000) | 78.3ms | — | 79.6ms | 85.5ms |
| Integer roots (1000) | 438.2ms | 2.1x slower | 356.8ms | 356.7ms |
| Inverse (100) | 12.2ms | 507x slower | 9.2ms | ~15ms |
| Sine (100) | 2,196ms | 2.4x slower | 408.5ms | 393.8ms |
| sqrt(2)+pi() ε=2^-6 | ~148ms | ~0.2ms | ~148ms | **~0.17ms** |
| sin(π) | 4.37ms | — | 3.08ms | **0.70ms** |
| sin(100π) | 7.36ms | — | 7.15ms | **0.74ms** |

Demand propagation achieves event-driven-like speed on asymmetric cases
(~0.17ms vs ~0.2ms for sqrt(2)+pi()) while keeping round-based safety
(no precision overshoot, no inv regression).

## Analysis

### Why demand propagation is transformative for asymmetric expressions

In `sqrt(2) + pi()` at ε=2^-6, PiOp doubles its number of Machin-like terms
each step. After round 2, PiOp's width is already ~2^-130 (far below the
demand budget of 2^-8). Without demand skipping, the coordinator steps PiOp
~20 more times, each doubling the term count. The final PiOp step processes
thousands of terms, taking >100ms alone. With demand skipping, PiOp is
stepped exactly twice and then skipped for all remaining rounds.

The effect compounds: the multiplication case (`sqrt(2) * pi()`) is even
more dramatic (6,200x) because MulOp propagates the magnified precision
requirement to PiOp, causing even more term accumulation.

### Why standard benchmarks are unaffected

- **Single-refiner chains** (pi refinement): Budget = ε/2, refiner width > ε,
  so the refiner is never skipped. Small overhead from the budget check.
- **Symmetric convergence** (integer roots, summation): All refiners converge
  at similar rates, so none fall below the budget early.
- **Sine (100 samples)**: Mixed inputs — some benefit from demand skipping
  (those involving pi), some don't. Net effect is neutral.

### Safety valve correctness

The safety valve (step all if all were skipped) handles cases where the
width heuristic is insufficient, e.g. MulOp where a wide-but-small factor
amplifies a narrow-but-large factor. In practice the safety valve rarely
fires — it's a backstop against theoretical edge cases.

## Demand budget check overhead investigation

The demand check calls `cached_bounds()` + `bounds_width_leq()` per refiner
per round. Initial (non-randomized) benchmarks suggested ~2x overhead on
single-refiner cases. Three optimizations were tested:

1. **Cache width alongside bounds** — Store pre-computed `UXBinary` width in a
   separate `width_cache` field on Node, avoiding the full `Bounds` clone in
   the demand check.
2. **Skip check for N=1** — Bypass the demand budget entirely when only one
   refiner is active (budget = ε/2, refiner width > ε, so it's never skipped).
3. **Check every K rounds** — Skip the demand check on the first round (where
   refiners are initializing and unlikely to be skippable).

### Results (5 trials, randomized execution order)

| Benchmark | Baseline | Cache width | Skip single | Check every K |
|---|---|---|---|---|
| pi/ref/128 | 1.64ms | 1.63ms (-1%) | 1.74ms (+6%) | 1.82ms (+11%) |
| pi/ref/512 | 5.07ms | 5.71ms (+13%) | 4.44ms (-12%) | 5.13ms (+1%) |
| sin | 804ms | 712ms (-11%) | 773ms (-4%) | 772ms (-4%) |
| AC total | 17.9ms | 18.3ms (+2%) | 19.2ms (+8%) | 21.2ms (+19%) |

**All differences are within noise** (trial-to-trial variance of 30-50%
dominates the between-variant differences). The initial "~2x overhead" finding
was an artifact of non-randomized execution order — baseline always ran first
(cold CPU), while variants ran later (warm CPU).

**Conclusion**: The demand budget check is already cheap enough that optimizing
it produces no measurable improvement. The code is left as-is.

## Correctness

- All 198 tests pass (including all 17 refinement tests)
- 20/20 flakiness runs clean
- `cargo clippy` clean
- `cargo fmt --check` clean
- No test assertion changes required
- No changes to traits, ops, or node infrastructure