# deep-delta-learning

[![CI](https://github.com/AbdelStark/ddl-rs/actions/workflows/ci.yml/badge.svg)](https://github.com/AbdelStark/ddl-rs/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-violet.svg)](LICENSE)
[![Rust](https://img.shields.io/badge/rust-1.89%2B-orange.svg)](https://www.rust-lang.org)

A Rust implementation of **Deep Delta Learning** (DDL) on [Burn](https://burn.dev) — rank-deficient residual operators with learnable spectral properties for transformer language models.

```
x' = x + beta * k * (v - k^T x)
         \____________________/
          correction toward target v
          along unit direction k,
          scaled by learnable strength beta in (0, 2)
```

DDL replaces the standard residual connection `x' = x + f(x)` with a geometrically motivated update that projects the hidden state toward a learned target along a learned direction. The scalar `beta` controls the spectral regime of each layer — from near-identity (beta -> 0) through projection (beta -> 1) to reflection (beta -> 2) — giving the model a continuous knob between "change nothing" and "invert the residual stream."
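The update above can be sketched on plain slices, independent of the crate. This standalone snippet assumes `k` is already unit-normalized, and checks the three landmark regimes: beta = 0 leaves `x` untouched, beta = 1 projects the k-component exactly onto the target `v`, and beta = 2 reflects it through `v`.

```rust
// Rank-1 delta update: x' = x + beta * k * (v - k^T x), with k unit-normalized.
fn delta_update(x: &[f64], k: &[f64], v: f64, beta: f64) -> Vec<f64> {
    // Projection of x onto the direction k.
    let proj: f64 = x.iter().zip(k).map(|(xi, ki)| xi * ki).sum();
    // Move toward the target v along k, scaled by beta.
    x.iter()
        .zip(k)
        .map(|(xi, ki)| xi + beta * ki * (v - proj))
        .collect()
}

fn main() {
    let x = [3.0, 4.0];
    let k = [1.0, 0.0]; // unit direction: proj = x[0] = 3
    let v = 1.0;

    // beta = 0: near-identity, x unchanged.
    assert_eq!(delta_update(&x, &k, v, 0.0), vec![3.0, 4.0]);

    // beta = 1: exact projection, so k^T x' == v.
    let x1 = delta_update(&x, &k, v, 1.0);
    assert!((x1[0] - v).abs() < 1e-12);

    // beta = 2: reflection of the k-component through v (2v - proj = -1).
    let x2 = delta_update(&x, &k, v, 2.0);
    assert!((x2[0] - (2.0 * v - 3.0)).abs() < 1e-12);
    println!("x1 = {x1:?}, x2 = {x2:?}");
}
```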

## Status

As of 2026-03-17, `deep-delta-learning` is **alpha**. It is suitable for CPU-based research
prototypes, CLI experiments, checkpoint roundtrips, and benchmark/report generation. It is **not**
yet suitable for production inference services, large-scale training, or a semver-stable public
API. Known limitations include CPU/ndarray-only execution, in-memory toy dataset workflows, and
expected API churn before 1.0. No specific breaking change is scheduled next, but pre-1.0 API
cleanup should be expected.

See [CHANGELOG.md](CHANGELOG.md) for user-visible changes and
[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the current system/runbook view.

## Installation

```bash
git clone https://github.com/AbdelStark/ddl-rs.git
cd ddl-rs
cargo build --release
```

Requires Rust 2024 edition. CPU/ndarray backend only (no GPU required).

## 30-Second Demo

```bash
# Train all 6 variants, compare, and benchmark — one command
cargo run --release --bin ddl -- train \
  --train-tokens "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16" \
  --valid-tokens "5 6 7 8 9 10 11 12 13 14 15 16 1 2 3 4" \
  --all-variants --out /tmp/ddl-demo --seq-len 8 --max-steps 32

# Interactive TUI explorer (generates demo data if run with no args)
cargo run --release --features tui --bin ddl-tui
```

## The Delta Operator

The core primitive is a **rank-1 spectral update**. Given hidden state `x in R^d`, direction `k in R^d` (unit-normalized), target scalar `v in R`, and mixing coefficient `beta in (0, 2)`:

```
Vector form (d_value = 1)
─────────────────────────
proj = k^T x                   [scalar]
x'   = x + beta * k * (v - proj)

Matrix form (d_value > 1)
─────────────────────────
P  = k^T X                     [1, d_value]
X' = X + beta * k * (V - P)    [d, d_value]
```

**Spectral properties** of the induced linear map `A = I - beta * k k^T`:

| Property | Formula | Interpretation |
|----------|---------|----------------|
| k-eigenvalue | `1 - beta` | Scaling along projection direction |
| Orthogonal eigenvalues | `1` | Unchanged in the nullspace of k |
| Determinant | `1 - beta` | Volume scaling (vector state) |
| Lifted determinant | `(1 - beta)^d_value` | Volume scaling (matrix state) |

The eigenvalue `1 - beta` defines five **spectral regimes**:

```
beta:    0.0      0.3      0.7      1.3      1.7      2.0
          |--------|--------|--------|--------|--------|
            Near    Inter-    Near     Near    Strong
          Identity polating   Proj.    Refl.  Reflection

          "keep x"          "project"          "invert"
```
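Classifying a layer by the diagram's thresholds is a one-liner. This sketch uses the boundary values shown above (0.3 / 0.7 / 1.3 / 1.7); the crate's own classifier lives in `src/spectral.rs` and may name the regimes differently.

```rust
// Map beta to its spectral regime via the k-eigenvalue lambda = 1 - beta.
fn regime(beta: f64) -> &'static str {
    match beta {
        b if b < 0.3 => "near-identity",
        b if b < 0.7 => "interpolating",
        b if b < 1.3 => "near-projection",
        b if b < 1.7 => "near-reflection",
        _ => "strong-reflection",
    }
}

fn main() {
    assert_eq!(regime(0.1), "near-identity");    // eigenvalue ~ 1
    assert_eq!(regime(1.0), "near-projection");  // eigenvalue 1 - beta = 0
    assert_eq!(regime(1.9), "strong-reflection"); // eigenvalue ~ -1
    println!("beta=0.5 -> {} (k-eigenvalue {})", regime(0.5), 1.0 - 0.5);
}
```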

## Architecture

A DDL transformer block replaces the standard `x + Attn(x)` and `x + MLP(x)` with:

```
                    ┌─────────────────────┐
                    │     Input x         │
                    └──────────┬──────────┘
                    ┌──────────▼──────────┐
                    │    RMSNorm + MHA    │──── backbone_out
                    └──────────┬──────────┘
            ┌──────────────────┼──────────────────┐
            │                  │                  │
     ┌──────▼──────┐   ┌──────▼──────┐   ┌──────▼──────┐
     │  k-branch   │   │  v-branch   │   │ beta-branch │
     │  k = f(out) │   │  v = g(out) │   │ beta = h(x) │
     │  ||k|| = 1  │   │             │   │ 0 < beta < 2│
     └──────┬──────┘   └──────┬──────┘   └──────┬──────┘
            │                 │                  │
            └─────────┬───────┘                  │
                      │                          │
              ┌───────▼──────────────────────────▼───┐
              │  x' = x + beta * k * (v - k^T x)     │
              │       delta residual update          │
              └──────────────┬───────────────────────┘
                  ┌──────────▼──────────┐
                  │  RMSNorm + SwiGLU   │──── repeat for MLP sublayer
                  └──────────┬──────────┘
                  ┌──────────▼──────────┐
                  │     Output x'       │
                  └─────────────────────┘
```

Each transformer block applies the delta residual update **twice** — once after attention, once after the MLP — giving `2 * num_layers` total beta/k/v branches.

## Model Variants

Six variants explore the design space along three axes: **state rank** (scalar vs matrix), **compression** (token-conv vs channel-conv), and **embedding expansion**.

| Variant | CLI slug | d_value | Compression | Embed Conv | Description |
|---------|----------|---------|-------------|------------|-------------|
| Baseline | `baseline` | — | — | — | Standard transformer (no DDL) |
| DDL Vector | `ddl` | 1 | TokenConv | No | Rank-1 delta operator. Cheapest DDL variant. |
| DDL Matrix | `ddl-tokenconv` | d_value | TokenConv | No | Matrix-state delta with token compression. |
| DDL Matrix+EC | `ddl-ec` | d_value | TokenConv | Yes | + embedding expansion convolution. |
| DDL Channel | `ddl-cc` | d_value | ChannelConv | No | Matrix-state with channel compression. |
| DDL Channel+EC | `ddl-cc-ec` | d_value | ChannelConv | Yes | Full variant: channel compression + embed conv. |

**Baseline** uses a standard residual `x + f(x)`. All DDL variants replace it with the delta update. The matrix variants (`d_value > 1`) maintain a `[batch, seq, d_model, d_value]` state tensor, giving each position a richer internal representation at the cost of `O(d_value)` more computation per layer.
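The matrix-state update applies the same rank-1 map to each of the `d_value` columns independently, which is why the lifted determinant is `(1 - beta)^d_value`. A self-contained sketch on nested `Vec`s (illustrative only; the crate uses Burn tensors):

```rust
// Matrix-state delta update: X is [d][d_value], k is [d], V is [d_value].
// Each column j is updated by x_j' = x_j + beta * k * (v_j - k^T x_j).
fn delta_update_matrix(x: &[Vec<f64>], k: &[f64], v: &[f64], beta: f64) -> Vec<Vec<f64>> {
    let d_value = v.len();
    // proj[j] = k^T X[:, j] — one scalar projection per column.
    let proj: Vec<f64> = (0..d_value)
        .map(|j| x.iter().zip(k).map(|(row, ki)| row[j] * ki).sum())
        .collect();
    x.iter()
        .zip(k)
        .map(|(row, ki)| {
            (0..d_value)
                .map(|j| row[j] + beta * ki * (v[j] - proj[j]))
                .collect()
        })
        .collect()
}

fn main() {
    // d = 2, d_value = 2; k picks out the first coordinate of each column.
    let x = vec![vec![3.0, 5.0], vec![4.0, 6.0]];
    let k = [1.0, 0.0];
    let v = [1.0, 2.0];

    // beta = 1: every column's k-component lands exactly on its target.
    let out = delta_update_matrix(&x, &k, &v, 1.0);
    assert_eq!(out[0], vec![1.0, 2.0]); // first row hit the targets
    assert_eq!(out[1], vec![4.0, 6.0]); // nullspace of k untouched
}
```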

## CLI Reference

### train

Train one or more variants and save checkpoints + reports.

```bash
ddl train \
  --train-file data/train.txt --valid-file data/valid.txt \
  --out artifacts/ --all-variants \
  --d-model 128 --layers 6 --heads 8 --vocab-size 32000 --d-value 4 \
  --max-steps 1000 --eval-interval 100 --learning-rate 3e-4 \
  --warmup-steps 50 --weight-decay 0.1 --grad-clip 1.0
```

Outputs per-variant checkpoint directories and `training_comparison.json` with loss curves, spectral snapshots at each eval interval, and variant rankings.

`--resume <dir>` continues training from a saved single-variant checkpoint.

### compare

Evaluate saved checkpoints on a dataset.

```bash
ddl compare \
  --artifact artifacts/ddl --artifact artifacts/baseline \
  --data-file data/test.txt --out comparison.json \
  --diagnostics spectral --best-validation
```

### benchmark

Measure operator, block, and full-model latency/memory.

```bash
ddl benchmark --all-variants --iterations 50 --warmup 10 --out bench.json

# Compare against a saved baseline with regression gates
ddl benchmark --all-variants --out current.json \
  --baseline saved_baseline.json --comparison-out regression.json \
  --min-throughput-ratio 0.9 --max-p95-latency-ratio 1.2
```

Suites: `operator`, `block`, `compressor`, `normalization`, `model` (default: all).

### generate

Autoregressive token generation from a checkpoint.

```bash
ddl generate \
  --checkpoint artifacts/ddl --prompt-tokens "1,2,3,4" \
  --max-new-tokens 64 --do-sample --temperature 0.8 --top-k 50
```
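The `--do-sample --top-k` path amounts to keeping the k highest logits, applying a temperature-scaled softmax over the survivors, and sampling from that distribution. A sketch of the filtering/renormalization step on plain `Vec`s — names here are illustrative, not the crate's API:

```rust
// Keep the top-k logits and return (token_index, probability) pairs.
fn top_k_probs(logits: &[f64], k: usize, temperature: f64) -> Vec<(usize, f64)> {
    let mut indexed: Vec<(usize, f64)> = logits.iter().copied().enumerate().collect();
    // Sort descending by logit and keep the top k.
    indexed.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    indexed.truncate(k);
    // Temperature-scaled softmax over survivors (subtract max for stability).
    let max = indexed[0].1;
    let exps: Vec<f64> = indexed
        .iter()
        .map(|(_, l)| ((l - max) / temperature).exp())
        .collect();
    let z: f64 = exps.iter().sum();
    indexed.iter().zip(&exps).map(|(&(i, _), e)| (i, e / z)).collect()
}

fn main() {
    let logits = vec![2.0, 1.0, 0.0, -1.0];
    let probs = top_k_probs(&logits, 2, 1.0);
    assert_eq!(probs.len(), 2);        // only the top 2 tokens survive
    assert_eq!(probs[0].0, 0);         // highest logit ranks first
    assert!(probs[0].1 > probs[1].1);  // and gets the larger probability
    let total: f64 = probs.iter().map(|(_, p)| p).sum();
    assert!((total - 1.0).abs() < 1e-12); // renormalized to a distribution
}
```

Lower temperatures sharpen the distribution toward the top token; `--top-k 50` with `--temperature 0.8` (as in the example above) is a common middle ground.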

## TUI Explorer

An interactive terminal UI for visualizing training and benchmark results.

```bash
# Auto-generates demo data — zero config needed
cargo run --release --features tui --bin ddl-tui

# Or load your own reports
cargo run --release --features tui --bin ddl-tui -- \
  artifacts/training_comparison.json bench.json
```

<table>
<tr>
<td align="center"><strong>Dashboard</strong> — Variant Comparison<br><img src="docs/assets/img/tui-dashboard.png" width="420" alt="Dashboard view showing variant comparison table with loss deltas and model/training config panels"></td>
<td align="center"><strong>Training</strong> — Loss Curves & Metrics<br><img src="docs/assets/img/tui-training.png" width="420" alt="Training view showing loss chart with train/validation curves, LR schedule, and step metrics table"></td>
</tr>
<tr>
<td align="center"><strong>Spectral</strong> — Beta & Eigenvalue Diagnostics<br><img src="docs/assets/img/tui-spectral.png" width="420" alt="Spectral view showing beta per layer bars with regime colors, regime distribution, k-eigenvalue and k-coherence"></td>
<td align="center"><strong>Benchmark</strong> — Throughput & Memory<br><img src="docs/assets/img/tui-benchmark.png" width="420" alt="Benchmark view showing model throughput bars, component timing table, and memory usage"></td>
</tr>
</table>

Navigate with `1-4` / `Tab` to switch views, `j/k` to select variants, `?` for help, `q` to quit.

## Architecture & Operations

The short version:

- `ddl` is the shipped automation surface for train/compare/benchmark/generate.
- Checkpoints are explicit directories with manifest, weights, report, optimizer state, and
  optional `best_validation/` state.
- Invalid evaluation inputs fail with typed library or CLI errors rather than degrading into
  placeholder metrics.

For the full component map, invariants, validation commands, and failure-mode runbook, see
[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md).

## Configuration Reference

### DdlConfig

| Field | Default | Description |
|-------|---------|-------------|
| `d_model` | *required* | Hidden dimension. Must be divisible by `num_heads`. |
| `num_layers` | *required* | Number of transformer blocks. |
| `num_heads` | *required* | Number of attention heads. |
| `vocab_size` | *required* | Token vocabulary size. |
| `d_value` | 1 | Delta state rank. 1 = vector (rank-1). >1 = matrix state. |
| `head_dim` | d_model/heads | Per-head dimension. |
| `mlp_hidden` | ceil(8d/3) | SwiGLU intermediate dimension. |
| `mapping` | KMap | Branch mapping: `KMap` or `VMap`. |
| `beta_init` | 0.5 | Initial beta value. Must be in (0, 2). |
| `beta_single_linear` | true | Single-linear beta gate. false = two-layer. |
| `k_eps` | 1e-6 | k-direction normalization epsilon. |
| `compression` | TokenConv | `TokenConv` or `ChannelConv`. |
| `shortconv_kernel_size` | 4 | Causal depthwise conv kernel size. |
| `embed_conv` | false | Enable embedding expansion convolution. |
| `max_seq_len` | 128 | Maximum sequence length (RoPE). |
| `rope_theta` | 10000.0 | RoPE frequency base. |
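One plausible way to keep `beta` strictly inside (0, 2) — consistent with the "sigmoid scaling" noted for `beta_branch.rs` below, though the crate's actual gate may differ — is `beta = 2 * sigmoid(logit)`, with the initial logit solved from `beta_init`:

```rust
// Hypothetical (0, 2) parametrization: beta = 2 * sigmoid(logit).
// This is a sketch of the constraint, not the crate's implementation.
fn beta_from_logit(logit: f64) -> f64 {
    2.0 / (1.0 + (-logit).exp())
}

// Inverse: the logit that yields a given beta_init (e.g. the 0.5 default).
fn logit_for_beta(beta: f64) -> f64 {
    (beta / (2.0 - beta)).ln()
}

fn main() {
    // Round-trip the default beta_init = 0.5.
    let logit = logit_for_beta(0.5);
    assert!((beta_from_logit(logit) - 0.5).abs() < 1e-12);
    // The range is open: beta stays strictly inside (0, 2) for moderate logits.
    assert!(beta_from_logit(-30.0) > 0.0 && beta_from_logit(30.0) < 2.0);
}
```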

### TrainingConfig

| Field | Default | Description |
|-------|---------|-------------|
| `max_steps` | 32 | Total training steps per phase. |
| `eval_interval` | 4 | Steps between validation + spectral snapshots. |
| `learning_rate` | 1e-3 | Peak learning rate. |
| `min_learning_rate` | 0.0 | Cosine decay floor. |
| `warmup_steps` | 0 | Linear warmup steps. |
| `weight_decay` | 0.1 | AdamW weight decay. |
| `beta1` / `beta2` | 0.9 / 0.95 | AdamW momentum coefficients. |
| `epsilon` | 1e-5 | AdamW epsilon. |
| `grad_clip` | 1.0 | Gradient norm clipping threshold. |
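The fields above imply a linear-warmup-then-cosine-decay schedule: ramp from 0 to `learning_rate` over `warmup_steps`, then cosine-decay to `min_learning_rate` by `max_steps`. A sketch of that shape (the crate's exact schedule lives in `src/training.rs`):

```rust
use std::f64::consts::PI;

// Cosine decay with linear warmup, as suggested by the TrainingConfig fields.
fn lr_at(step: usize, max_steps: usize, warmup: usize, peak: f64, floor: f64) -> f64 {
    if step < warmup {
        // Linear warmup from 0 up to the peak learning rate.
        peak * (step as f64 + 1.0) / (warmup as f64)
    } else {
        // Cosine decay from peak down to the floor.
        let t = (step - warmup) as f64 / (max_steps - warmup).max(1) as f64;
        floor + 0.5 * (peak - floor) * (1.0 + (PI * t).cos())
    }
}

fn main() {
    let (peak, floor) = (1e-3, 0.0);
    // Warmup ends exactly at the peak learning rate.
    assert!((lr_at(49, 1000, 50, peak, floor) - peak).abs() < 1e-12);
    // Early steps are well below the peak.
    assert!(lr_at(0, 1000, 50, peak, floor) < peak);
    // By max_steps the rate has decayed to the floor.
    assert!((lr_at(1000, 1000, 50, peak, floor) - floor).abs() < 1e-9);
}
```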

## Library Usage

```rust
use burn::backend::{Autodiff, NdArray};
use deep_delta_learning::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    type B = Autodiff<NdArray>;
    let device = Default::default();

    // Configure and train
    let config = DdlConfig::new(64, 2, 4, 256);
    let training = TrainingConfig::new().with_max_steps(100);
    let batcher = TokenBatcher::try_new(TokenBatchingConfig::try_new(4, 16)?)?;
    let tokens: Vec<u32> = (0..512).map(|i| i % 256).collect(); // toy token stream
    let dataset = batcher.batch_tokens(&tokens);
    let val_dataset = dataset.clone();

    let outcome = train_variant::<B>(
        &config, ModelVariant::DdlVector, &device,
        &dataset, Some(&val_dataset), &training,
    )?;

    // Inspect results
    println!("Final loss: {:.4}", outcome.report.final_train.loss);
    println!("Parameters: {}", outcome.report.num_params);

    // Spectral diagnostics
    if let Some(spectral) = &outcome.report.final_train_spectral {
        for (i, regime) in spectral.regime_per_layer.iter().enumerate() {
            println!("Layer {i}: beta={:.3} regime={:?}",
                spectral.beta_per_layer[i], regime);
        }
    }

    // Save and reload
    save_training_artifact(&outcome, std::path::PathBuf::from("checkpoint/"))?;
    Ok(())
}
```

## Project Structure

```
src/
  delta_operator.rs     Core rank-1 spectral operator: x' = x - beta * k(k^T x)
  delta_res_block.rs    Residual update: x' = x + beta * k * (v - k^T x)
  beta_branch.rs        Learnable beta in (0, 2) via sigmoid scaling
  k_branch.rs           Unit-normalized projection direction
  v_branch.rs           Scalar and matrix target value branches
  transformer.rs        DDL transformer block and stack
  baseline.rs           Standard transformer (no DDL) for comparison
  variant.rs            6 model variants and config resolution
  spectral.rs           Regime classification, per-layer diagnostics, collectors
  training.rs           Training loop with cosine-warmup LR, spectral snapshots
  checkpoint.rs         Checkpoint save/load with manifest and optimizer state
  eval.rs               Variant evaluation and comparison ranking
  benchmark.rs          Timing, memory profiling, and regression gates
  generation.rs         Autoregressive generation with top-k/top-p sampling
  config.rs             Model configuration and validation
  data.rs               Token batching and dataset construction
  attention.rs          Multi-head attention with RoPE
  mlp.rs                SwiGLU feed-forward
  compressor.rs         Token-conv and channel-conv state compression
  bin/ddl/              CLI: train, compare, benchmark, generate
  bin/ddl-tui/          Interactive terminal UI (ratatui)
tests/                  Integration tests for all modules
examples/               End-to-end demos (train_tiny, compare_models, spectral_report, ...)
fixtures/               Benchmark regression baselines and sample data
```

## Stack

| Layer | Technology |
|-------|-----------|
| Language | Rust (2024 edition) |
| ML framework | [Burn](https://burn.dev) 0.20 |
| Backend | ndarray + autodiff (CPU) |
| Serialization | serde / serde_json |
| TUI | ratatui + crossterm (optional, `--features tui`) |

## Development

Validation commands that are expected to pass today:

```bash
cargo check
cargo test
cargo test --examples
cargo build --features tui
cargo clippy --all-targets --all-features -- -D warnings
```

Coverage is enforced in CI with `cargo-llvm-cov` over the library/test contract, excluding
`src/bin/` entrypoints that currently only have smoke coverage. Benchmark policy is contract-based:
CI verifies benchmark report generation and regression-gate behavior, but does not yet enforce
absolute latency budgets on hosted runners.

See [CONTRIBUTING.md](CONTRIBUTING.md) for the contributor workflow.

## Known Limitations

- CPU/ndarray backend only; no GPU path is enabled in the manifest.
- Training and evaluation workflows assume local, in-memory token datasets rather than streaming
  corpora.
- Benchmark memory numbers are estimates intended for regression tracking, not allocator-precise
  telemetry.
- The crate is pre-1.0; public APIs may change as typed error handling and docs continue to harden.

## Roadmap

1. Contract hardening: finish typed user-boundary errors and expand rustdoc coverage across the
   public API.
2. Research workflow depth: improve dataset ergonomics, artifact metadata, and richer diagnostics.
3. Backend expansion: evaluate GPU-enabled Burn backends only after CPU semantics remain stable.

## Changelog

User-visible changes are tracked in [CHANGELOG.md](CHANGELOG.md).

## Help

- File bugs and contract mismatches in the GitHub issue tracker:
  `https://github.com/AbdelStark/ddl-rs/issues`
- Check the CI workflow contract in [.github/workflows/ci.yml](.github/workflows/ci.yml).

## License

[MIT](LICENSE)