deep-delta-learning
A Rust implementation of Deep Delta Learning (DDL) on Burn — rank-deficient residual operators with learnable spectral properties for transformer language models.
x' = x + beta * k * (v - k^T x)
         \____________________/
          correction toward target v
          along projection direction k,
          scaled by learnable strength beta in (0, 2)
DDL replaces the standard residual connection x' = x + f(x) with a geometrically motivated update that projects the hidden state toward a learned target along a learned direction. The scalar beta controls the spectral regime of each layer — from near-identity (beta -> 0) through projection (beta -> 1) to reflection (beta -> 2) — giving the model a continuous knob between "change nothing" and "invert the residual stream."
Status
As of 2026-03-17, deep-delta-learning is alpha. It is suitable for CPU-based research
prototypes, CLI experiments, checkpoint roundtrips, and benchmark/report generation. It is not
yet suitable for production inference services, large-scale training, or a semver-stable public
API. Known limitations include CPU/ndarray-only execution, in-memory toy dataset workflows, and
expected API churn before 1.0. No specific breaking change is scheduled next, but pre-1.0 API
cleanup should be expected.
See CHANGELOG.md for user-visible changes and docs/ARCHITECTURE.md for the current system/runbook view.
Installation
Requires Rust 2024 edition. CPU/ndarray backend only (no GPU required).
30-Second Demo
- Train all 6 variants, compare, and benchmark — one command.
- Interactive TUI explorer (generates demo data if run with no args).
The Delta Operator
The core primitive is a rank-1 spectral update. Given hidden state x in R^d, direction k in R^d (unit-normalized), target scalar v in R, and mixing coefficient beta in (0, 2):
Vector form (d_value = 1)
─────────────────────────
proj = k^T x
x' = x + beta * k * (v - proj)
Matrix form (d_value > 1), state X in [d, d_value]
─────────────────────────
P = k^T X                    [1, d_value]
X' = X + beta * k * (V - P)  [d, d_value]
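In plain Rust (no Burn tensors), the vector-form update is only a few lines. This is an illustrative sketch of the math above, not the crate's actual `delta_res_block.rs` implementation:

```rust
/// Rank-1 delta residual update: x' = x + beta * k * (v - k^T x).
/// `k` is assumed unit-normalized; `beta` should lie in (0, 2).
fn delta_update(x: &[f64], k: &[f64], v: f64, beta: f64) -> Vec<f64> {
    // proj = k^T x, the component of x along the projection direction
    let proj: f64 = k.iter().zip(x).map(|(ki, xi)| ki * xi).sum();
    // move x toward the target v along k, scaled by beta
    x.iter()
        .zip(k)
        .map(|(xi, ki)| xi + beta * ki * (v - proj))
        .collect()
}
```

With beta = 1 this is a pure projection: the k-component of x' equals v exactly, and the orthogonal part of x is untouched.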
Spectral properties of the induced linear map A = I - beta * k k^T:
| Property | Formula | Interpretation |
|---|---|---|
| k-eigenvalue | 1 - beta | Scaling along projection direction |
| Orthogonal eigenvalues | 1 | Unchanged in the nullspace of k |
| Determinant | 1 - beta | Volume scaling (vector state) |
| Lifted determinant | (1 - beta)^d_value | Volume scaling (matrix state) |
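The first two rows can be checked numerically by applying the homogeneous map A = I - beta * k k^T directly (illustrative helper, not crate API):

```rust
/// Apply A = I - beta * k k^T to x (the homogeneous part of the delta update).
fn apply_a(x: &[f64], k: &[f64], beta: f64) -> Vec<f64> {
    let proj: f64 = k.iter().zip(x).map(|(ki, xi)| ki * xi).sum();
    x.iter().zip(k).map(|(xi, ki)| xi - beta * ki * proj).collect()
}
```

Applying `apply_a` to k itself scales it by 1 - beta, while any vector orthogonal to k passes through unchanged — exactly the two eigenvalue rows in the table.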
The eigenvalue 1 - beta defines five spectral regimes:
beta:  0.0       0.3       0.7       1.3       1.7      2.0
        |---------|---------|---------|---------|--------|
           Near     Inter-     Near      Near     Strong
         Identity  polating Projection Reflection Reflection
         "keep x"            "project"            "invert"
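A classifier over these regimes might look like the following; the cut points are read off the diagram above, and the crate's `spectral.rs` may use different thresholds or labels:

```rust
/// Map beta to one of the five spectral regimes sketched above.
/// Boundaries (0.3 / 0.7 / 1.3 / 1.7) are taken from the diagram.
fn regime(beta: f64) -> &'static str {
    match beta {
        b if b < 0.3 => "Near Identity",
        b if b < 0.7 => "Interpolating",
        b if b < 1.3 => "Near Projection",
        b if b < 1.7 => "Near Reflection",
        _ => "Strong Reflection",
    }
}
```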
Architecture
A DDL transformer block replaces the standard x + Attn(x) and x + MLP(x) with:
┌─────────────────────┐
│ Input x │
└──────────┬──────────┘
│
┌──────────▼──────────┐
│ RMSNorm + MHA │──── backbone_out
└──────────┬──────────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ k-branch │ │ v-branch │ │ beta-branch │
│ k = f(out) │ │ v = g(out) │ │ beta = h(x) │
│ ||k|| = 1 │ │ │ │ 0 < beta < 2│
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
└─────────┬───────┘ │
│ │
┌───────▼──────────────────────────▼───┐
│ x' = x + beta * k * (v - k^T x) │
│ delta residual update │
└──────────────┬───────────────────────┘
│
┌──────────▼──────────┐
│ RMSNorm + SwiGLU │──── repeat for MLP sublayer
└──────────┬──────────┘
│
┌──────────▼──────────┐
│ Output x' │
└─────────────────────┘
Each transformer block applies the delta residual update twice — once after attention, once after the MLP — giving 2 * num_layers total beta/k/v branches.
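The beta-branch constraint 0 < beta < 2 is described in `beta_branch.rs` as "sigmoid scaling"; one standard way to realize it is a scaled logistic gate (a sketch under that assumption, not the crate's exact code):

```rust
/// Beta gate: squash a raw pre-activation z into (0, 2) via a scaled sigmoid.
/// z = 0 gives beta = 1 (the projection regime); large |z| approaches the
/// open-interval endpoints without ever reaching them.
fn beta_gate(z: f64) -> f64 {
    2.0 / (1.0 + (-z).exp())
}
```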
Model Variants
Six variants explore the design space along three axes: state rank (scalar vs matrix), compression (token-conv vs channel-conv), and embedding expansion.
| Variant | CLI slug | d_value | Compression | Embed Conv | Description |
|---|---|---|---|---|---|
| Baseline | baseline | — | — | — | Standard transformer (no DDL) |
| DDL Vector | ddl | 1 | TokenConv | No | Rank-1 delta operator. Cheapest DDL variant. |
| DDL Matrix | ddl-tokenconv | d_value | TokenConv | No | Matrix-state delta with token compression. |
| DDL Matrix+EC | ddl-ec | d_value | TokenConv | Yes | + embedding expansion convolution. |
| DDL Channel | ddl-cc | d_value | ChannelConv | No | Matrix-state with channel compression. |
| DDL Channel+EC | ddl-cc-ec | d_value | ChannelConv | Yes | Full variant: channel compression + embed conv. |
Baseline uses a standard residual x + f(x). All DDL variants replace it with the delta update. The matrix variants (d_value > 1) maintain a [batch, seq, d_model, d_value] state tensor, giving each position a richer internal representation at the cost of O(d_value) more computation per layer.
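The matrix-state update generalizes the vector form column-wise; a plain-Rust sketch (illustrative only, with X stored as d rows of d_value columns):

```rust
/// Matrix-state delta update (d_value > 1): each of the d_value columns of X
/// gets its own target in `v`, all sharing one direction k and one beta.
fn delta_update_matrix(x: &[Vec<f64>], k: &[f64], v: &[f64], beta: f64) -> Vec<Vec<f64>> {
    let d_value = v.len();
    // proj[j] = k^T X[:, j], one projection per column
    let proj: Vec<f64> = (0..d_value)
        .map(|j| k.iter().zip(x).map(|(ki, row)| ki * row[j]).sum())
        .collect();
    x.iter()
        .zip(k)
        .map(|(row, ki)| {
            row.iter()
                .enumerate()
                .map(|(j, &xij)| xij + beta * ki * (v[j] - proj[j]))
                .collect()
        })
        .collect()
}
```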
CLI Reference
train
Train one or more variants and save checkpoints + reports.
Outputs per-variant checkpoint directories and training_comparison.json with loss curves, spectral snapshots at each eval interval, and variant rankings.
--resume <dir> continues training from a saved single-variant checkpoint.
compare
Evaluate saved checkpoints on a dataset.
benchmark
Measure operator, block, and full-model latency/memory.
Can compare against a saved baseline with regression gates.
Suites: operator, block, compressor, normalization, model (default: all).
generate
Autoregressive token generation from a checkpoint.
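The project structure lists top-k/top-p sampling in `generation.rs`; as an illustration of the top-k half (a hypothetical helper, not the crate's API):

```rust
/// Top-k filtering for sampling: keep the k largest logits (k >= 1) and
/// mask the rest to -inf, so softmax assigns them zero probability.
fn top_k_filter(logits: &[f64], k: usize) -> Vec<f64> {
    let mut sorted: Vec<f64> = logits.to_vec();
    sorted.sort_by(|a, b| b.partial_cmp(a).unwrap());
    // threshold = k-th largest logit
    let threshold = sorted[k.min(sorted.len()) - 1];
    logits
        .iter()
        .map(|&l| if l >= threshold { l } else { f64::NEG_INFINITY })
        .collect()
}
```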
TUI Explorer
An interactive terminal UI for visualizing training and benchmark results.
Auto-generates demo data (zero config needed), or loads your own reports.
Navigate with 1-4 / Tab to switch views, j/k to select variants, ? for help, q to quit.
Architecture & Operations
The short version:
- `ddl` is the shipped automation surface for train/compare/benchmark/generate.
- Checkpoints are explicit directories with a manifest, weights, report, optimizer state, and optional `best_validation/` state.
- Invalid evaluation inputs are expected to fail with typed library or CLI errors rather than degrade into placeholder metrics.
For the full component map, invariants, validation commands, and failure-mode runbook, see docs/ARCHITECTURE.md.
Configuration Reference
DdlConfig
| Field | Default | Description |
|---|---|---|
| d_model | required | Hidden dimension. Must be divisible by num_heads. |
| num_layers | required | Number of transformer blocks. |
| num_heads | required | Number of attention heads. |
| vocab_size | required | Token vocabulary size. |
| d_value | 1 | Delta state rank. 1 = vector (rank-1). >1 = matrix state. |
| head_dim | d_model/heads | Per-head dimension. |
| mlp_hidden | ceil(8d/3) | SwiGLU intermediate dimension. |
| mapping | KMap | Branch mapping: KMap or VMap. |
| beta_init | 0.5 | Initial beta value. Must be in (0, 2). |
| beta_single_linear | true | Single-linear beta gate. false = two-layer. |
| k_eps | 1e-6 | k-direction normalization epsilon. |
| compression | TokenConv | TokenConv or ChannelConv. |
| shortconv_kernel_size | 4 | Causal depthwise conv kernel size. |
| embed_conv | false | Enable embedding expansion convolution. |
| max_seq_len | 128 | Maximum sequence length (RoPE). |
| rope_theta | 10000.0 | RoPE frequency base. |
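Two of the derived defaults above are easy to express directly; these helpers are illustrative, not the crate's `config.rs`:

```rust
/// head_dim default: d_model / num_heads, valid only when it divides evenly
/// (the table requires d_model divisible by num_heads).
fn head_dim(d_model: usize, num_heads: usize) -> Option<usize> {
    (d_model % num_heads == 0).then(|| d_model / num_heads)
}

/// mlp_hidden default: ceil(8 * d_model / 3), via integer ceiling division.
fn default_mlp_hidden(d_model: usize) -> usize {
    (8 * d_model + 2) / 3
}
```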
TrainingConfig
| Field | Default | Description |
|---|---|---|
| max_steps | 32 | Total training steps per phase. |
| eval_interval | 4 | Steps between validation + spectral snapshots. |
| learning_rate | 1e-3 | Peak learning rate. |
| min_learning_rate | 0.0 | Cosine decay floor. |
| warmup_steps | 0 | Linear warmup steps. |
| weight_decay | 0.1 | AdamW weight decay. |
| beta1 / beta2 | 0.9 / 0.95 | AdamW momentum coefficients. |
| epsilon | 1e-5 | AdamW epsilon. |
| grad_clip | 1.0 | Gradient norm clipping threshold. |
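Putting the schedule fields together: a cosine-decay-with-warmup curve consistent with the table (an illustrative sketch; `training.rs` may handle edge cases differently):

```rust
/// Learning rate at `step`: linear ramp over `warmup` steps, then cosine
/// decay from `lr` down to `min_lr` by `max_steps`.
fn lr_at(step: usize, max_steps: usize, warmup: usize, lr: f64, min_lr: f64) -> f64 {
    if step < warmup {
        // linear warmup from lr/warmup up to lr
        return lr * (step + 1) as f64 / warmup as f64;
    }
    // cosine decay over the remaining steps
    let t = (step - warmup) as f64 / (max_steps - warmup).max(1) as f64;
    min_lr + 0.5 * (lr - min_lr) * (1.0 + (std::f64::consts::PI * t).cos())
}
```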
Library Usage
The public types (configs, model variants, training, checkpointing, and evaluation) are re-exported from the crate root; see the crate's rustdoc for the exact `use` paths.
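As a concrete example of one building block, the k-branch's unit normalization with `k_eps` (default 1e-6 per the configuration table) can be sketched as follows — illustrative, not the crate's exact API:

```rust
/// Unit-normalize a raw direction vector, guarding against division by a
/// near-zero norm with a small epsilon (cf. `k_eps` in DdlConfig).
fn normalize_k(w: &[f64], eps: f64) -> Vec<f64> {
    let norm = w.iter().map(|x| x * x).sum::<f64>().sqrt().max(eps);
    w.iter().map(|x| x / norm).collect()
}
```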
Project Structure
src/
delta_operator.rs Core rank-1 spectral operator: x' = x - beta * k(k^T x)
delta_res_block.rs Residual update: x' = x + beta * k * (v - k^T x)
beta_branch.rs Learnable beta in (0, 2) via sigmoid scaling
k_branch.rs Unit-normalized projection direction
v_branch.rs Scalar and matrix target value branches
transformer.rs DDL transformer block and stack
baseline.rs Standard transformer (no DDL) for comparison
variant.rs 6 model variants and config resolution
spectral.rs Regime classification, per-layer diagnostics, collectors
training.rs Training loop with cosine-warmup LR, spectral snapshots
checkpoint.rs Checkpoint save/load with manifest and optimizer state
eval.rs Variant evaluation and comparison ranking
benchmark.rs Timing, memory profiling, and regression gates
generation.rs Autoregressive generation with top-k/top-p sampling
config.rs Model configuration and validation
data.rs Token batching and dataset construction
attention.rs Multi-head attention with RoPE
mlp.rs SwiGLU feed-forward
compressor.rs Token-conv and channel-conv state compression
bin/ddl/ CLI: train, compare, benchmark, generate
bin/ddl-tui/ Interactive terminal UI (ratatui)
tests/ Integration tests for all modules
examples/ End-to-end demos (train_tiny, compare_models, spectral_report, ...)
fixtures/ Benchmark regression baselines and sample data
Stack
| Layer | Technology |
|---|---|
| Language | Rust (2024 edition) |
| ML framework | Burn 0.20 |
| Backend | ndarray + autodiff (CPU) |
| Serialization | serde / serde_json |
| TUI | ratatui + crossterm (optional, --features tui) |
Development
Validation commands that are expected to pass today:
Coverage is enforced in CI with cargo-llvm-cov over the library/test contract, excluding
src/bin/ entrypoints that currently only have smoke coverage. Benchmark policy is contract-based:
CI verifies benchmark report generation and regression-gate behavior, but does not yet enforce
absolute latency budgets on hosted runners.
See CONTRIBUTING.md for the contributor workflow.
Known Limitations
- CPU/ndarray backend only; no GPU path is enabled in the manifest.
- Training and evaluation workflows assume local, in-memory token datasets rather than streaming corpora.
- Benchmark memory numbers are estimates intended for regression tracking, not allocator-precise telemetry.
- The crate is pre-1.0; public APIs may change as typed error handling and docs continue to harden.
Roadmap
- Contract hardening: finish typed user-boundary errors and expand rustdoc coverage across the public API.
- Research workflow depth: improve dataset ergonomics, artifact metadata, and richer diagnostics.
- Backend expansion: evaluate GPU-enabled Burn backends only after CPU semantics remain stable.
Changelog
User-visible changes are tracked in CHANGELOG.md.
Help
- File bugs and contract mismatches in the GitHub issue tracker: https://github.com/AbdelStark/ddl-rs/issues
- Check the CI workflow contract in .github/workflows/ci.yml.