deep-delta-learning
A Rust implementation of Deep Delta Learning (DDL) on Burn — rank-deficient residual operators with learnable spectral properties for transformer language models.
x' = x + beta * k * (v - k^T x)
         \____________________/
          correction toward target v
          along projection direction k,
          scaled by learnable strength beta in (0, 2)
DDL replaces the standard residual connection x' = x + f(x) with a geometrically motivated update that projects the hidden state toward a learned target along a learned direction. The scalar beta controls the spectral regime of each layer — from near-identity (beta -> 0) through projection (beta -> 1) to reflection (beta -> 2) — giving the model a continuous knob between "change nothing" and "invert the residual stream."
Status
As of 2026-03-17, deep-delta-learning is alpha. It is suitable for CPU-based research
prototypes, CLI experiments, checkpoint roundtrips, and benchmark/report generation. It is not
yet suitable for production inference services, large-scale training, or a semver-stable public
API. Known limitations include CPU/ndarray-only execution, in-memory toy dataset workflows, and
expected API churn before 1.0. No specific breaking change is scheduled next, but pre-1.0 API
cleanup should be expected.
See CHANGELOG.md for user-visible changes and docs/ARCHITECTURE.md for the current system/runbook view.
Installation
Requires Rust 2024 edition. CPU/ndarray backend only (no GPU required).
30-Second Demo
- Train all 6 variants, compare, and benchmark — one command.
- Interactive TUI explorer (generates demo data if run with no args).
The Delta Operator
The core primitive is a rank-1 spectral update. Given hidden state x in R^d, direction k in R^d (unit-normalized), target scalar v in R, and mixing coefficient beta in (0, 2):
Vector form (d_value = 1)
─────────────────────────
proj = k^T x
x' = x + beta * k * (v - proj)
Matrix form (d_value > 1), state X in [d, d_value]
─────────────────────────
P = k^T X                    [1, d_value]
X' = X + beta * k * (V - P)  [d, d_value]
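In plain Rust (no Burn tensors), the vector-form update is only a few lines. This is an illustrative sketch of the math above, not the crate's actual `delta_res_block.rs` implementation:

```rust
/// Rank-1 delta residual update: x' = x + beta * k * (v - k^T x).
/// `k` is assumed unit-normalized; `beta` should lie in (0, 2).
fn delta_update(x: &[f64], k: &[f64], v: f64, beta: f64) -> Vec<f64> {
    // proj = k^T x, the component of x along the projection direction
    let proj: f64 = k.iter().zip(x).map(|(ki, xi)| ki * xi).sum();
    // move x toward the target v along k, scaled by beta
    x.iter()
        .zip(k)
        .map(|(xi, ki)| xi + beta * ki * (v - proj))
        .collect()
}
```

With beta = 1 this is a pure projection: the k-component of x' equals v exactly, and the orthogonal part of x is untouched.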
Spectral properties of the induced linear map A = I - beta * k k^T:
| Property | Formula | Interpretation |
|---|---|---|
| k-eigenvalue | 1 - beta | Scaling along projection direction |
| Orthogonal eigenvalues | 1 | Unchanged in the nullspace of k |
| Determinant | 1 - beta | Volume scaling (vector state) |
| Lifted determinant | (1 - beta)^d_value | Volume scaling (matrix state) |
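The first two rows can be checked numerically by applying the homogeneous map A = I - beta * k k^T directly (illustrative helper, not crate API):

```rust
/// Apply A = I - beta * k k^T to x (the homogeneous part of the delta update).
fn apply_a(x: &[f64], k: &[f64], beta: f64) -> Vec<f64> {
    let proj: f64 = k.iter().zip(x).map(|(ki, xi)| ki * xi).sum();
    x.iter().zip(k).map(|(xi, ki)| xi - beta * ki * proj).collect()
}
```

Applying `apply_a` to k itself scales it by 1 - beta, while any vector orthogonal to k passes through unchanged — exactly the two eigenvalue rows in the table.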
The eigenvalue 1 - beta defines five spectral regimes:
beta:  0.0       0.3       0.7       1.3       1.7      2.0
        |---------|---------|---------|---------|--------|
           Near     Inter-     Near      Near     Strong
         Identity  polating Projection Reflection Reflection
         "keep x"            "project"            "invert"
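A classifier over these regimes might look like the following; the cut points are read off the diagram above, and the crate's `spectral.rs` may use different thresholds or labels:

```rust
/// Map beta to one of the five spectral regimes sketched above.
/// Boundaries (0.3 / 0.7 / 1.3 / 1.7) are taken from the diagram.
fn regime(beta: f64) -> &'static str {
    match beta {
        b if b < 0.3 => "Near Identity",
        b if b < 0.7 => "Interpolating",
        b if b < 1.3 => "Near Projection",
        b if b < 1.7 => "Near Reflection",
        _ => "Strong Reflection",
    }
}
```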
Architecture
A DDL transformer block replaces the standard x + Attn(x) and x + MLP(x) with:
┌─────────────────────┐
│ Input x │
└──────────┬──────────┘
│
┌──────────▼──────────┐
│ RMSNorm + MHA │──── backbone_out
└──────────┬──────────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ k-branch │ │ v-branch │ │ beta-branch │
│ k = f(out) │ │ v = g(out) │ │ beta = h(x) │
│ ||k|| = 1 │ │ │ │ 0 < beta < 2│
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
└─────────┬───────┘ │
│ │
┌───────▼──────────────────────────▼───┐
│ x' = x + beta * k * (v - k^T x) │
│ delta residual update │
└──────────────┬───────────────────────┘
│
┌──────────▼──────────┐
│ RMSNorm + SwiGLU │──── repeat for MLP sublayer
└──────────┬──────────┘
│
┌──────────▼──────────┐
│ Output x' │
└─────────────────────┘
Each transformer block applies the delta residual update twice — once after attention, once after the MLP — giving 2 * num_layers total beta/k/v branches.
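The beta-branch constraint 0 < beta < 2 is described in `beta_branch.rs` as "sigmoid scaling"; one standard way to realize it is a scaled logistic gate (a sketch under that assumption, not the crate's exact code):

```rust
/// Beta gate: squash a raw pre-activation z into (0, 2) via a scaled sigmoid.
/// z = 0 gives beta = 1 (the projection regime); large |z| approaches the
/// open-interval endpoints without ever reaching them.
fn beta_gate(z: f64) -> f64 {
    2.0 / (1.0 + (-z).exp())
}
```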
Model Variants
Six variants explore the design space along three axes: state rank (scalar vs matrix), compression (token-conv vs channel-conv), and embedding expansion.
| Variant | CLI slug | d_value | Compression | Embed Conv | Description |
|---|---|---|---|---|---|
| Baseline | baseline | — | — | — | Standard transformer (no DDL) |
| DDL Vector | ddl | 1 | TokenConv | No | Rank-1 delta operator. Cheapest DDL variant. |
| DDL Matrix | ddl-tokenconv | d_value | TokenConv | No | Matrix-state delta with token compression. |
| DDL Matrix+EC | ddl-ec | d_value | TokenConv | Yes | + embedding expansion convolution. |
| DDL Channel | ddl-cc | d_value | ChannelConv | No | Matrix-state with channel compression. |
| DDL Channel+EC | ddl-cc-ec | d_value | ChannelConv | Yes | Full variant: channel compression + embed conv. |
Baseline uses a standard residual x + f(x). All DDL variants replace it with the delta update. The matrix variants (d_value > 1) maintain a [batch, seq, d_model, d_value] state tensor, giving each position a richer internal representation at the cost of O(d_value) more computation per layer.
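The matrix-state update generalizes the vector form column-wise; a plain-Rust sketch (illustrative only, with X stored as d rows of d_value columns):

```rust
/// Matrix-state delta update (d_value > 1): each of the d_value columns of X
/// gets its own target in `v`, all sharing one direction k and one beta.
fn delta_update_matrix(x: &[Vec<f64>], k: &[f64], v: &[f64], beta: f64) -> Vec<Vec<f64>> {
    let d_value = v.len();
    // proj[j] = k^T X[:, j], one projection per column
    let proj: Vec<f64> = (0..d_value)
        .map(|j| k.iter().zip(x).map(|(ki, row)| ki * row[j]).sum())
        .collect();
    x.iter()
        .zip(k)
        .map(|(row, ki)| {
            row.iter()
                .enumerate()
                .map(|(j, &xij)| xij + beta * ki * (v[j] - proj[j]))
                .collect()
        })
        .collect()
}
```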
CLI Reference
train
Train one or more variants and save checkpoints + reports.
Outputs per-variant checkpoint directories and training_comparison.json with loss curves, spectral snapshots at each eval interval, and variant rankings.
--resume <dir> continues training from a saved single-variant checkpoint.
compare
Evaluate saved checkpoints on a dataset.
benchmark
Measure operator, block, and full-model latency/memory.
Can compare against a saved baseline with regression gates.
Suites: operator, block, compressor, normalization, model (default: all).
generate
Autoregressive token generation from a checkpoint.
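The project structure lists top-k/top-p sampling in `generation.rs`; as an illustration of the top-k half (a hypothetical helper, not the crate's API):

```rust
/// Top-k filtering for sampling: keep the k largest logits (k >= 1) and
/// mask the rest to -inf, so softmax assigns them zero probability.
fn top_k_filter(logits: &[f64], k: usize) -> Vec<f64> {
    let mut sorted: Vec<f64> = logits.to_vec();
    sorted.sort_by(|a, b| b.partial_cmp(a).unwrap());
    // threshold = k-th largest logit
    let threshold = sorted[k.min(sorted.len()) - 1];
    logits
        .iter()
        .map(|&l| if l >= threshold { l } else { f64::NEG_INFINITY })
        .collect()
}
```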
TUI Explorer
An interactive terminal UI for visualizing training and benchmark results.
Auto-generates demo data (zero config needed), or loads your own reports.
Navigate with 1-4 / Tab to switch views, j/k to select variants, ? for help, q to quit.
Architecture & Operations
The short version:
- `ddl` is the shipped automation surface for train/compare/benchmark/generate.
- Checkpoints are explicit directories with a manifest, weights, report, optimizer state, and optional `best_validation/` state.
- Invalid evaluation inputs are expected to fail with typed library or CLI errors rather than degrade into placeholder metrics.
For the full component map, invariants, validation commands, and failure-mode runbook, see docs/ARCHITECTURE.md.
Configuration Reference
DdlConfig
| Field | Default | Description |
|---|---|---|
| d_model | required | Hidden dimension. Must be divisible by num_heads. |
| num_layers | required | Number of transformer blocks. |
| num_heads | required | Number of attention heads. |
| vocab_size | required | Token vocabulary size. |
| d_value | 1 | Delta state rank. 1 = vector (rank-1). >1 = matrix state. |
| head_dim | d_model/heads | Per-head dimension. |
| mlp_hidden | ceil(8d/3) | SwiGLU intermediate dimension. |
| mapping | KMap | Branch mapping: KMap or VMap. |
| beta_init | 0.5 | Initial beta value. Must be in (0, 2). |
| beta_single_linear | true | Single-linear beta gate. false = two-layer. |
| k_eps | 1e-6 | k-direction normalization epsilon. |
| compression | TokenConv | TokenConv or ChannelConv. |
| shortconv_kernel_size | 4 | Causal depthwise conv kernel size. |
| embed_conv | false | Enable embedding expansion convolution. |
| max_seq_len | 128 | Maximum sequence length (RoPE). |
| rope_theta | 10000.0 | RoPE frequency base. |
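Two of the derived defaults above are easy to express directly; these helpers are illustrative, not the crate's `config.rs`:

```rust
/// head_dim default: d_model / num_heads, valid only when it divides evenly
/// (the table requires d_model divisible by num_heads).
fn head_dim(d_model: usize, num_heads: usize) -> Option<usize> {
    (d_model % num_heads == 0).then(|| d_model / num_heads)
}

/// mlp_hidden default: ceil(8 * d_model / 3), via integer ceiling division.
fn default_mlp_hidden(d_model: usize) -> usize {
    (8 * d_model + 2) / 3
}
```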
TrainingConfig
| Field | Default | Description |
|---|---|---|
| max_steps | 32 | Total training steps per phase. |
| eval_interval | 4 | Steps between validation + spectral snapshots. |
| learning_rate | 1e-3 | Peak learning rate. |
| min_learning_rate | 0.0 | Cosine decay floor. |
| warmup_steps | 0 | Linear warmup steps. |
| weight_decay | 0.1 | AdamW weight decay. |
| beta1 / beta2 | 0.9 / 0.95 | AdamW momentum coefficients. |
| epsilon | 1e-5 | AdamW epsilon. |
| grad_clip | 1.0 | Gradient norm clipping threshold. |
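Putting the schedule fields together: a cosine-decay-with-warmup curve consistent with the table (an illustrative sketch; `training.rs` may handle edge cases differently):

```rust
/// Learning rate at `step`: linear ramp over `warmup` steps, then cosine
/// decay from `lr` down to `min_lr` by `max_steps`.
fn lr_at(step: usize, max_steps: usize, warmup: usize, lr: f64, min_lr: f64) -> f64 {
    if step < warmup {
        // linear warmup from lr/warmup up to lr
        return lr * (step + 1) as f64 / warmup as f64;
    }
    // cosine decay over the remaining steps
    let t = (step - warmup) as f64 / (max_steps - warmup).max(1) as f64;
    min_lr + 0.5 * (lr - min_lr) * (1.0 + (std::f64::consts::PI * t).cos())
}
```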
Library Usage
The public types (configs, model variants, training, checkpointing, and evaluation) are re-exported from the crate root; see the crate's rustdoc for the exact `use` paths.
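As a concrete example of one building block, the k-branch's unit normalization with `k_eps` (default 1e-6 per the configuration table) can be sketched as follows — illustrative, not the crate's exact API:

```rust
/// Unit-normalize a raw direction vector, guarding against division by a
/// near-zero norm with a small epsilon (cf. `k_eps` in DdlConfig).
fn normalize_k(w: &[f64], eps: f64) -> Vec<f64> {
    let norm = w.iter().map(|x| x * x).sum::<f64>().sqrt().max(eps);
    w.iter().map(|x| x / norm).collect()
}
```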
Project Structure
src/
delta_operator.rs Core rank-1 spectral operator: x' = x - beta * k(k^T x)
delta_res_block.rs Residual update: x' = x + beta * k * (v - k^T x)
beta_branch.rs Learnable beta in (0, 2) via sigmoid scaling
k_branch.rs Unit-normalized projection direction
v_branch.rs Scalar and matrix target value branches
transformer.rs DDL transformer block and stack
baseline.rs Standard transformer (no DDL) for comparison
variant.rs 6 model variants and config resolution
spectral.rs Regime classification, per-layer diagnostics, collectors
training.rs Training loop with cosine-warmup LR, spectral snapshots
checkpoint.rs Checkpoint save/load with manifest and optimizer state
eval.rs Variant evaluation and comparison ranking
benchmark.rs Timing, memory profiling, and regression gates
generation.rs Autoregressive generation with top-k/top-p sampling
config.rs Model configuration and validation
data.rs Token batching and dataset construction
attention.rs Multi-head attention with RoPE
mlp.rs SwiGLU feed-forward
compressor.rs Token-conv and channel-conv state compression
bin/ddl/ CLI: train, compare, benchmark, generate
bin/ddl-tui/ Interactive terminal UI (ratatui)
tests/ Integration tests for all modules
examples/ End-to-end demos (train_tiny, compare_models, spectral_report, ...)
fixtures/ Benchmark regression baselines and sample data
Stack
| Layer | Technology |
|---|---|
| Language | Rust (2024 edition) |
| ML framework | Burn 0.20 |
| Backend | ndarray + autodiff (CPU) |
| Serialization | serde / serde_json |
| TUI | ratatui + crossterm (optional, --features tui) |
Development
Validation commands that are expected to pass today:
Coverage is enforced in CI with cargo-llvm-cov over the library/test contract, excluding
src/bin/ entrypoints that currently only have smoke coverage. Benchmark policy is contract-based:
CI verifies benchmark report generation and regression-gate behavior, but does not yet enforce
absolute latency budgets on hosted runners.
See CONTRIBUTING.md for the contributor workflow.
Known Limitations
- CPU/ndarray backend only; no GPU path is enabled in the manifest.
- Training and evaluation workflows assume local, in-memory token datasets rather than streaming corpora.
- Benchmark memory numbers are estimates intended for regression tracking, not allocator-precise telemetry.
- The crate is pre-1.0; public APIs may change as typed error handling and docs continue to harden.
Roadmap
- Contract hardening: finish typed user-boundary errors and expand rustdoc coverage across the public API.
- Research workflow depth: improve dataset ergonomics, artifact metadata, and richer diagnostics.
- Backend expansion: evaluate GPU-enabled Burn backends only after CPU semantics remain stable.
Changelog
User-visible changes are tracked in CHANGELOG.md.
Help
- File bugs and contract mismatches in the GitHub issue tracker: https://github.com/AbdelStark/ddl-rs/issues
- Check the CI workflow contract in .github/workflows/ci.yml.