```
.__ .__ __ .__ .__ .__
│__│______│__│╱ │_│ │__ ___.__.│ │ │ │
│ ╲_ __ ╲ ╲ __╲ │ < │ ││ │ │ │
│ ││ │ ╲╱ ││ │ │ Y ╲___ ││ │_│ │__
│__││__│ │__││__│ │___│ ╱ ____││____╱____╱
╲╱╲╱
```
[crates.io](https://crates.io/crates/irithyll) · [docs.rs](https://docs.rs/irithyll) · [CI](https://github.com/evilrat420/irithyll/actions) · [repository](https://github.com/evilrat420/irithyll) · [Rust 1.75+](https://blog.rust-lang.org/2023/12/28/Rust-1.75.0.html)
**Streaming machine learning in Rust.** Gradient-boosted trees, neural streaming architectures, kernel methods, linear models — all behind a single `StreamingLearner` trait, all learning one sample at a time, all running in O(1) memory.
---
## What it is
irithyll is a streaming ML library for the case where data arrives in order and never stops. There is no training set. There is no batch loop. Every sample updates the model and is then released — no buffer, no replay. The same idea ties together gradient-boosted trees, recurrent state-space models, kernel regression, attention, spiking networks, and every preprocessor that feeds them. They all wear the same two-method coat: `train_one(features, target, weight)` and `predict(features) -> f64`. A `Box<dyn StreamingLearner>` is a fully typed model.
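A minimal sketch of that contract, assuming `&[f64]` features and an `f64` sample weight (the authoritative definition is on docs.rs):

```rust
// Sketch only: method names come from the text above; the exact signatures
// are an assumption, not the published API.
pub trait StreamingLearner {
    /// Consume one sample, update internal state, release the sample.
    fn train_one(&mut self, features: &[f64], target: f64, weight: f64);
    /// Point prediction for one feature vector.
    fn predict(&self, features: &[f64]) -> f64;
}
```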
The library is structured as two crates that share a vocabulary but not their constraints. The full crate (`irithyll`) does training, async ingestion, drift detection, AutoML — everything that benefits from `std`. The packed crate (`irithyll-core`) is `#![no_std]`, runs zero-allocation inference on bare metal, and serializes a trained tree as 12-byte nodes that traverse branch-free. Train on the cloud, export, run on a Cortex-M0+. The boundary is hard and tested against `thumbv6m-none-eabi`.
It is a deliberate library — every threshold derives from a paper, every neural readout is bounded before it touches the linear head, every config field round-trips through a builder that validates rather than accepts. Where the literature gives an option, the option becomes a feature flag, not a default. The aesthetic is a frozen city: cold, ordered, lit from inside.
The library primarily serves four cases: edge inference at sample rate, online forecasting under concept drift, embedded learning where the dataset would never fit in RAM, and research benches where a new streaming architecture lands beside `SGBT` and is held to the same throughput and accuracy floor.
## Quick Start
```sh
cargo add irithyll
```
Four snippets, in order of how a streaming pipeline grows.
**The smallest useful thing — normalize, boost, predict.**
```rust
use irithyll::{pipe, normalizer, sgbt, StreamingLearner};
let mut model = pipe(normalizer()).learner(sgbt(50, 0.01));
model.train(&[100.0, 0.5, 42.0], 3.14);
let pred = model.predict(&[100.0, 0.5, 42.0]);
```
**Race three model families against each other — let the data choose.**
```rust
use irithyll::{automl::{AutoTuner, Factory}, StreamingLearner};
let mut tuner = AutoTuner::builder()
.add_factory(Factory::sgbt(5))
.add_factory(Factory::mamba(5))
.add_factory(Factory::esn())
.use_drift_rerace(true)
.build();
tuner.train(&[1.0, 2.0, 3.0], 6.0);
let pred = tuner.predict(&[1.0, 2.0, 3.0]);
```
**Mix architectures inside a single mixture-of-experts — heterogeneous experts welcome.**
```rust
use irithyll::{moe::NeuralMoE, sgbt, esn, StreamingLearner};
let mut moe = NeuralMoE::builder()
.expert(sgbt(50, 0.01))
.expert_with_warmup(esn(100, 0.9), 50)
.top_k(2)
.build();
```
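The built MoE is itself a `StreamingLearner`, so it trains and predicts the same way as the other snippets (continuing the example above):

```rust
moe.train(&[0.3, 1.7, -0.2], 4.2);
let pred = moe.predict(&[0.3, 1.7, -0.2]);
```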
**Turn any regressor into a classifier — `binary_classifier` and `multiclass_classifier` wrap a `StreamingLearner` with bipolar one-vs-rest heads.**
```rust
use irithyll::{sgbt, binary_classifier, StreamingLearner};
let mut clf = binary_classifier(sgbt(50, 0.05));
clf.train(&[1.5, -0.3, 2.1], 1.0); // labels are 0.0 / 1.0
let prob_positive = clf.predict(&[1.5, -0.3, 2.1]);
```
Composition is the point. Anything that implements `StreamingLearner` slots into a pipeline, an MoE expert, an AutoML candidate, a projection wrapper, or a classification head. The trait is the contract; the rest is LEGO.
For the longer ergonomics story — pipeline composition, AutoML tournaments, drift wiring, embedded deployment — see [`docs/USAGE.md`](docs/USAGE.md).
## Design Principles
The library has opinions. They are stable across releases and they shape every model.
**One sample at a time, every time.** No mini-batches hidden inside `train_one`. No "warm up the optimizer with a buffer first". Streaming-only models stay streaming. Architectures that originally required offline training (TTT, KAN, Mamba) are reimplemented with online updates that converge sample-by-sample — and tested for it.
**O(1) memory per model.** State size is a function of the model, not the data seen. A model that has trained on a billion samples occupies the same memory as one that has trained on a thousand. Drift detectors are bounded ring buffers; histograms have fixed bin counts; subspace trackers carry rank-`k` projections, not covariance matrices.
**Bounded readouts before linear heads.** Every neural model that feeds a recursive least squares head bounds its features first — `tanh`, `sigmoid`, L2-normalize, clamp. Unbounded features explode the RLS weights silently. This is non-negotiable; new neural architectures land with the bounding step or they don't land.
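A sketch of the kind of step that rule mandates; the helper below is hypothetical, not irithyll API:

```rust
// Squash reservoir / SSM / attention features into a bounded range before
// they reach the recursive least squares readout. tanh, sigmoid, L2-normalize,
// or clamp all satisfy the rule; which one a given model uses is its own choice.
fn bound_features(features: &mut [f64]) {
    for x in features.iter_mut() {
        *x = x.tanh();
    }
}
```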
**Constants come from theory, not from grid search.** Bernstein bounds for promotion tests, the Hoeffding inequality for split decisions, the PAST update for streaming PCA. Where a paper gives a constant, the constant cites the paper. Where it doesn't, the library prefers a self-calibrating online statistic over a magic number.
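The Hoeffding bound behind those split decisions (Domingos & Hulten 2000) is small enough to state inline; this helper is illustrative, not part of the public API:

```rust
/// With probability 1 - delta, the observed mean of n samples of a variable
/// with range `range` lies within the returned epsilon of the true mean.
fn hoeffding_bound(range: f64, delta: f64, n: usize) -> f64 {
    (range * range * (1.0 / delta).ln() / (2.0 * n as f64)).sqrt()
}

// A split is accepted once the gain gap between the best and second-best
// candidate exceeds hoeffding_bound(range, delta, n_seen).
```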
**Validation is a builder's job.** Every public `Config` carries a `Builder` that returns `Result<_, ConfigError>`. Bounds are checked before the model is constructed; impossible configurations don't get the chance to misbehave.
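A minimal sketch of what that buys at the call site; whether `n_steps(0)` specifically is a rejected value is an assumption, the shape of the check is the point:

```rust
use irithyll::SGBTConfig;

// build() validates and returns Result<_, ConfigError>, so an impossible
// configuration fails here, before a model is ever constructed.
// (Assumption for illustration: zero boosting stages is rejected.)
match SGBTConfig::builder().n_steps(0).build() {
    Ok(cfg) => { /* SGBT::new(cfg) */ }
    Err(e) => eprintln!("rejected config: {e:?}"),
}
```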
**Forbid unsafe in the main crate.** `irithyll` has `#![forbid(unsafe_code)]` at its root — the entire training-side surface is safe Rust. `irithyll-core` has localized `unsafe` for two earned reasons: zero-copy view parsing of the packed binary format and AVX2 SIMD intrinsics behind the `simd-avx2` feature. Each block carries a safety comment that names its precondition; nothing else is `unsafe`.
## Workspace
| Crate | What it does | `no_std` |
|-------|--------------|----------|
| **`irithyll`** | Training, streaming algorithms, pipelines, async I/O, AutoML | No |
| **`irithyll-core`** | Packed inference engine — 12-byte nodes, branch-free traversal, zero-alloc | Yes |
| **`irithyll-python`** | PyO3 bindings — `AutoTuner`, `ProjectedLearner`, factory variants | No |
`irithyll-core` cross-compiles for bare-metal targets — `thumbv6m-none-eabi` (Cortex-M0+), `thumbv7m-none-eabi` (M3), and `thumbv7em-none-eabi` (M4) all green in CI. Its only dependency is `libm` for soft-float math; everything else (SIMD, parallel, serde) is opt-in. Train with the full crate, export to packed format, run inference on the microcontroller — same predictions, no surprises.
## Models
irithyll's model lineup spans four tiers. Production models are the ones you reach for first: streaming gradient-boosted trees with drift-driven tree replacement, recursive least squares with confidence intervals, kernel RLS, Mondrian forests, classical baselines. The neural tier is where the library has spent most of its recent design budget — selective state-space models, test-time-trained recurrent networks, Kolmogorov-Arnold networks, spiking networks, and a streaming linear-attention layer that exposes twelve distinct attention modes (RetNet, Hawk/Griffin, GLA, GLAVector, DeltaNet, GatedDeltaNet, RWKV, RWKV-7, mLSTM, DeltaProduct, HGRN2, log-linear). Specialized tools cover conformal prediction, anomaly detection, online projection learning, packed inference, and TreeSHAP. Ensembles compose all of the above.
Every algorithm implements [`StreamingLearner`](https://docs.rs/irithyll/latest/irithyll/trait.StreamingLearner.html). Every neural model is online-trainable end-to-end — no offline pretraining required. None of the readouts are unbounded; every feature feeding a recursive least squares head is squashed, normalized, or clamped, because that is the difference between a streaming model and one that diverges quietly on the first heavy-tailed sample.
| Tier | What it contains |
|------|------------------|
| **Production** | SGBT family (`SGBT`, `DistributionalSGBT`, `BaggedSGBT`, `MulticlassSGBT`, `ParallelSGBT`), `RecursiveLeastSquares`, `KRLS`, Mondrian forests, Hoeffding trees, Gaussian Naive Bayes, linear / polynomial models |
| **Neural** | Mamba family (V1 / V3 / Mamba-3), Echo State Networks, Next-Gen Reservoir Computing, `StreamingTTT`, `StreamingKAN` / T-KAN, AGMP, mGRADE, HGRN2, sLSTM, SpikeNet (e-prop + surrogate gradients), `StreamingAttentionModel` (12 modes) |
| **Specialized** | Packed inference (`irithyll-core`), conformal prediction with PID control, anomaly detection, `ProjectedLearner` (online subspace tracking via PAST), TreeSHAP |
| **Ensemble** | `NeuralMoE` (heterogeneous experts, top-k routing, drift-aware), streaming AutoML (`AutoTuner`, tournament racing, drift re-racing, complexity-adjusted elimination) |
Classification works on top of regression: `binary_classifier(model)` and `multiclass_classifier(model, k)` wrap any `StreamingLearner` with bipolar one-vs-rest heads.
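The multiclass wrapper follows the same pattern as the binary snippet in the Quick Start; the label encoding and what `predict` returns here are assumptions:

```rust
use irithyll::{sgbt, multiclass_classifier, StreamingLearner};

// k = 3 classes, bipolar one-vs-rest heads over the same base recipe.
let mut clf = multiclass_classifier(sgbt(50, 0.05), 3);
clf.train(&[0.2, 1.1, -0.7], 2.0);          // assumption: labels are 0.0 ..= k-1
let class = clf.predict(&[0.2, 1.1, -0.7]); // assumption: returns the argmax class
```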
For per-model architecture, paper citations, when-to-use guidance, math summaries, and config references, see [`MODELS.md`](MODELS.md).
## Drift Handling
The world's distribution shifts; streaming models that don't notice are streaming models that lie. irithyll treats drift as a first-class signal, not a recovery story.
Three detectors ship in `irithyll::drift`: **ADWIN** (Bifet & Gavaldà 2007) for adaptive windowing, **DDM** (Gama et al. 2004) for the warning-and-drift two-stage state machine, and **Page-Hinkley** for cumulative-deviation tests. They expose a single `update(error) -> DriftState` interface, plug into any model that takes a `Box<dyn DriftDetector>`, and respond to `adjust_config()` calls when AutoML wants to widen the learning rate during a re-race.
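A sketch of that interface in use; the concrete type name, constructor, and any `DriftState` variant beyond `Warning` and `Drift` are assumptions (the real definitions live in `irithyll::drift`):

```rust
use irithyll::drift::{Adwin, DriftState};

// One detector watching a stream of prediction errors.
let residual_stream = [0.08_f64, 0.11, 0.95, 1.40];
let mut detector = Adwin::default();
for error in residual_stream {
    match detector.update(error) {
        DriftState::Drift => { /* replace a tree, or trigger an AutoML re-race */ }
        DriftState::Warning => { /* start warming up an alternate */ }
        _ => {} // stable: keep training as usual
    }
}
```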
Inside SGBT, drift drives **tree replacement**: each boosting stage carries a detector watching its standardized residual; when drift fires, that stage's tree is replaced with a fresh alternate that warms up in parallel before promotion. The ensemble keeps predicting throughout — there is no rebuild pause.
Inside AutoML, drift drives **re-racing**: the `AutoTuner` re-evaluates challenger configurations against the champion when the residual distribution shifts, with the comparison gated by an empirical Bernstein promotion test (`bernstein_promotion_test` in `automl::racing`) so the champion never flips on noise.
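For orientation, the standard empirical Bernstein deviation bound has the form below; whether `bernstein_promotion_test` uses exactly this variant and these constants is an assumption:

$$
\lvert \hat{\mu} - \mu \rvert \;\le\; \sqrt{\frac{2\,\hat{\sigma}^2 \ln(3/\delta)}{n}} \;+\; \frac{3R\ln(3/\delta)}{n}
$$

where $\hat{\mu}$ and $\hat{\sigma}^2$ are the empirical mean and variance over $n$ residual observations, $R$ is their range, and the bound holds with probability at least $1 - \delta$. Unlike the Hoeffding bound, it tightens with the observed variance.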
## Bare-Metal Deployment
The packed inference path is a deliberate boundary: train with the full crate, export to a 12-byte-per-node binary representation, deserialize on a microcontroller in pure `#![no_std]` (no allocator required), and predict.
```rust
// On the host: train, then export to packed bytes.
use irithyll::{SGBT, SGBTConfig, StreamingLearner};
use irithyll::export_embedded::export_packed;
let mut model = SGBT::new(SGBTConfig::builder().n_steps(50).build().unwrap());
// ... train on a stream ...
let packed_bytes: Vec<u8> = export_packed(&model, /* n_features */ 4);
// Write to flash, ship to device.
```
```rust
// On the device: zero-copy view over the bytes. No std, no allocation in predict.
#![no_std]
use irithyll_core::EnsembleView;
let view = EnsembleView::from_bytes(PACKED_BYTES).unwrap();
let prediction: f32 = view.predict(&[0.5, 1.2, -0.3, 0.1]);
```
Validation happens once in `from_bytes` (magic bytes, child-index bounds, feature-index bounds); after that, prediction is pure pointer arithmetic. Five nodes fit per 64-byte cache line, learning rate is baked into leaf values at export time, and an 8-byte int16-quantized variant (`export_packed_i16` + `QuantizedEnsembleView`) eliminates floats from the inference hot loop entirely. The crate's only dependency is `libm`. CI cross-compiles for all three Cortex-M targets on every commit.
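The quantized path mentioned above follows the same two-step flow; the signatures below are assumptions based on the names in the text (check docs.rs before relying on them):

```rust
// Device side, int16-quantized variant: parse with QuantizedEnsembleView;
// traversal stays branch-free. Sketch only; whether predict takes f32 inputs
// or pre-quantized ones is an assumption.
use irithyll_core::QuantizedEnsembleView;

let view = QuantizedEnsembleView::from_bytes(PACKED_I16_BYTES).unwrap();
let prediction = view.predict(&[0.5, 1.2, -0.3, 0.1]);
```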
## Feature Flags
`irithyll-core`'s default build is pure `no_std` — no allocator, no `std`, just `libm` for soft-float math. Opt-in features (`alloc`, `std`, `serde`, `simd`, `simd-avx2`, `parallel`) extend it as needed; the device-side inference path in the previous section runs in the strictest mode. Neural streaming modules in the main crate compile unconditionally — no flag required. The flags in the table below belong to the main `irithyll` crate.
| Feature | Default | Description |
|---------|---------|-------------|
| `serde-json` | Yes | JSON model serialization |
| `serde-bincode` | No | Compact binary serialization |
| `parallel` | No | Rayon-based parallel tree training (`ParallelSGBT`) |
| `simd` | No | Generic SIMD acceleration |
| `simd-avx2` | No | AVX2 histogram + neural ops (x86_64 only) |
| `kmeans-binning` | No | K-means histogram binning strategy |
| `arrow` | No | Apache Arrow `RecordBatch` integration |
| `parquet` | No | Parquet file I/O |
| `onnx` | No | ONNX model export |
| `neural-leaves` | No | Experimental MLP leaf models |
| `full` | No | Everything above |
## TUI
irithyll ships a terminal dashboard for live monitoring of streaming model state, prequential metrics, and drift events. Mins, maxes, percentile envelopes, drift markers, AutoML leaderboards — rendered with `ratatui`, refreshed at the rate the model trains. It is the cheapest way to feel whether your model is learning.
```bash
# Multi-family demo on a built-in regression benchmark.
irithyll # SGBT on Friedman
irithyll --family kan --bench mackey-glass # KAN on Mackey-Glass chaos
irithyll --family mamba --bench lorenz # Mamba on the Lorenz attractor
# Train your own CSV with the live dashboard. Any of the supported
# families works the same way — swap --model-type to switch.
irithyll train data.csv --tui --model-type sgbt
irithyll train data.csv --tui --model-type kan
irithyll eval data.csv --tui --model-type mamba
```
Built-in benchmarks: `friedman`, `lorenz`, `mackey-glass`, `periodic`, `mqar`, `needle`. Supported families for `--tui`: `sgbt`, `mamba`, `ttt`, `kan`, `esn`, `ngrc`, `spike-net`. Per-feature importance ships for SGBT, KAN, and Linear; the reservoir/SSM/spiking families show a "not exposed" placeholder in the importances tab.
## References
The implementations cite their sources. The list below is the load-bearing core — papers whose math directly shapes a model in irithyll. The complete bibliography (foundations, related work, surveys) lives in [`REFERENCES.md`](REFERENCES.md).
**Streaming Boosting and Trees**
- Gunasekara, Pfahringer, Gomes, Bifet (2024). *Gradient boosted trees for evolving data streams.* Machine Learning, 113, 3325-3352.
- Domingos, Hulten (2000). *Mining high-speed data streams.* KDD 2000. — Hoeffding bound for online splits.
- Bifet, Gavaldà (2007). *Learning from time-changing data with adaptive windowing.* SIAM SDM 2007. — ADWIN.
- Lundberg et al. (2020). *From local explanations to global understanding with explainable AI for trees.* Nature Machine Intelligence, 2, 56-67. — TreeSHAP.
**State-Space Models and Recurrent Networks**
- Gu, Dao (2023). *Mamba: Linear-time sequence modeling with selective state spaces.* arXiv:2312.00752.
- Dao, Gu (2024). *Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality.* arXiv:2405.21060.
- Gu, Gupta, Goel, Ré (2022). *On the parameterization and initialization of diagonal state space models.* NeurIPS 2022. — S4D-Inv.
- Beck et al. (2024). *xLSTM: Extended long short-term memory.* NeurIPS 2024. — mLSTM / sLSTM.
**Streaming Linear Attention**
- Yang, Wang, Shen, Panda, Kim (2023). *Gated linear attention transformers with hardware-efficient training.* arXiv:2312.06635. — GLA.
- Yang et al. (2024). *Gated Delta Networks: Improving Mamba2 with Delta Rule.* arXiv:2412.06464. — DeltaNet / GatedDeltaNet.
- Sun et al. (2023). *Retentive network: A successor to transformer for large language models.* arXiv:2307.08621. — RetNet.
- De et al. (2024). *Griffin: Mixing gated linear recurrences with local attention.* arXiv:2402.19427. — Hawk.
- Peng et al. (2024). *Eagle and Finch: RWKV with matrix-valued states and dynamic recurrence.* arXiv:2404.05892. — RWKV.
**Test-Time Training, KAN, Reservoir, Spiking**
- Sun et al. (2024). *Learning to (Learn at Test Time): RNNs with expressive hidden states.* ICML 2025. — StreamingTTT.
- Behrouz, Zhong, Mirrokni (2025). *Titans: Learning to memorize at test time.* arXiv:2501.00663. — momentum + weight-decay TTT.
- Liu et al. (2024). *KAN: Kolmogorov-Arnold Networks.* ICLR 2025.
- Hoang et al. (2026). *Ultrafast on-chip online learning via Kolmogorov-Arnold Networks.* arXiv:2602.02056. — streaming convergence.
- Gauthier, Bollt, Griffith, Barbosa (2021). *Next generation reservoir computing.* Nature Communications, 12, 5564.
- Rodan, Tiňo (2011). *Minimum complexity echo state network.* IEEE TNN, 22(1), 131-144.
- Bellec et al. (2020). *A solution to the learning dilemma for recurrent networks of spiking neurons.* Nature Communications, 11, 3625. — e-prop.
**Mixture-of-Experts and AutoML**
- Shazeer et al. (2017). *Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.* ICLR 2017.
- Aspis et al. (2025). *DriftMoE: Mixture of experts for streaming classification with concept drift.* ECMLPKDD 2025.
- Wu, Iyer, Wang (2021). *ChaCha for online AutoML.* ICML 2021.
- Qi et al. (2023). *Discounted Thompson Sampling for non-stationary bandits.* arXiv:2305.10718.
**Continual Learning, Conformal, Projection**
- Dohare et al. (2024). *Loss of plasticity in deep continual learning.* Nature, 632, 768-774.
- Kirkpatrick et al. (2017). *Overcoming catastrophic forgetting in neural networks.* PNAS, 114(13). — EWC.
- Angelopoulos, Candes, Tibshirani (2023). *Conformal PID control for time series prediction.* NeurIPS 2023.
- Yang (1995). *Projection approximation subspace tracking.* IEEE TSP, 43(1). — PAST.
## Further Reading
| Document | Contents |
|----------|----------|
| [`MODELS.md`](MODELS.md) | Per-model architecture, paper citation, when-to-use, math summary, config reference |
| [`docs/USAGE.md`](docs/USAGE.md) | Extended ergonomics — pipelines, AutoML, MoE composition, embedded deployment |
| [`BENCHMARKS.md`](BENCHMARKS.md) | Benchmark methodology, datasets, throughput numbers, Pareto plots |
| [`REFERENCES.md`](REFERENCES.md) | Complete bibliography, organized by tier |
| [`examples/`](examples/) | Runnable examples, organized `01_quickstart` → `02_essentials` → `03_neural` → `04_advanced` |
| [`CHANGELOG.md`](CHANGELOG.md) | Release history |
| [`CONTRIBUTING.md`](CONTRIBUTING.md) | Contribution guide and code standards |
| [docs.rs](https://docs.rs/irithyll) | Full API reference |
## License
Licensed under either of
- Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or <http://www.apache.org/licenses/LICENSE-2.0>)
- MIT License ([LICENSE-MIT](LICENSE-MIT) or <http://opensource.org/licenses/MIT>)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
**MSRV:** 1.75. Checked in CI; raised only in minor version bumps.