irithyll 5.0.0

Streaming Gradient Boosted Trees for evolving data streams
irithyll

Irithyll is a pure Rust implementation of the SGBT algorithm (Gunasekara et al., 2024). It learns one sample at a time. No batches, no windows, no retraining. Each tree in the ensemble uses Hoeffding-bound split decisions to grow incrementally, and when the data distribution shifts, concept drift detectors trigger automatic tree replacement so the model stays current.
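To make the drift-triggered replacement concrete, here is a minimal sketch of a Page-Hinkley test, one of the detectors irithyll lists. It is a textbook formulation written for illustration, not irithyll's implementation: the struct name, the field names, and the reading of the two parameters as a tolerance (`delta`) and a threshold (`lambda`) are all assumptions.

```rust
/// Textbook Page-Hinkley test for an upward shift in the mean of a stream.
/// Hypothetical sketch; not irithyll's detector type.
pub struct PageHinkley {
    delta: f64,   // tolerated magnitude of change
    lambda: f64,  // detection threshold
    n: u64,       // observations seen since the last reset
    mean: f64,    // running mean of the observations
    cum: f64,     // cumulative deviation from the running mean
    min_cum: f64, // smallest cumulative deviation seen so far
}

impl PageHinkley {
    pub fn new(delta: f64, lambda: f64) -> Self {
        Self { delta, lambda, n: 0, mean: 0.0, cum: 0.0, min_cum: 0.0 }
    }

    /// Feed one observation (for example a per-sample error).
    /// Returns true when drift is signalled, then resets its own state.
    pub fn update(&mut self, x: f64) -> bool {
        self.n += 1;
        self.mean += (x - self.mean) / self.n as f64;
        self.cum += x - self.mean - self.delta;
        self.min_cum = self.min_cum.min(self.cum);
        if self.cum - self.min_cum > self.lambda {
            let (delta, lambda) = (self.delta, self.lambda);
            *self = Self::new(delta, lambda);
            true
        } else {
            false
        }
    }
}
```

In a boosting ensemble, a `true` return from a detector like this would be the cue to retire the affected tree and start a fresh one.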

The paper laid the foundation, but deploying streaming trees in long-running systems required going further. Irithyll adds EWMA leaf decay for continuous forgetting, lazy O(1) histogram decay (because decaying every bin on every sample doesn't scale), proactive tree replacement on a timer, and EFDT-style split re-evaluation at max-depth leaves. Together these close the gap between the research algorithm and a system you can run indefinitely on non-stationary data.
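As a rough illustration of what an EWMA half-life means, the sketch below converts a half-life into a per-sample retention factor and applies it to a pair of leaf statistics. It assumes the half-life is counted in samples, which this README does not spell out, and `DecayedLeafStats` is a hypothetical type rather than anything exported by irithyll.

```rust
/// Per-sample retention factor such that a contribution's weight halves
/// after `half_life` further samples.
pub fn decay_factor(half_life: f64) -> f64 {
    0.5_f64.powf(1.0 / half_life)
}

/// Hypothetical leaf statistics with exponential forgetting.
pub struct DecayedLeafStats {
    pub grad_sum: f64,
    pub hess_sum: f64,
    alpha: f64,
}

impl DecayedLeafStats {
    pub fn new(half_life: f64) -> Self {
        Self { grad_sum: 0.0, hess_sum: 0.0, alpha: decay_factor(half_life) }
    }

    /// Shrink the old sums by alpha, then add the new gradient and hessian.
    pub fn update(&mut self, grad: f64, hess: f64) {
        self.grad_sum = self.alpha * self.grad_sum + grad;
        self.hess_sum = self.alpha * self.hess_sum + hess;
    }
}
```

With a half-life of 100, a sample's contribution to the leaf has half its original weight 100 updates later and a quarter after 200.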

Features

Core Algorithm

  • True online learning with train_one(), one sample at a time
  • Concept drift detection via Page-Hinkley, ADWIN, or DDM, with automatic tree replacement
  • Multi-class support through MulticlassSGBT with one-vs-rest committees
  • Multi-target regression via MultiTargetSGBT with T independent models
  • Three SGBT variants from the paper: Standard, Skip (SGBT-SK), and MultipleIterations (SGBT-MI)
  • Pluggable loss functions: squared, logistic, softmax, Huber, or implement the Loss trait yourself
  • Hoeffding tree splitting with configurable confidence bounds
  • XGBoost-style regularization: L2 (lambda) and minimum gain (gamma)
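The Hoeffding-bound split rule mentioned above can be sketched in a few lines. This is the standard bound from the Hoeffding tree literature; the function names, and the choice to pass the range of the gain metric explicitly, are assumptions made for the example and may not match how irithyll organizes it.

```rust
/// Hoeffding bound: with probability 1 - delta, the true mean of a variable
/// with the given range lies within this epsilon of its mean over n samples.
pub fn hoeffding_bound(range: f64, delta: f64, n: u64) -> f64 {
    ((range * range * (1.0 / delta).ln()) / (2.0 * n as f64)).sqrt()
}

/// Split only when the best candidate beats the runner-up by more than the
/// bound, so the choice is unlikely to flip as more data arrives.
pub fn should_split(best_gain: f64, second_best_gain: f64, range: f64, delta: f64, n: u64) -> bool {
    best_gain - second_best_gain > hoeffding_bound(range, delta, n)
}
```

A smaller `delta` demands more evidence before splitting, and the bound shrinks as `n` grows, so a leaf that cannot split now may be able to after seeing more samples.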

Interpretability

  • TreeSHAP explanations via explain() with path-dependent SHAP values (Lundberg et al., 2020)
  • Named features with explain_named() for human-readable per-feature contributions
  • StreamingShap for online running-mean |SHAP| feature importance without storing past data
  • Feature importance from accumulated split gain across the ensemble
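The idea of a running-mean |SHAP| importance can be shown without any stored history: keep one mean per feature and fold each new explanation into it. The sketch below is an assumption about the general technique; `RunningAbsMean` is a hypothetical name and not the StreamingShap type itself.

```rust
/// Running mean of |SHAP value| per feature, updated one explanation at a time.
pub struct RunningAbsMean {
    n: u64,
    mean_abs: Vec<f64>,
}

impl RunningAbsMean {
    pub fn new(n_features: usize) -> Self {
        Self { n: 0, mean_abs: vec![0.0; n_features] }
    }

    /// Fold one vector of per-feature SHAP values into the running means.
    pub fn update(&mut self, shap_values: &[f64]) {
        assert_eq!(shap_values.len(), self.mean_abs.len());
        self.n += 1;
        for (mean, value) in self.mean_abs.iter_mut().zip(shap_values) {
            *mean += (value.abs() - *mean) / self.n as f64;
        }
    }

    /// Current importance estimate, one entry per feature.
    pub fn importances(&self) -> &[f64] {
        &self.mean_abs
    }
}
```

Memory stays proportional to the number of features no matter how many samples have been explained.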

Streaming Adaptation

These go beyond the original paper to handle the realities of long-running, non-stationary systems:

  • EWMA leaf decay (leaf_half_life): exponential moving average on leaf statistics so the model gradually forgets old data without needing to replace entire trees
  • Lazy histogram decay: the decay math is O(1) per sample instead of O(n_bins), with exact results. The trick is storing samples in un-decayed coordinates and only materializing the decay when bins are actually read at split evaluation time
  • Proactive tree replacement (max_tree_samples): cycle trees on a timer, independent of drift detectors. Useful when drift is gradual and detectors don't fire
  • Split re-evaluation (split_reeval_interval): EFDT-inspired re-checking of max-depth leaves to see if splitting would now help
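The lazy histogram decay described above can be sketched as follows. This is one way to realize the "un-decayed coordinates" idea, written from the description in this list; the type name, the renormalization threshold, and the method names are assumptions, not irithyll's internals.

```rust
/// Histogram whose bins all decay by `alpha` per sample without touching
/// every bin on every update. Hypothetical sketch.
pub struct LazyDecayHistogram {
    bins: Vec<f64>, // stored in un-decayed coordinates
    scale: f64,     // cumulative decay since the last renormalization
    alpha: f64,     // per-sample retention factor, 0 < alpha <= 1
}

impl LazyDecayHistogram {
    pub fn new(n_bins: usize, alpha: f64) -> Self {
        Self { bins: vec![0.0; n_bins], scale: 1.0, alpha }
    }

    /// Age every bin by one step and add `value` to `bin`, in amortized O(1).
    pub fn add(&mut self, bin: usize, value: f64) {
        self.scale *= self.alpha;
        if self.scale < 1e-100 {
            // Rare O(n_bins) pass: fold the scale into the bins so the
            // stored values cannot grow without bound.
            for b in self.bins.iter_mut() {
                *b *= self.scale;
            }
            self.scale = 1.0;
        }
        // Dividing by the current scale cancels the decay applied on read.
        self.bins[bin] += value / self.scale;
    }

    /// Materialize the decayed value of one bin.
    pub fn read(&self, bin: usize) -> f64 {
        self.bins[bin] * self.scale
    }
}
```

Reading a bin gives the same value, up to floating-point rounding, as multiplying every bin by `alpha` before each insertion, but the per-sample cost no longer depends on the number of bins.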

Production Infrastructure

  • Async streaming via AsyncSGBT with tokio channels, concurrent Predictor handles, and backpressure
  • Model checkpointing with save_model() / load_model() — drift detector state is preserved across save/load
  • Online metrics: incremental MAE, MSE, RMSE, R-squared, accuracy, precision, recall, F1, log loss
  • Deterministic seeding for reproducible results
  • Python bindings via the irithyll-python crate (PyO3 + numpy, GIL-released train/predict)
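For readers unfamiliar with incremental metrics, the sketch below shows how MAE, MSE, RMSE, and R-squared can be maintained in constant memory, using Welford's update for the target variance. It is a generic illustration; `RegressionMetrics` is a hypothetical name and does not describe the types in irithyll's metrics module.

```rust
/// Constant-memory regression metrics, updated one (target, prediction) pair
/// at a time. All getters return NaN before the first update.
#[derive(Default)]
pub struct RegressionMetrics {
    n: u64,
    abs_err_sum: f64,
    sq_err_sum: f64,
    target_mean: f64, // Welford running mean of the targets
    target_m2: f64,   // Welford sum of squared deviations of the targets
}

impl RegressionMetrics {
    pub fn update(&mut self, target: f64, prediction: f64) {
        let err = target - prediction;
        self.n += 1;
        self.abs_err_sum += err.abs();
        self.sq_err_sum += err * err;
        let d = target - self.target_mean;
        self.target_mean += d / self.n as f64;
        self.target_m2 += d * (target - self.target_mean);
    }

    pub fn mae(&self) -> f64 { self.abs_err_sum / self.n as f64 }
    pub fn mse(&self) -> f64 { self.sq_err_sum / self.n as f64 }
    pub fn rmse(&self) -> f64 { self.mse().sqrt() }

    /// 1 - SS_res / SS_tot; undefined while all targets are identical.
    pub fn r_squared(&self) -> f64 { 1.0 - self.sq_err_sum / self.target_m2 }
}
```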

Optional Accelerators

  • Parallel training (parallel): Rayon-based data-parallel tree training
  • SIMD histograms (simd): AVX2 intrinsics for histogram gradient summation
  • Arrow integration (arrow): train from RecordBatch, predict to arrays
  • Parquet I/O (parquet): bulk training directly from Parquet files
  • ONNX export (onnx): export trained models for cross-platform inference

Quick Start

cargo add irithyll

Regression

use irithyll::{SGBTConfig, SGBT, Sample};

fn main() {
    let config = SGBTConfig::builder()
        .n_steps(50)
        .learning_rate(0.1)
        .build()
        .expect("valid config");

    let mut model = SGBT::new(config);

    // Stream samples one at a time
    for i in 0..500 {
        let x = i as f64 * 0.01;
        let target = 2.0 * x + 1.0;
        model.train_one(&Sample::new(vec![x], target));
    }

    let prediction = model.predict(&[3.0]);
    println!("predict(3.0) = {:.4}", prediction);
}

Binary Classification

use irithyll::{SGBTConfig, SGBT, Sample};
use irithyll::loss::logistic::LogisticLoss;

fn main() {
    let config = SGBTConfig::builder()
        .n_steps(30)
        .learning_rate(0.1)
        .build()
        .expect("valid config");

    let mut model = SGBT::with_loss(config, LogisticLoss);

    // Class 0 near (-2, -2), class 1 near (2, 2)
    for _ in 0..500 {
        model.train_one(&Sample::new(vec![-2.0, -2.0], 0.0));
        model.train_one(&Sample::new(vec![ 2.0,  2.0], 1.0));
    }

    let prob = model.predict_proba(&[1.5, 1.5]);
    println!("P(class=1 | [1.5, 1.5]) = {:.4}", prob);
}
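
For background, gradient boosting with a logistic loss works on a raw score and needs the loss's first and second derivatives with respect to that score. The sketch below shows the standard formulas; the function names are made up for this example, and nothing here is taken from irithyll's `LogisticLoss` or its `Loss` trait.

```rust
/// Map a raw score to a probability in (0, 1).
pub fn sigmoid(z: f64) -> f64 {
    1.0 / (1.0 + (-z).exp())
}

/// Gradient and hessian of the log loss with respect to the raw score,
/// for a target of 0.0 or 1.0.
pub fn logistic_grad_hess(target: f64, raw_score: f64) -> (f64, f64) {
    let p = sigmoid(raw_score);
    (p - target, p * (1.0 - p))
}
```

The gradient is the gap between the predicted probability and the label, and the hessian is largest where the model is least certain (p near 0.5).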

Explanations

use irithyll::{SGBTConfig, SGBT, Sample};

fn main() {
    let config = SGBTConfig::builder()
        .n_steps(20)
        .learning_rate(0.1)
        .feature_names(vec!["price".into(), "volume".into()])
        .build()
        .expect("valid config");

    let mut model = SGBT::new(config);

    for i in 0..500 {
        let price = i as f64 * 0.1;
        let volume = 100.0 - price;
        model.train_one(&Sample::new(vec![price, volume], price * 2.0));
    }

    // TreeSHAP: per-feature contributions
    let shap = model.explain(&[5.0, 50.0]);
    // Invariant: shap.base_value + sum(shap.values) == model.predict(&[5.0, 50.0])

    // Named explanations (sorted by |contribution|)
    if let Some(named) = model.explain_named(&[5.0, 50.0]) {
        for (name, value) in &named.values {
            println!("{}: {:.4}", name, value);
        }
    }
}

Multi-Target Regression

use irithyll::{SGBTConfig, MultiTargetSGBT};

fn main() {
    let config = SGBTConfig::builder()
        .n_steps(20)
        .learning_rate(0.1)
        .build()
        .expect("valid config");

    let mut model = MultiTargetSGBT::new(config, 3).unwrap();

    for i in 0..500 {
        let x = i as f64 * 0.01;
        model.train_one(&[x, x * 2.0], &[x * 3.0, x * -1.0, x + 1.0]);
    }

    let preds = model.predict(&[1.0, 2.0]);
    assert_eq!(preds.len(), 3);
}

Async Streaming

use irithyll::{SGBTConfig, Sample};
use irithyll::stream::AsyncSGBT;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = SGBTConfig::builder()
        .n_steps(30)
        .learning_rate(0.1)
        .build()?;

    let mut runner = AsyncSGBT::new(config);
    let sender = runner.sender();
    let predictor = runner.predictor();

    // Spawn the training loop
    let handle = tokio::spawn(async move { runner.run().await });

    // Feed samples from any async context
    for i in 0..500 {
        let x = i as f64 * 0.01;
        sender.send(Sample::new(vec![x], 2.0 * x)).await?;
    }

    // Predict concurrently while training proceeds
    let pred = predictor.predict(&[3.0]);
    println!("prediction = {:.4}", pred);

    // Drop sender to signal shutdown
    drop(sender);
    handle.await??;
    Ok(())
}

Python

import numpy as np
from irithyll_python import StreamingGBTConfig, StreamingGBT

config = StreamingGBTConfig().n_steps(50).learning_rate(0.1)
model = StreamingGBT(config)

for i in range(500):
    x = np.array([i * 0.01])
    model.train_one(x, 2.0 * x[0] + 1.0)

pred = model.predict(np.array([3.0]))

# SHAP explanations
shap = model.explain(np.array([3.0]))
print(shap.values, shap.base_value)

# Save/load
model.save("model.json")
restored = StreamingGBT.load("model.json")

Architecture

irithyll/
  loss/          Differentiable loss functions (squared, logistic, softmax, huber)
  histogram/     Streaming histogram binning (uniform, quantile, optional k-means)
  tree/          Hoeffding-bound streaming decision trees
  drift/         Concept drift detectors (Page-Hinkley, ADWIN, DDM) with serializable state
  ensemble/      SGBT boosting loop, config, variants, multi-class, multi-target, parallel
  explain/       TreeSHAP explanations and StreamingShap online importance
  stream/        Async tokio channel-based training runner and predictor handles
  metrics/       Online regression and classification metric trackers
  serde_support/ Model checkpoint/restore serialization

irithyll-python/   PyO3 Python bindings (StreamingGBT, MultiTargetGBT, ShapExplanation)

Configuration

All hyperparameters go through the builder pattern, validated on build():

use irithyll::SGBTConfig;

let config = SGBTConfig::builder()
    .n_steps(100)              // Number of boosting steps (trees)
    .learning_rate(0.0125)     // Shrinkage factor
    .feature_subsample_rate(0.75) // Fraction of features per tree
    .max_depth(6)              // Maximum tree depth
    .n_bins(64)                // Histogram bins per feature
    .lambda(1.0)               // L2 regularization
    .gamma(0.0)                // Minimum split gain
    .grace_period(200)         // Samples before evaluating splits
    .delta(1e-7)               // Hoeffding bound confidence
    .feature_names(vec!["price".into(), "volume".into()])
    .build()
    .expect("valid config");

Parameter               Default                   Description
n_steps                 100                       Number of boosting steps (trees in ensemble)
learning_rate           0.0125                    Shrinkage factor applied to each tree output
feature_subsample_rate  0.75                      Fraction of features sampled per tree
max_depth               6                         Maximum depth of each streaming tree
n_bins                  64                        Number of histogram bins per feature
lambda                  1.0                       L2 regularization on leaf weights
gamma                   0.0                       Minimum gain required to make a split
grace_period            200                       Minimum samples before evaluating splits
delta                   1e-7                      Hoeffding bound confidence parameter
drift_detector          PageHinkley(0.005, 50.0)  Drift detection algorithm for tree replacement
variant                 Standard                  Computational variant (Standard, Skip, MI)
feature_names           None                      Optional feature names for named explanations
leaf_half_life          None (disabled)           EWMA decay half-life for leaf statistics
max_tree_samples        None (disabled)           Proactive tree replacement threshold
split_reeval_interval   None (disabled)           Re-evaluation interval for max-depth leaves
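
To show how `lambda` and `gamma` typically enter the math, here is a sketch of the usual XGBoost-style leaf weight and split gain computed from gradient and hessian sums. Whether irithyll uses exactly these expressions is an assumption; the function names are invented for the example.

```rust
/// Optimal leaf weight for gradient sum `g` and hessian sum `h`
/// under L2 regularization `lambda`.
pub fn leaf_weight(g: f64, h: f64, lambda: f64) -> f64 {
    -g / (h + lambda)
}

/// Gain from splitting a node into left and right children, minus the
/// per-split penalty `gamma`. A split is worthwhile only if this is positive.
pub fn split_gain(g_left: f64, h_left: f64, g_right: f64, h_right: f64, lambda: f64, gamma: f64) -> f64 {
    let score = |g: f64, h: f64| g * g / (h + lambda);
    let parent = score(g_left + g_right, h_left + h_right);
    0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma
}
```

A larger `lambda` shrinks leaf weights toward zero, and a larger `gamma` raises the bar a candidate split has to clear.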

Feature Flags

Feature               Dependencies  Description
serde-json (default)  serde_json    JSON model serialization
serde-bincode         bincode       Compact binary serialization
parallel              rayon         Parallel tree training
simd                  --            AVX2 histogram acceleration
kmeans-binning        --            K-means histogram binning
arrow                 arrow         Apache Arrow integration
parquet               parquet       Parquet file I/O
onnx                  prost         ONNX model export
neural-leaves         --            Experimental MLP leaf models
full                  all above     Enable everything

Examples

Run any example with cargo run --example <name>:

Example              Description
basic_regression     Linear regression with RMSE tracking
classification       Binary classification with logistic loss
async_ingestion      Tokio-native async training with concurrent prediction
custom_loss          Implementing a custom loss function
drift_detection      Abrupt concept drift with recovery analysis
model_checkpointing  Save/restore models with prediction verification
streaming_metrics    Prequential evaluation with windowed metrics
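
Prequential (test-then-train) evaluation scores each sample before the model learns from it, so every prediction is made on unseen data. The sketch below shows the loop with a cumulative MAE instead of a windowed one; the `OnlineRegressor` trait and the `RunningMeanModel` baseline are hypothetical stand-ins and are not part of irithyll or its examples.

```rust
/// Hypothetical minimal interface for an online regressor.
pub trait OnlineRegressor {
    fn predict(&self, features: &[f64]) -> f64;
    fn train_one(&mut self, features: &[f64], target: f64);
}

/// Trivial baseline: predicts the mean of the targets seen so far.
#[derive(Default)]
pub struct RunningMeanModel {
    n: u64,
    mean: f64,
}

impl OnlineRegressor for RunningMeanModel {
    fn predict(&self, _features: &[f64]) -> f64 {
        self.mean
    }
    fn train_one(&mut self, _features: &[f64], target: f64) {
        self.n += 1;
        self.mean += (target - self.mean) / self.n as f64;
    }
}

/// Test-then-train: predict on each sample first, then train on it.
/// Returns the mean absolute error over the whole stream.
pub fn prequential_mae<M: OnlineRegressor>(model: &mut M, stream: &[(Vec<f64>, f64)]) -> f64 {
    let mut abs_err_sum = 0.0;
    for (features, target) in stream {
        abs_err_sum += (target - model.predict(features)).abs();
        model.train_one(features, *target);
    }
    abs_err_sum / stream.len() as f64
}
```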

Minimum Supported Rust Version

The MSRV is 1.75. This is checked in CI and will only be raised in minor version bumps.

References

Gunasekara, N., Pfahringer, B., Gomes, H. M., & Bifet, A. (2024). Gradient boosted trees for evolving data streams. Machine Learning, 113, 3325-3352.

Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., & Lee, S.-I. (2020). From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2, 56-67.

License

Licensed under either of

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.