irithyll

Streaming Gradient Boosted Trees for evolving data streams.

Irithyll implements the SGBT algorithm (Gunasekara et al., 2024) in pure Rust, providing incremental gradient boosted tree ensembles that learn one sample at a time. Trees use Hoeffding-bound split decisions and are automatically replaced when concept drift is detected, making the model suitable for non-stationary environments where the data distribution shifts over time.

Built for systems where data never stops — algorithmic trading, IoT telemetry, real-time anomaly detection — and extends the original paper with continuous adaptation mechanisms for production use.

Features

Core Algorithm

True online learning -- train one sample at a time with train_one(), no batching required
Concept drift detection -- automatic tree replacement via Page-Hinkley, ADWIN, or DDM detectors
Multi-class support -- MulticlassSGBT with one-vs-rest committees and softmax normalization
Three SGBT variants -- Standard, Skip (SGBT-SK), and MultipleIterations (SGBT-MI) per the paper
Pluggable loss functions -- squared, logistic, softmax, Huber, or bring your own via the Loss trait
Hoeffding tree splitting -- statistically-grounded split decisions with configurable confidence
XGBoost-style regularization -- L2 (lambda) and minimum gain (gamma) on leaf weights

Streaming Adaptation (beyond the paper)

EWMA leaf decay -- exponential moving average on leaf statistics via leaf_half_life, enabling continuous adaptation without tree replacement
Lazy histogram decay -- O(1) amortized forward decay per sample (not O(n_bins)), mathematically exact with automatic renormalization
Proactive tree replacement -- time-based tree cycling via max_tree_samples, independent of drift detectors
Split re-evaluation -- EFDT-inspired re-evaluation of max-depth leaves via split_reeval_interval

Production Infrastructure

Async tokio-native streaming -- AsyncSGBT with bounded channels, concurrent Predictor handles, and backpressure
Model checkpointing -- save_model() / load_model() for JSON checkpoint/restore with backward-compatible deserialization
Online metrics -- incremental MAE, MSE, RMSE, R-squared, accuracy, precision, recall, F1, and log loss with O(1) state
Feature importance -- accumulated split gain per feature across the ensemble
Deterministic seeding -- reproducible results via SGBTConfig::seed

Optional Accelerators

Parallel training (parallel) -- Rayon-based data-parallel tree training
SIMD histograms (simd) -- AVX2 intrinsics for histogram gradient summation
Arrow integration (arrow) -- train from RecordBatch, predict to arrays
Parquet I/O (parquet) -- bulk training directly from Parquet files
ONNX export (onnx) -- export trained models for cross-platform inference

Quick Start

cargo add irithyll

Regression

use irithyll::{SGBTConfig, SGBT, Sample};

fn main() {
    let config = SGBTConfig::builder()
        .n_steps(50)
        .learning_rate(0.1)
        .build()
        .expect("valid config");

    let mut model = SGBT::new(config);

    // Stream samples one at a time
    for i in 0..500 {
        let x = i as f64 * 0.01;
        let target = 2.0 * x + 1.0;
        model.train_one(&Sample::new(vec![x], target));
    }

    let prediction = model.predict(&[3.0]);
    println!("predict(3.0) = {:.4}", prediction);
}

Binary Classification

use irithyll::{SGBTConfig, SGBT, Sample};
use irithyll::loss::logistic::LogisticLoss;

fn main() {
    let config = SGBTConfig::builder()
        .n_steps(30)
        .learning_rate(0.1)
        .build()
        .expect("valid config");

    let mut model = SGBT::with_loss(config, Box::new(LogisticLoss));

    // Class 0 near (-2, -2), class 1 near (2, 2)
    for _ in 0..500 {
        model.train_one(&Sample::new(vec![-2.0, -2.0], 0.0));
        model.train_one(&Sample::new(vec![ 2.0,  2.0], 1.0));
    }

    let prob = model.predict_proba(&[1.5, 1.5]);
    println!("P(class=1 | [1.5, 1.5]) = {:.4}", prob);
}

Async Streaming

use irithyll::{SGBTConfig, Sample};
use irithyll::stream::AsyncSGBT;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = SGBTConfig::builder()
        .n_steps(30)
        .learning_rate(0.1)
        .build()?;

    let mut runner = AsyncSGBT::new(config);
    let sender = runner.sender();
    let predictor = runner.predictor();

    // Spawn the training loop
    let handle = tokio::spawn(async move { runner.run().await });

    // Feed samples from any async context
    for i in 0..500 {
        let x = i as f64 * 0.01;
        sender.send(Sample::new(vec![x], 2.0 * x)).await?;
    }

    // Predict concurrently while training proceeds
    let pred = predictor.predict(&[3.0]);
    println!("prediction = {:.4}", pred);

    // Drop sender to signal shutdown
    drop(sender);
    handle.await??;
    Ok(())
}

Streaming Adaptation

use irithyll::{SGBTConfig, SGBT, Sample};

fn main() {
    let config = SGBTConfig::builder()
        .n_steps(50)
        .learning_rate(0.1)
        // EWMA: leaf statistics half-life of 500 samples
        .leaf_half_life(500)
        // Replace trees after 10K samples regardless of drift
        .max_tree_samples(10_000)
        // Re-evaluate max-depth leaves every 1000 samples
        .split_reeval_interval(1000)
        .build()
        .expect("valid config");

    let mut model = SGBT::new(config);

    // The model now continuously adapts through three mechanisms:
    // 1. Leaf statistics decay exponentially (recent data weighted more)
    // 2. Trees are proactively replaced to prevent staleness
    // 3. Max-depth leaves re-evaluate whether splitting would help
    for i in 0..10_000 {
        let x = i as f64 * 0.001;
        let target = if i < 5000 { 2.0 * x } else { -x + 10.0 };
        model.train_one(&Sample::new(vec![x], target));
    }
}

Architecture

irithyll/
  loss/          Differentiable loss functions (squared, logistic, softmax, huber)
  histogram/     Streaming histogram binning (uniform, quantile, optional k-means)
  tree/          Hoeffding-bound streaming decision trees
  drift/         Concept drift detectors (Page-Hinkley, ADWIN, DDM)
  ensemble/      SGBT boosting loop, config, variants, multi-class, parallel training
  stream/        Async tokio channel-based training runner and predictor handles
  metrics/       Online regression and classification metric trackers
  serde_support/ Model checkpoint/restore serialization

Configuration

All hyperparameters are set via the builder pattern with validation on build():

use irithyll::SGBTConfig;

let config = SGBTConfig::builder()
    .n_steps(100)              // Number of boosting steps (trees)
    .learning_rate(0.0125)     // Shrinkage factor
    .feature_subsample_rate(0.75) // Fraction of features per tree
    .max_depth(6)              // Maximum tree depth
    .n_bins(64)                // Histogram bins per feature
    .lambda(1.0)               // L2 regularization
    .gamma(0.0)                // Minimum split gain
    .grace_period(200)         // Samples before evaluating splits
    .delta(1e-7)               // Hoeffding bound confidence
    .build()
    .expect("valid config");

Parameter	Default	Description
`n_steps`	100	Number of boosting steps (trees in ensemble)
`learning_rate`	0.0125	Shrinkage factor applied to each tree output
`feature_subsample_rate`	0.75	Fraction of features sampled per tree
`max_depth`	6	Maximum depth of each streaming tree
`n_bins`	64	Number of histogram bins per feature
`lambda`	1.0	L2 regularization on leaf weights
`gamma`	0.0	Minimum gain required to make a split
`grace_period`	200	Minimum samples before evaluating splits
`delta`	1e-7	Hoeffding bound confidence parameter
`drift_detector`	PageHinkley(0.005, 50.0)	Drift detection algorithm for tree replacement
`variant`	Standard	Computational variant (Standard, Skip, MI)
`leaf_half_life`	None (disabled)	EWMA decay half-life for leaf statistics
`max_tree_samples`	None (disabled)	Proactive tree replacement threshold
`split_reeval_interval`	None (disabled)	Re-evaluation interval for max-depth leaves

Feature Flags

Feature	Dependencies	Description
`serde-json` (default)	`serde_json`	JSON model serialization
`serde-bincode`	`bincode`	Compact binary serialization
`parallel`	`rayon`	Parallel tree training
`simd`	--	AVX2 histogram acceleration
`kmeans-binning`	--	K-means histogram binning
`arrow`	`arrow`	Apache Arrow integration
`parquet`	`parquet`	Parquet file I/O
`onnx`	`prost`	ONNX model export
`neural-leaves`	--	Experimental MLP leaf models
`full`	all above	Enable everything

Examples

Run any example with cargo run --example <name>:

Example	Description
`basic_regression`	Linear regression with RMSE tracking
`classification`	Binary classification with logistic loss
`async_ingestion`	Tokio-native async training with concurrent prediction
`custom_loss`	Implementing a custom loss function
`drift_detection`	Abrupt concept drift with recovery analysis
`model_checkpointing`	Save/restore models with prediction verification
`streaming_metrics`	Prequential evaluation with windowed metrics

Minimum Supported Rust Version

The MSRV is 1.75. This is checked in CI and will only be raised in minor version bumps.

References

Gunasekara, N., Pfahringer, B., Gomes, H. M., & Bifet, A. (2024). Gradient boosted trees for evolving data streams. Machine Learning, 113, 3325--3352.

License

Licensed under either of

Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT License (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

irithyll 1.0.0