# irithyll

**Streaming Gradient Boosted Trees for evolving data streams.**

Irithyll implements the SGBT algorithm (Gunasekara et al., 2024) in pure Rust, providing incremental gradient boosted tree ensembles that learn one sample at a time. Trees use Hoeffding-bound split decisions and are automatically replaced when concept drift is detected, making the model suitable for non-stationary environments where the data distribution shifts over time.

Irithyll is built for systems where data never stops -- algorithmic trading, IoT telemetry, real-time anomaly detection -- and extends the original paper with continuous adaptation mechanisms for production use.
## Features

### Core Algorithm

- True online learning -- train one sample at a time with `train_one()`; no batching required
- Concept drift detection -- automatic tree replacement via Page-Hinkley, ADWIN, or DDM detectors
- Multi-class support -- `MulticlassSGBT` with one-vs-rest committees and softmax normalization
- Three SGBT variants -- Standard, Skip (SGBT-SK), and MultipleIterations (SGBT-MI), per the paper
- Pluggable loss functions -- squared, logistic, softmax, Huber, or bring your own via the `Loss` trait
- Hoeffding tree splitting -- statistically grounded split decisions with configurable confidence
- XGBoost-style regularization -- L2 (`lambda`) and minimum gain (`gamma`) on leaf weights
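The Hoeffding bound behind the split decisions says that after `n` observations of a quantity with range `R`, the observed mean lies within `eps = sqrt(R^2 * ln(1/delta) / (2n))` of the true mean with probability `1 - delta`; a leaf splits once the gain gap between the best and second-best candidate exceeds `eps`. A self-contained sketch of that test (illustrative, not irithyll's internal code):

```rust
// Hoeffding bound: with probability 1 - delta, the observed mean of a
// variable with range `range` is within `eps` of the true mean after
// `n` observations. Standalone sketch, not the crate's internals.
fn hoeffding_bound(range: f64, delta: f64, n: u64) -> f64 {
    ((range * range * (1.0 / delta).ln()) / (2.0 * n as f64)).sqrt()
}

// Split when the gain gap between the best and runner-up candidate
// exceeds the bound: the best split is then truly best w.h.p.
fn should_split(best_gain: f64, second_gain: f64, range: f64, delta: f64, n: u64) -> bool {
    best_gain - second_gain > hoeffding_bound(range, delta, n)
}

fn main() {
    // With delta = 1e-7 (the documented default), the bound tightens as n grows,
    // so more samples permit finer distinctions between candidate splits.
    let early = hoeffding_bound(1.0, 1e-7, 200);
    let late = hoeffding_bound(1.0, 1e-7, 20_000);
    assert!(late < early);
    assert!(should_split(0.5, 0.2, 1.0, 1e-7, 200)); // gap 0.3 > eps ~ 0.20
    println!("eps@200 = {early:.4}, eps@20000 = {late:.4}");
}
```

This is why `grace_period` matters: evaluating splits before enough samples accumulate would compare gains against a very loose bound.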
### Streaming Adaptation (beyond the paper)

- EWMA leaf decay -- exponential moving average on leaf statistics via `leaf_half_life`, enabling continuous adaptation without tree replacement
- Lazy histogram decay -- O(1) amortized forward decay per sample (not O(n_bins)), mathematically exact with automatic renormalization
- Proactive tree replacement -- time-based tree cycling via `max_tree_samples`, independent of drift detectors
- Split re-evaluation -- EFDT-inspired re-evaluation of max-depth leaves via `split_reeval_interval`
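A half-life `h` maps to a per-sample decay factor `alpha = 0.5^(1/h)`, and the lazy-decay trick avoids touching every bin on every sample: instead of multiplying all statistics by `alpha`, each increment is scaled up by a growing weight and the total is renormalized on read. A standalone sketch of why the two are exactly equivalent (illustrative, not the crate's histogram code):

```rust
// Per-sample EWMA decay factor from a half-life: after `half_life`
// samples, old statistics carry exactly half their original weight.
fn decay_factor(half_life: f64) -> f64 {
    0.5f64.powf(1.0 / half_life)
}

fn main() {
    let alpha = decay_factor(1000.0);
    let xs = [3.0, 1.0, 4.0, 1.0, 5.0];

    // Eager decay: multiply the accumulator by alpha on every sample
    // (for a histogram this would be O(n_bins) per sample).
    let mut eager = 0.0;
    for &x in &xs {
        eager = eager * alpha + x;
    }

    // Lazy decay: keep one global weight alpha^(-t), scale increments up,
    // and divide once on read -- O(1) per sample, mathematically exact.
    let mut raw = 0.0;
    let mut weight = 1.0;
    for &x in &xs {
        weight /= alpha; // weight = alpha^(-t) after t samples
        raw += x * weight;
    }
    let lazy = raw / weight;

    assert!((eager - lazy).abs() < 1e-9);
}
```

Since `weight` grows without bound, a real implementation renormalizes (folds `weight` back into the stored values) before it can overflow, which is the "automatic renormalization" noted above.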
### Production Infrastructure

- Async tokio-native streaming -- `AsyncSGBT` with bounded channels, concurrent `Predictor` handles, and backpressure
- Model checkpointing -- `save_model()` / `load_model()` for JSON checkpoint/restore with backward-compatible deserialization
- Online metrics -- incremental MAE, MSE, RMSE, R-squared, accuracy, precision, recall, F1, and log loss with O(1) state
- Feature importance -- accumulated split gain per feature across the ensemble
- Deterministic seeding -- reproducible results via `SGBTConfig::seed`
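"O(1) state" means each metric keeps a handful of accumulators rather than the sample history. A minimal sketch of the idea for MSE/RMSE (illustrative; not irithyll's metrics API):

```rust
/// Running MSE/RMSE with constant state: one counter, one accumulator.
/// The full history is never stored, so memory is O(1) regardless of
/// how many samples have streamed through.
#[derive(Default)]
struct OnlineMSE {
    n: u64,
    sum_sq: f64,
}

impl OnlineMSE {
    fn update(&mut self, y_true: f64, y_pred: f64) {
        let e = y_true - y_pred;
        self.n += 1;
        self.sum_sq += e * e;
    }
    fn mse(&self) -> f64 {
        self.sum_sq / self.n as f64
    }
    fn rmse(&self) -> f64 {
        self.mse().sqrt()
    }
}

fn main() {
    let mut m = OnlineMSE::default();
    m.update(1.0, 0.0); // error  1.0
    m.update(2.0, 4.0); // error -2.0
    assert!((m.mse() - 2.5).abs() < 1e-12); // (1 + 4) / 2
    assert!((m.rmse() - 2.5f64.sqrt()).abs() < 1e-12);
}
```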
### Optional Accelerators

- Parallel training (`parallel`) -- Rayon-based data-parallel tree training
- SIMD histograms (`simd`) -- AVX2 intrinsics for histogram gradient summation
- Arrow integration (`arrow`) -- train from `RecordBatch`, predict to arrays
- Parquet I/O (`parquet`) -- bulk training directly from Parquet files
- ONNX export (`onnx`) -- export trained models for cross-platform inference
## Quick Start

The snippets below sketch the documented API surface (`train_one`, `SGBTConfig`, `AsyncSGBT`); exact item paths and constructor names are in the API docs.

### Regression

```rust
use irithyll::{SGBT, SGBTConfig};

let config = SGBTConfig::builder()
    .n_steps(100)
    .learning_rate(0.0125)
    .build()
    .expect("valid config");

let mut model = SGBT::new(config);

// True online learning: one sample at a time, no batching.
for (x, y) in stream {           // `stream` yields (Vec<f64>, f64) pairs
    model.train_one(&x, y);
}
```
### Binary Classification

```rust
use irithyll::{SGBT, SGBTConfig};
use irithyll::loss::LogisticLoss;

let config = SGBTConfig::builder().build().expect("valid config");

// Plug in logistic loss via the `Loss` trait for {0, 1} targets.
// (Constructor name illustrative -- see the API docs.)
let mut model = SGBT::with_loss(config, LogisticLoss);

for (x, label) in stream {       // label: 0.0 or 1.0
    model.train_one(&x, label);
}
```
### Async Streaming

```rust
use irithyll::SGBTConfig;
use irithyll::stream::AsyncSGBT;

#[tokio::main]
async fn main() {
    let config = SGBTConfig::builder().build().expect("valid config");

    // AsyncSGBT trains from a bounded channel (backpressure included)
    // and hands out concurrent Predictor handles.
    // (Method names illustrative -- see the API docs.)
    let (trainer, predictor) = AsyncSGBT::spawn(config, 1024);

    tokio::spawn(async move {
        for (x, y) in samples() { // `samples()` yields (Vec<f64>, f64)
            trainer.send((x, y)).await.unwrap();
        }
    });

    let y_hat = predictor.predict(&[0.5, 1.0]).await;
}
```
### Streaming Adaptation

```rust
use irithyll::SGBTConfig;

// Enable the continuous-adaptation mechanisms (values illustrative).
let config = SGBTConfig::builder()
    .leaf_half_life(5_000)          // EWMA decay on leaf statistics
    .max_tree_samples(100_000)      // proactive tree replacement
    .split_reeval_interval(10_000)  // EFDT-style split re-evaluation
    .build()
    .expect("valid config");
```
## Architecture

```text
irithyll/
  loss/           Differentiable loss functions (squared, logistic, softmax, huber)
  histogram/      Streaming histogram binning (uniform, quantile, optional k-means)
  tree/           Hoeffding-bound streaming decision trees
  drift/          Concept drift detectors (Page-Hinkley, ADWIN, DDM)
  ensemble/       SGBT boosting loop, config, variants, multi-class, parallel training
  stream/         Async tokio channel-based training runner and predictor handles
  metrics/        Online regression and classification metric trackers
  serde_support/  Model checkpoint/restore serialization
```
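The detectors in `drift/` follow the standard formulations. Page-Hinkley, for instance, tracks the cumulative deviation of a monitored signal (e.g. per-sample loss) from its running mean and flags drift when that deviation rises more than a threshold above its historical minimum. A minimal standalone sketch, assuming the documented default `PageHinkley(0.005, 50.0)` is the tolerance/threshold pair (not the crate's implementation):

```rust
/// Minimal Page-Hinkley drift detector: signals when the monitored
/// signal drifts upward by more than `lambda` relative to its mean.
struct PageHinkley {
    delta: f64,   // tolerance subtracted from each deviation
    lambda: f64,  // detection threshold
    n: u64,
    mean: f64,    // running mean of the monitored signal
    cum: f64,     // cumulative deviation
    min_cum: f64, // minimum cumulative deviation seen so far
}

impl PageHinkley {
    fn new(delta: f64, lambda: f64) -> Self {
        Self { delta, lambda, n: 0, mean: 0.0, cum: 0.0, min_cum: 0.0 }
    }

    /// Feed one observation; returns true when drift is detected.
    fn update(&mut self, x: f64) -> bool {
        self.n += 1;
        self.mean += (x - self.mean) / self.n as f64;
        self.cum += x - self.mean - self.delta;
        self.min_cum = self.min_cum.min(self.cum);
        self.cum - self.min_cum > self.lambda
    }
}

fn main() {
    let mut ph = PageHinkley::new(0.005, 50.0);
    let mut detected_at = None;
    for t in 0..2000 {
        let loss = if t < 1000 { 0.1 } else { 1.1 }; // abrupt shift at t = 1000
        if ph.update(loss) && detected_at.is_none() {
            detected_at = Some(t);
        }
    }
    // Drift is flagged shortly after the shift, never before it.
    assert!(matches!(detected_at, Some(t) if t >= 1000));
}
```

When a detector fires, the ensemble replaces the affected tree rather than resetting the whole model.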
## Configuration

All hyperparameters are set via the builder pattern, with validation on `build()`:

```rust
use irithyll::SGBTConfig;

let config = SGBTConfig::builder()
    .n_steps(100)                  // number of boosting steps (trees)
    .learning_rate(0.0125)         // shrinkage factor
    .feature_subsample_rate(0.75)  // fraction of features per tree
    .max_depth(6)                  // maximum tree depth
    .n_bins(64)                    // histogram bins per feature
    .lambda(1.0)                   // L2 regularization
    .gamma(0.0)                    // minimum split gain
    .grace_period(200)             // samples before evaluating splits
    .delta(1e-7)                   // Hoeffding bound confidence
    .build()
    .expect("invalid SGBT configuration");
```
| Parameter | Default | Description |
|---|---|---|
| `n_steps` | `100` | Number of boosting steps (trees in ensemble) |
| `learning_rate` | `0.0125` | Shrinkage factor applied to each tree output |
| `feature_subsample_rate` | `0.75` | Fraction of features sampled per tree |
| `max_depth` | `6` | Maximum depth of each streaming tree |
| `n_bins` | `64` | Number of histogram bins per feature |
| `lambda` | `1.0` | L2 regularization on leaf weights |
| `gamma` | `0.0` | Minimum gain required to make a split |
| `grace_period` | `200` | Minimum samples before evaluating splits |
| `delta` | `1e-7` | Hoeffding bound confidence parameter |
| `drift_detector` | `PageHinkley(0.005, 50.0)` | Drift detection algorithm for tree replacement |
| `variant` | `Standard` | Computational variant (Standard, Skip, MI) |
| `leaf_half_life` | `None` (disabled) | EWMA decay half-life for leaf statistics |
| `max_tree_samples` | `None` (disabled) | Proactive tree replacement threshold |
| `split_reeval_interval` | `None` (disabled) | Re-evaluation interval for max-depth leaves |
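The `lambda` and `gamma` parameters act exactly as in XGBoost: a leaf holding gradient sum `G` and Hessian sum `H` receives weight `w* = -G / (H + lambda)`, and a candidate split is kept only when its gain improvement exceeds `gamma`. A standalone illustration of those formulas (not the crate's tree code):

```rust
/// Optimal leaf weight under L2 regularization, as in XGBoost:
/// w* = -G / (H + lambda), where G and H are the summed first- and
/// second-order gradients of the samples reaching the leaf.
fn leaf_weight(grad_sum: f64, hess_sum: f64, lambda: f64) -> f64 {
    -grad_sum / (hess_sum + lambda)
}

/// Split gain relative to keeping the leaf whole; a split is accepted
/// only when this is positive after subtracting the `gamma` threshold.
fn split_gain(gl: f64, hl: f64, gr: f64, hr: f64, lambda: f64, gamma: f64) -> f64 {
    let score = |g: f64, h: f64| g * g / (h + lambda);
    0.5 * (score(gl, hl) + score(gr, hr) - score(gl + gr, hl + hr)) - gamma
}

fn main() {
    // Larger lambda pulls leaf weights toward zero (stronger shrinkage).
    assert!(leaf_weight(-10.0, 10.0, 1.0).abs() < leaf_weight(-10.0, 10.0, 0.0).abs());

    // A clean left/right gradient separation yields a large positive gain.
    let gain = split_gain(-8.0, 5.0, 8.0, 5.0, 1.0, 0.0);
    assert!(gain > 0.0);
}
```

Raising `gamma` therefore prunes marginal splits, while raising `lambda` dampens every leaf's contribution.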
## Feature Flags

| Feature | Dependencies | Description |
|---|---|---|
| `serde-json` (default) | `serde_json` | JSON model serialization |
| `serde-bincode` | `bincode` | Compact binary serialization |
| `parallel` | `rayon` | Parallel tree training |
| `simd` | -- | AVX2 histogram acceleration |
| `kmeans-binning` | -- | K-means histogram binning |
| `arrow` | `arrow` | Apache Arrow integration |
| `parquet` | `parquet` | Parquet file I/O |
| `onnx` | `prost` | ONNX model export |
| `neural-leaves` | -- | Experimental MLP leaf models |
| `full` | all above | Enable everything |
## Examples

Run any example with `cargo run --example <name>`:

| Example | Description |
|---|---|
| `basic_regression` | Linear regression with RMSE tracking |
| `classification` | Binary classification with logistic loss |
| `async_ingestion` | Tokio-native async training with concurrent prediction |
| `custom_loss` | Implementing a custom loss function |
| `drift_detection` | Abrupt concept drift with recovery analysis |
| `model_checkpointing` | Save/restore models with prediction verification |
| `streaming_metrics` | Prequential evaluation with windowed metrics |
## Minimum Supported Rust Version

The MSRV is 1.75. It is checked in CI and will only be raised in minor version bumps.
## References

Gunasekara, N., Pfahringer, B., Gomes, H. M., & Bifet, A. (2024). Gradient boosted trees for evolving data streams. *Machine Learning*, 113, 3325--3352.
## License

Licensed under either of

- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT License (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.
## Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.