# irithyll
Streaming Gradient Boosted Trees for evolving data streams.
Irithyll is a pure Rust implementation of the SGBT algorithm (Gunasekara et al., 2024). It learns one sample at a time. No batches, no windows, no retraining. Each tree in the ensemble uses Hoeffding-bound split decisions to grow incrementally, and when the data distribution shifts, concept drift detectors trigger automatic tree replacement so the model stays current.
The paper laid the foundation, but deploying streaming trees in long-running systems required going further. Irithyll adds EWMA leaf decay for continuous forgetting, lazy O(1) histogram decay (because decaying every bin on every sample doesn't scale), proactive tree replacement on a timer, and EFDT-style split re-evaluation at max-depth leaves. Together these close the gap between the research algorithm and a system you can run indefinitely on non-stationary data.
## Features

### Core Algorithm

- True online learning with `train_one()`, one sample at a time
- Concept drift detection via Page-Hinkley, ADWIN, or DDM, with automatic tree replacement
- Multi-class support through `MulticlassSGBT` with one-vs-rest committees
- Multi-target regression via `MultiTargetSGBT` with T independent models
- Three SGBT variants from the paper: Standard, Skip (SGBT-SK), and MultipleIterations (SGBT-MI)
- Pluggable loss functions: squared, logistic, softmax, Huber, or implement the `Loss` trait yourself
- Hoeffding tree splitting with configurable confidence bounds
- XGBoost-style regularization: L2 (`lambda`) and minimum gain (`gamma`)
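The Hoeffding-bound split rule fits in a few lines: a leaf commits to its best split only once the gain gap over the runner-up exceeds the bound ε = sqrt(R² · ln(1/δ) / 2n). This standalone sketch is illustrative, not the crate's internal code:

```rust
/// Hoeffding bound: with probability 1 - delta, the true mean of a
/// variable with range `r` lies within `eps` of the observed mean
/// after `n` samples.
fn hoeffding_bound(r: f64, delta: f64, n: u64) -> f64 {
    ((r * r * (1.0 / delta).ln()) / (2.0 * n as f64)).sqrt()
}

/// Split when the gap between the best and second-best candidate gain
/// exceeds the bound, or when the bound shrinks below a tie threshold.
fn should_split(best: f64, second: f64, r: f64, delta: f64, n: u64, tie: f64) -> bool {
    let eps = hoeffding_bound(r, delta, n);
    (best - second) > eps || eps < tie
}
```

As more samples arrive, ε shrinks, so ambiguous splits resolve themselves given enough data.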
### Interpretability

- TreeSHAP explanations via `explain()` with path-dependent SHAP values (Lundberg et al., 2020)
- Named features with `explain_named()` for human-readable per-feature contributions
- `StreamingShap` for online running-mean |SHAP| feature importance without storing past data
- Feature importance from accumulated split gain across the ensemble
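A running mean of |SHAP| needs only a count and one accumulator per feature, which is why no past data has to be stored. A minimal sketch of that idea (illustrative, not the crate's `StreamingShap` type):

```rust
/// Online mean of |SHAP| per feature: O(n_features) memory, no history.
struct RunningShapImportance {
    n: u64,
    mean_abs: Vec<f64>,
}

impl RunningShapImportance {
    fn new(n_features: usize) -> Self {
        Self { n: 0, mean_abs: vec![0.0; n_features] }
    }

    /// Fold one explanation's SHAP values into the running means.
    fn update(&mut self, shap: &[f64]) {
        self.n += 1;
        let inv_n = 1.0 / self.n as f64;
        for (m, &s) in self.mean_abs.iter_mut().zip(shap) {
            *m += (s.abs() - *m) * inv_n; // incremental mean update
        }
    }

    fn importance(&self) -> &[f64] {
        &self.mean_abs
    }
}
```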
### Streaming Adaptation

These go beyond the original paper to handle the realities of long-running, non-stationary systems:

- EWMA leaf decay (`leaf_half_life`): an exponential moving average on leaf statistics, so the model gradually forgets old data without needing to replace entire trees
- Lazy histogram decay: the decay math is O(1) per sample instead of O(n_bins), with exact results. The trick is storing samples in un-decayed coordinates and only materializing the decay when bins are actually read at split evaluation time
- Proactive tree replacement (`max_tree_samples`): cycle trees on a timer, independent of drift detectors. Useful when drift is gradual and detectors don't fire
- Split re-evaluation (`split_reeval_interval`): EFDT-inspired re-checking of max-depth leaves to see if splitting would now help
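The lazy-decay trick is worth spelling out: with half-life h, a sample from time t should carry weight 2^(-(T - t)/h) when read at time T. Rather than multiplying every bin by the decay factor on every insert, store each sample with the inflated weight 2^(t/h) and scale the total by 2^(-T/h) once, at read time. A self-contained sketch of the idea (illustrative, not the crate's internals):

```rust
/// One histogram bin with lazy exponential decay.
/// Inserts happen in "un-decayed" coordinates: a sample at time t is
/// stored with inflated weight 2^(t / half_life). Reading at time T
/// multiplies the stored total by 2^(-T / half_life), so each sample
/// ends up weighted exactly 2^(-(T - t) / half_life), yet an insert
/// touches only this bin: O(1) instead of O(n_bins).
/// (A production version periodically rebases to avoid overflow.)
struct LazyDecayBin {
    half_life: f64,
    raw_weight: f64, // sum of inflated weights
}

impl LazyDecayBin {
    fn new(half_life: f64) -> Self {
        Self { half_life, raw_weight: 0.0 }
    }

    /// O(1): nothing is decayed here, no other bin is touched.
    fn add(&mut self, t: f64, weight: f64) {
        self.raw_weight += weight * (t / self.half_life).exp2();
    }

    /// Materialize the decay only when the bin is actually read.
    fn read(&self, now: f64) -> f64 {
        self.raw_weight * (-now / self.half_life).exp2()
    }
}
```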
### Production Infrastructure

- Async streaming via `AsyncSGBT` with tokio channels, concurrent `Predictor` handles, and backpressure
- Model checkpointing with `save_model()` / `load_model()`; drift detector state is preserved across save/load
- Online metrics: incremental MAE, MSE, RMSE, R-squared, accuracy, precision, recall, F1, log loss
- Deterministic seeding for reproducible results
- Python bindings via the `irithyll-python` crate (PyO3 + numpy, GIL-released train/predict)
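Incremental metrics keep only running sums, so they cost O(1) memory and time per observation. A minimal sketch for MAE and RMSE (illustrative, not the crate's `metrics` types):

```rust
/// Incremental MAE/RMSE over a prediction stream.
#[derive(Default)]
struct OnlineRegressionMetrics {
    n: u64,
    abs_sum: f64, // running sum of |error|
    sq_sum: f64,  // running sum of error^2
}

impl OnlineRegressionMetrics {
    fn update(&mut self, y_true: f64, y_pred: f64) {
        let e = y_true - y_pred;
        self.n += 1;
        self.abs_sum += e.abs();
        self.sq_sum += e * e;
    }

    fn mae(&self) -> f64 {
        self.abs_sum / self.n as f64
    }

    fn rmse(&self) -> f64 {
        (self.sq_sum / self.n as f64).sqrt()
    }
}
```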
### Optional Accelerators

- Parallel training (`parallel`): Rayon-based data-parallel tree training
- SIMD histograms (`simd`): AVX2 intrinsics for histogram gradient summation
- Arrow integration (`arrow`): train from `RecordBatch`, predict to arrays
- Parquet I/O (`parquet`): bulk training directly from Parquet files
- ONNX export (`onnx`): export trained models for cross-platform inference
## Quick Start

### Regression

A minimal online regression loop. Import paths and exact signatures are illustrative assumptions, built from the `train_one()` / `predict()` API and the config builder described below:

```rust
// Illustrative only: import paths and signatures are assumptions.
use irithyll::{SGBT, SGBTConfig};

fn main() {
    let config = SGBTConfig::builder().build().expect("valid config");
    let mut model = SGBT::new(config);

    // Learn one sample at a time: no batches, no windows, no retraining.
    for i in 0..1_000 {
        let x = [i as f64 % 10.0, i as f64 % 7.0];
        let y = 3.0 * x[0] + x[1];
        model.train_one(&x, y);
    }
    let y_hat = model.predict(&[4.0, 2.0]);
    println!("prediction: {y_hat}");
}
```
### Binary Classification

A sketch wiring in `LogisticLoss`; the loss-taking constructor shown here is an assumption about the API:

```rust
// Illustrative only: the loss-wiring constructor is an assumption.
use irithyll::{SGBT, SGBTConfig};
use irithyll::loss::LogisticLoss;

fn main() {
    let config = SGBTConfig::builder().build().expect("valid config");
    let mut model = SGBT::with_loss(config, LogisticLoss::default());

    for i in 0..1_000 {
        let x = [(i % 10) as f64];
        let label = if x[0] > 4.5 { 1.0 } else { 0.0 };
        model.train_one(&x, label);
    }
    // With logistic loss, predictions are positive-class probabilities.
    let p = model.predict(&[8.0]);
    println!("p(y = 1) = {p}");
}
```
### Explanations

A sketch of per-prediction TreeSHAP attribution via `explain()` and `explain_named()`; return types are assumptions:

```rust
// Illustrative only: return types of explain()/explain_named() are assumptions.
use irithyll::{SGBT, SGBTConfig};

fn main() {
    let config = SGBTConfig::builder()
        .feature_names(vec!["age".into(), "income".into()])
        .build()
        .expect("valid config");
    let mut model = SGBT::new(config);
    model.train_one(&[35.0, 52_000.0], 1.0);

    // Path-dependent TreeSHAP values, one contribution per feature.
    let shap = model.explain(&[35.0, 52_000.0]);
    // The same values paired with the configured feature names.
    let named = model.explain_named(&[35.0, 52_000.0]);
    println!("{shap:?}\n{named:?}");
}
```
### Multi-Target Regression

A sketch of `MultiTargetSGBT`, which trains T independent models, one per target; signatures are assumptions:

```rust
// Illustrative only: MultiTargetSGBT signatures are assumptions.
use irithyll::{MultiTargetSGBT, SGBTConfig};

fn main() {
    let config = SGBTConfig::builder().build().expect("valid config");
    // T = 2 independent models, one per target.
    let mut model = MultiTargetSGBT::new(config, 2);
    model.train_one(&[1.0, 2.0], &[3.0, 4.0]);
    let y_hat = model.predict(&[1.0, 2.0]); // one value per target
    println!("{y_hat:?}");
}
```
### Async Streaming

A sketch of `AsyncSGBT` with a concurrent `Predictor` handle; everything beyond those two names is an assumption about the API:

```rust
// Illustrative only: method names beyond AsyncSGBT and Predictor
// are assumptions about the API.
use irithyll::{SGBT, SGBTConfig};
use irithyll::stream::AsyncSGBT;

#[tokio::main]
async fn main() {
    let config = SGBTConfig::builder().build().expect("valid config");
    let async_model = AsyncSGBT::spawn(SGBT::new(config));

    // Training samples flow through a bounded tokio channel (backpressure);
    // predictions go through cheap, concurrent Predictor handles.
    let predictor = async_model.predictor();
    async_model.train_one(vec![1.0, 2.0], 3.0).await;
    let y = predictor.predict(vec![1.0, 2.0]).await;
    println!("{y}");
}
```
### Python

A sketch against the `irithyll-python` bindings (the `StreamingGBT` class comes from the bindings crate; the method names mirror the Rust API and are assumptions):

```python
# Illustrative only: method names mirror the Rust API and are assumptions.
import numpy as np
from irithyll import StreamingGBT

model = StreamingGBT()
rng = np.random.default_rng(0)
X = rng.random((1000, 4))
y = X @ np.array([1.0, 2.0, 3.0, 4.0])
for xi, yi in zip(X, y):
    model.train_one(xi, yi)
pred = model.predict(X[0])

# SHAP explanations
shap_values = model.explain(X[0])

# Save/load
model.save_model("model.json")
model = StreamingGBT.load_model("model.json")
```
## Architecture

```text
irithyll/
  loss/            Differentiable loss functions (squared, logistic, softmax, huber)
  histogram/       Streaming histogram binning (uniform, quantile, optional k-means)
  tree/            Hoeffding-bound streaming decision trees
  drift/           Concept drift detectors (Page-Hinkley, ADWIN, DDM) with serializable state
  ensemble/        SGBT boosting loop, config, variants, multi-class, multi-target, parallel
  explain/         TreeSHAP explanations and StreamingShap online importance
  stream/          Async tokio channel-based training runner and predictor handles
  metrics/         Online regression and classification metric trackers
  serde_support/   Model checkpoint/restore serialization
irithyll-python/   PyO3 Python bindings (StreamingGBT, MultiTargetGBT, ShapExplanation)
```
## Configuration

All hyperparameters go through the builder pattern, validated on `build()`. The argument values shown are the defaults from the table below; `feature_names` is illustrative:

```rust
use irithyll::SGBTConfig;

let config = SGBTConfig::builder()
    .n_steps(100)                 // Number of boosting steps (trees)
    .learning_rate(0.0125)        // Shrinkage factor
    .feature_subsample_rate(0.75) // Fraction of features per tree
    .max_depth(6)                 // Maximum tree depth
    .n_bins(64)                   // Histogram bins per feature
    .lambda(1.0)                  // L2 regularization
    .gamma(0.0)                   // Minimum split gain
    .grace_period(200)            // Samples before evaluating splits
    .delta(1e-7)                  // Hoeffding bound confidence
    .feature_names(vec!["x0".into(), "x1".into()])
    .build()
    .expect("valid config");
```
| Parameter | Default | Description |
|---|---|---|
| `n_steps` | `100` | Number of boosting steps (trees in ensemble) |
| `learning_rate` | `0.0125` | Shrinkage factor applied to each tree output |
| `feature_subsample_rate` | `0.75` | Fraction of features sampled per tree |
| `max_depth` | `6` | Maximum depth of each streaming tree |
| `n_bins` | `64` | Number of histogram bins per feature |
| `lambda` | `1.0` | L2 regularization on leaf weights |
| `gamma` | `0.0` | Minimum gain required to make a split |
| `grace_period` | `200` | Minimum samples before evaluating splits |
| `delta` | `1e-7` | Hoeffding bound confidence parameter |
| `drift_detector` | `PageHinkley(0.005, 50.0)` | Drift detection algorithm for tree replacement |
| `variant` | `Standard` | Computational variant (Standard, Skip, MI) |
| `feature_names` | `None` | Optional feature names for named explanations |
| `leaf_half_life` | `None` (disabled) | EWMA decay half-life for leaf statistics |
| `max_tree_samples` | `None` (disabled) | Proactive tree replacement threshold |
| `split_reeval_interval` | `None` (disabled) | Re-evaluation interval for max-depth leaves |
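For intuition on the default drift detector: Page-Hinkley accumulates the deviation of a monitored statistic (e.g. per-sample loss) from its running mean and fires when that accumulation exceeds a threshold. A textbook sketch, assuming the two default parameters are the tolerance δ and threshold λ (not the crate's internal code):

```rust
/// Textbook Page-Hinkley test for detecting an increase in the mean
/// of a monitored statistic such as per-sample loss.
struct PageHinkley {
    delta: f64,   // tolerance: magnitude of change to ignore
    lambda: f64,  // detection threshold
    n: u64,
    mean: f64,    // running mean of the statistic
    cum: f64,     // cumulative deviation m_t
    min_cum: f64, // minimum of m_t seen so far
}

impl PageHinkley {
    fn new(delta: f64, lambda: f64) -> Self {
        Self { delta, lambda, n: 0, mean: 0.0, cum: 0.0, min_cum: 0.0 }
    }

    /// Feed one observation; returns true when drift is detected.
    fn update(&mut self, x: f64) -> bool {
        self.n += 1;
        self.mean += (x - self.mean) / self.n as f64;
        self.cum += x - self.mean - self.delta;
        self.min_cum = self.min_cum.min(self.cum);
        self.cum - self.min_cum > self.lambda
    }
}
```

When the monitored loss jumps after a concept change, `cum` climbs away from its historical minimum and crosses `lambda`, which is the signal SGBT uses to replace a tree.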
## Feature Flags

| Feature | Dependencies | Description |
|---|---|---|
| `serde-json` (default) | `serde_json` | JSON model serialization |
| `serde-bincode` | `bincode` | Compact binary serialization |
| `parallel` | `rayon` | Parallel tree training |
| `simd` | -- | AVX2 histogram acceleration |
| `kmeans-binning` | -- | K-means histogram binning |
| `arrow` | `arrow` | Apache Arrow integration |
| `parquet` | `parquet` | Parquet file I/O |
| `onnx` | `prost` | ONNX model export |
| `neural-leaves` | -- | Experimental MLP leaf models |
| `full` | all above | Enable everything |
## Examples

Run any example with `cargo run --example <name>`:

| Example | Description |
|---|---|
| `basic_regression` | Linear regression with RMSE tracking |
| `classification` | Binary classification with logistic loss |
| `async_ingestion` | Tokio-native async training with concurrent prediction |
| `custom_loss` | Implementing a custom loss function |
| `drift_detection` | Abrupt concept drift with recovery analysis |
| `model_checkpointing` | Save/restore models with prediction verification |
| `streaming_metrics` | Prequential evaluation with windowed metrics |
## Minimum Supported Rust Version
The MSRV is 1.75. This is checked in CI and will only be raised in minor version bumps.
## References
Gunasekara, N., Pfahringer, B., Gomes, H. M., & Bifet, A. (2024). Gradient boosted trees for evolving data streams. Machine Learning, 113, 3325-3352.
Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., & Lee, S.-I. (2020). From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2, 56-67.
## License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT License (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
## Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.