TreeBoost

High-performance GBDT for Rust. GPU-accelerated by default. Production-ready.
TreeBoost is a gradient boosted decision tree engine in pure Rust with automatic multi-backend hardware acceleration. It supports WGPU (all GPUs: NVIDIA, AMD, Intel, Apple), AVX-512, and SVE2, with an optimized scalar fallback, and works out of the box with zero configuration.
Why TreeBoost?
Other Rust GBDT libraries are basic. TreeBoost gives you the performance and features you'd expect from LightGBM or XGBoost—but built for Rust developers, fully typed, and production-ready.
Why Rust GBDT?
- Zero-copy, type-safe data handling
- Deploy without runtime overhead
- Memory safety guarantees
- Excellent for systems where Python isn't an option
What You Get:
- 5-5.5× faster on GPU than scalar CPU for large datasets (100K+ rows)
- Zero configuration — automatic backend selection (GPU → AVX-512 or SVE2 → scalar fallback)
- Advanced features — entropy regularization, conformal intervals, target encoding
- Production features — model checkpointing, inference optimization, feature importance
Quick Start
Rust (Native)
use ;
use DatasetLoader;
let loader = new;
let dataset = loader.load_parquet?;
let config = new
.with_num_rounds
.with_max_depth
.with_learning_rate;
let model = train_binned?;
save_model?;
let predictions = model.predict?;
Python (via PyO3)
=
=
=
= 100
= 6
= 0.1
=
=
How It Works: Automatic Backend Selection
flowchart TD
A{GPU Available?} -->|YES| B[WGPU Tensor-Tile<br/>Vulkan/Metal/DX12]
A -->|NO| C{CPU Architecture}
C -->|x86-64| D{AVX-512?}
C -->|ARM| E{SVE2?}
D -->|YES| F[AVX-512 Tensor-Tile<br/>vpconflictd parallel]
D -->|NO| G[Scalar Backend<br/>AVX2 loads]
E -->|YES| H[SVE2 Tensor-Tile<br/>HISTCNT direct]
E -->|NO| I[Scalar Backend<br/>NEON loads]
WebGPU backend: Works on all GPUs (NVIDIA, AMD, Intel, Apple) via Vulkan, Metal, or DX12. Designed for portability - no installation required beyond your system drivers. Uses Hybrid mode (GPU histogram + CPU tree growth) due to WebGPU's higher dispatch overhead.
CUDA backend: Enables Full GPU mode with custom kernels - 2x+ faster than WebGPU on NVIDIA hardware. Low dispatch latency allows the entire tree building pipeline to run on GPU (histogram, partition, level-wise growth). The speedup grows with larger datasets. Optional but recommended for NVIDIA users.
Coming soon: Native Metal and ROCm backends for Apple and AMD GPUs.
CPU backends: AVX-512 (3rd Gen Xeon+), SVE2 (ARM Neoverse), with optimized scalar fallback.
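The selection logic amounts to a priority walk over detected hardware capabilities. The following is a minimal sketch of that idea, not TreeBoost's actual implementation: the `Capabilities` struct, the `Backend` enum, and `pick_backend` are hypothetical names, and the order follows the documented Auto priority (CUDA > WGPU > AVX-512 > SVE2 > Scalar).

```rust
/// Hypothetical snapshot of what the machine supports.
/// Field names are illustrative and not part of TreeBoost's API.
#[derive(Debug, Clone, Copy, Default)]
pub struct Capabilities {
    pub cuda: bool,
    pub wgpu: bool,
    pub avx512: bool,
    pub sve2: bool,
}

/// Illustrative backend names, mirroring the backends listed in this README.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Cuda,
    Wgpu,
    Avx512,
    Sve2,
    Scalar,
}

/// Return the first supported backend in the documented priority order:
/// CUDA > WGPU > AVX-512 > SVE2 > Scalar.
pub fn pick_backend(caps: Capabilities) -> Backend {
    if caps.cuda {
        Backend::Cuda
    } else if caps.wgpu {
        Backend::Wgpu
    } else if caps.avx512 {
        Backend::Avx512
    } else if caps.sve2 {
        Backend::Sve2
    } else {
        // Always available: the scalar path is the final fallback.
        Backend::Scalar
    }
}
```

Calling `pick_backend` with a machine that reports both `wgpu` and `avx512` yields `Backend::Wgpu`, because GPU backends rank ahead of CPU SIMD backends in that order.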
Explicit Backend Selection
By default, TreeBoost auto-detects the best backend. Specify backends explicitly to override:
Rust:
use ;
use BackendType;
let config = new
.with_num_rounds
.with_max_depth
.with_backend; // Force CPU (AVX2/NEON)
let model = train?;
Available backends:
Scalar // CPU: AVX2 (x86) or NEON (ARM) - no GPU overhead
Avx512 // CPU: AVX-512 tensor-tile (x86-64 only)
Sve2 // CPU: SVE2 tensor-tile (ARM only)
Wgpu // GPU: All GPUs via Vulkan/Metal/DX12 (portable)
Cuda // GPU: NVIDIA CUDA (2x+ faster than WGPU)
Auto // (Default) Auto-detect: CUDA > WGPU > AVX-512 > SVE2 > Scalar
Python:
=
= 100
= 6
= # Force CPU
=
Performance
Competitive Benchmarks
Inference: Optimized for CPU execution via Rayon parallelism. Fast inference on standard compute eliminates GPU deployment overhead—no need for expensive GPU VMs just to serve predictions.
Training: Automatic backend selection balances speed and cost. CPU training is already fast for datasets <100K rows; GPU acceleration (CUDA/WGPU) provides significant speedup for larger datasets (100K–1B+ rows) where the computational advantage justifies GPU deployment.
Compared to other pure-Rust GBDT implementations:
Inference (per-batch prediction):
| Dataset | TreeBoost | gbdt-rs | forust | Speedup |
|---|---|---|---|---|
| 100 samples | 47.4 µs | 135.5 µs | 92.9 µs | 2.9x vs gbdt-rs |
| 1K samples | 202 µs | 1.29 ms | 893 µs | 6.4x vs gbdt-rs |
| 10K samples | 539 µs | 11.7 ms | 8.9 ms | 21.7x vs gbdt-rs |
Training:
| Dataset | TreeBoost | gbdt-rs | forust | Speedup |
|---|---|---|---|---|
| 100K rows, 50 rounds | 263 ms | 3,389 ms | 581 ms | 12.9x vs gbdt-rs |
| 100K rows, 100 rounds (parallel) | 344 ms | 6,600 ms | 2,020 ms | 19.2x vs gbdt-rs |
Benchmarks: NVIDIA CUDA (Full GPU mode), raw float32 data, per-iteration time. See benches/competitors.rs for reproducible methodology.
Running Benchmarks:
# CPU-only comparison (fast, ~2 minutes)
# GPU-enabled comparison (with CUDA acceleration)
# Python cross-library comparison
Core Features
Robustness
- Shannon Entropy regularization — Prevent drift across time windows
- Pseudo-Huber loss — Automatic outlier handling (a smooth approximation of Huber loss, less sensitive to outliers than MSE)
- Split Conformal Prediction — Distribution-free uncertainty intervals on predictions
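To make the outlier-handling claim concrete, here is a minimal sketch of the textbook pseudo-Huber loss and the gradient and Hessian a boosting round would consume. The function names and the `delta` parameter are assumptions for illustration; TreeBoost's internal formulation may differ.

```rust
/// Pseudo-Huber loss: L(r) = delta^2 * (sqrt(1 + (r / delta)^2) - 1),
/// where r = prediction - target. Quadratic near zero, linear for large |r|.
pub fn pseudo_huber_loss(prediction: f64, target: f64, delta: f64) -> f64 {
    let r = prediction - target;
    delta * delta * ((1.0 + (r / delta).powi(2)).sqrt() - 1.0)
}

/// First and second derivatives of the loss with respect to the prediction.
/// The gradient is bounded by `delta`, so a single outlier cannot dominate a split.
pub fn pseudo_huber_grad_hess(prediction: f64, target: f64, delta: f64) -> (f64, f64) {
    let r = prediction - target;
    let scale = 1.0 + (r / delta).powi(2);
    let grad = r / scale.sqrt();
    let hess = 1.0 / (scale * scale.sqrt());
    (grad, hess)
}
```

For small residuals the loss behaves like `r * r / 2` (the same shape as squared error), while for large residuals the gradient saturates near `delta` instead of growing without bound.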
Data Handling
- Ordered Target Encoding — High-cardinality categoricals without target leakage
- Count-Min Sketch — Automatic rare category compression (memory efficient)
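Ordered target encoding avoids leakage by encoding each row from the targets of earlier rows only. The sketch below shows the general technique under stated assumptions: `ordered_target_encode`, `prior`, and `smoothing` are hypothetical names, and the smoothing formula is the common one, not necessarily the one TreeBoost uses.

```rust
use std::collections::HashMap;

/// Encode each row's category using only the targets of rows seen before it.
/// `prior` is the fallback value for unseen categories and `smoothing` is its weight.
pub fn ordered_target_encode(
    categories: &[&str],
    targets: &[f64],
    prior: f64,
    smoothing: f64,
) -> Vec<f64> {
    assert_eq!(categories.len(), targets.len());
    // category -> (running target sum, running row count)
    let mut stats: HashMap<&str, (f64, f64)> = HashMap::new();
    let mut encoded = Vec::with_capacity(categories.len());
    for (&cat, &y) in categories.iter().zip(targets) {
        let (sum, count) = stats.get(cat).copied().unwrap_or((0.0, 0.0));
        encoded.push((sum + prior * smoothing) / (count + smoothing));
        // Update the statistics only after encoding, so a row never sees its own target.
        let entry = stats.entry(cat).or_insert((0.0, 0.0));
        entry.0 += y;
        entry.1 += 1.0;
    }
    encoded
}
```

With categories `["a", "b", "a", "a"]`, targets `[1.0, 0.0, 0.0, 1.0]`, `prior = 0.5`, and `smoothing = 1.0`, the first occurrence of each category receives the prior, and later rows blend in only the targets that preceded them.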
Model Control
- Monotonic/Interaction constraints — Enforce domain knowledge
- Feature importance — Understand model decisions
Production
- Zero-copy serialization — 100MB+ models load in milliseconds via rkyv
- Streaming inference — Predict on 1M rows in seconds
Installation
Rust Library
Python Package
# From PyPI
# From source (requires Rust toolchain)
&&
More Examples
Rust: Train and Save
use ;
use DatasetLoader;
// Load data
let loader = new;
let dataset = loader.load_parquet?;
// Configure and train
let config = new
.with_num_rounds
.with_max_depth
.with_learning_rate
.with_entropy_weight; // Regularize for drift
let model = train_binned?;
save_model?;
// Load and predict
let predictions = model.predict?;
let importances = model.feature_importance;
Python: Conformal Prediction
=
= + * 0.5
=
= 100
= 6
= 0.2 # Reserve 20% for uncertainty estimation
= 0.9 # 90% prediction intervals
=
, , =
# Now you have uncertainty bounds on every prediction
Python: Categorical Features
=
# Target encoding for high-cardinality categorical
=
= 100
= True # Ordered encoding, no leakage
= 100 # Rare categories → "Unknown"
=
=
=
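The rare-category threshold in this example maps infrequent categories to "Unknown". A minimal Rust sketch of how a Count-Min Sketch can drive that decision follows. `CountMinSketch` and `compress_rare` are hypothetical names, and the hashing scheme is an illustrative choice, not TreeBoost's implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Fixed-size frequency table: `depth` hash rows of `width` counters each.
pub struct CountMinSketch {
    width: usize,
    rows: Vec<Vec<u32>>,
}

impl CountMinSketch {
    pub fn new(depth: usize, width: usize) -> Self {
        Self { width, rows: vec![vec![0; width]; depth] }
    }

    /// Hash the item together with the row index, which acts as a per-row seed.
    fn bucket(&self, row: usize, item: &str) -> usize {
        let mut hasher = DefaultHasher::new();
        row.hash(&mut hasher);
        item.hash(&mut hasher);
        (hasher.finish() % self.width as u64) as usize
    }

    /// Record one occurrence of `item` in every row.
    pub fn add(&mut self, item: &str) {
        for row in 0..self.rows.len() {
            let b = self.bucket(row, item);
            self.rows[row][b] = self.rows[row][b].saturating_add(1);
        }
    }

    /// Estimated count: the minimum over rows. It can overestimate, never underestimate.
    pub fn estimate(&self, item: &str) -> u32 {
        (0..self.rows.len())
            .map(|row| self.rows[row][self.bucket(row, item)])
            .min()
            .unwrap_or(0)
    }
}

/// Map a category to "Unknown" when its estimated count is below `threshold`.
/// A threshold of 0 never triggers, which matches "0 = disabled".
pub fn compress_rare<'a>(sketch: &CountMinSketch, category: &'a str, threshold: u32) -> &'a str {
    if sketch.estimate(category) < threshold { "Unknown" } else { category }
}
```

Because the sketch only overestimates, a frequent category is never collapsed by mistake; the trade-off is that a rare category may occasionally survive when it collides with a frequent one in every row.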
CLI Tool
If you're using the binary distribution:
# Train a model
# Make predictions
# Inspect the model
Run treeboost <command> --help for all available options.
Configuration Reference
Core Hyperparameters
| Parameter | Default | Description |
|---|---|---|
| num_rounds | 100 | Number of boosting iterations |
| max_depth | 6 | Maximum tree depth (deeper = more expressive but slower) |
| learning_rate | 0.1 | Shrinkage per round (lower = more stable but slower training) |
| max_leaves | 31 | Maximum leaves per tree |
| lambda | 1.0 | L2 leaf regularization |
| loss | mse | mse or huber (huber for outliers) |
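To show how `lambda` and `learning_rate` interact, here is a minimal sketch using the standard second-order GBDT formulation (as popularized by XGBoost). Whether TreeBoost computes leaf values exactly this way is an assumption, and both function names are hypothetical.

```rust
/// Leaf value under the usual second-order approximation: w = -G / (H + lambda),
/// where G and H are the sums of gradients and Hessians of the rows in the leaf.
/// A larger `lambda` shrinks the leaf value toward zero.
pub fn leaf_weight(grad_sum: f64, hess_sum: f64, lambda: f64) -> f64 {
    -grad_sum / (hess_sum + lambda)
}

/// Add one tree's contribution to the running prediction, scaled by the learning rate.
pub fn boosted_prediction(previous: f64, leaf_value: f64, learning_rate: f64) -> f64 {
    previous + learning_rate * leaf_value
}
```

With the defaults above, a leaf with `G = -4.0` and `H = 3.0` gets the value `4.0 / (3.0 + 1.0) = 1.0`, and the round moves the prediction by only `0.1 * 1.0`. This is why a lower `learning_rate` typically needs more rounds.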
Advanced Features
| Parameter | Default | Description |
|---|---|---|
| entropy_weight | 0.0 | Shannon entropy penalty (prevents drift) |
| subsample | 1.0 | Row sampling ratio per round |
| colsample | 1.0 | Feature sampling ratio per tree |
| calibration_ratio | 0.0 | Fraction of data reserved for conformal calibration |
| conformal_quantile | 0.9 | Quantile for prediction intervals (0.9 = 90% coverage) |
| use_target_encoding | false | Enable ordered target encoding for categoricals |
| cms_threshold | 0 | Rare category threshold (0 = disabled) |
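The `calibration_ratio` and `conformal_quantile` settings correspond to the two steps of split conformal prediction: hold out a calibration set, then take a quantile of its absolute residuals as the interval half-width. The sketch below illustrates the textbook procedure. The function names are hypothetical and TreeBoost's internals may differ.

```rust
/// Half-width of the prediction interval: the ceil((n + 1) * coverage)-th smallest
/// absolute residual on the held-out calibration set.
/// Returns None for empty or mismatched inputs.
pub fn conformal_half_width(cal_targets: &[f64], cal_preds: &[f64], coverage: f64) -> Option<f64> {
    if cal_targets.is_empty() || cal_targets.len() != cal_preds.len() {
        return None;
    }
    let mut residuals: Vec<f64> = cal_targets
        .iter()
        .zip(cal_preds)
        .map(|(y, p)| (y - p).abs())
        .collect();
    residuals.sort_by(|a, b| a.total_cmp(b));
    let n = residuals.len();
    let rank = ((n as f64 + 1.0) * coverage).ceil() as usize;
    if rank > n {
        // Too few calibration points for this coverage: only an unbounded interval is valid.
        return Some(f64::INFINITY);
    }
    Some(residuals[rank.max(1) - 1])
}

/// Symmetric interval around a point prediction.
pub fn interval(prediction: f64, half_width: f64) -> (f64, f64) {
    (prediction - half_width, prediction + half_width)
}
```

The guarantee is distribution-free but depends on the calibration rows being exchangeable with future data, which is why they must be held out from training.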
Constraints
=
=
Troubleshooting
Check which backend is being used:
RUST_LOG=treeboost=debug
GPU not detected:
- Verify your GPU drivers are installed (NVIDIA, AMD, Intel, or Apple)
- WGPU supports Vulkan (Linux), Metal (macOS), DX12 (Windows)
- For NVIDIA CUDA: Install CUDA 12.x separately
Out of memory during training:
Model won't load:
- Ensure you're using the same TreeBoost version for save/load
- The .rkyv file is tied to the binary layout; recompiling TreeBoost may break compatibility
Acknowledgments
TreeBoost builds on the collective knowledge of the GBDT community. We acknowledge the following projects that shaped our design and implementation:
- XGBoost — Industry-standard GBDT with GPU support; inspired our histogram-based approach and Full GPU mode architecture.
- LightGBM — Leaf-wise growth strategy and histogram optimization techniques.
- CatBoost — Ordered target encoding for categorical features and conformal prediction intervals.
- Forust — Pure-Rust GBDT implementation; motivated our focus on Rust-first performance.
- WarpGBM — GPU-accelerated histogram building patterns.
License
Apache License 2.0