TreeBoost

High-performance GBDT for Rust. GPU-accelerated by default. Production-ready.

TreeBoost is a gradient boosted decision tree engine in pure Rust with automatic multi-backend hardware acceleration. It supports WGPU (all GPUs: NVIDIA, AMD, Intel, Apple), AVX-512, and SVE2, with an optimized scalar fallback, and works out of the box with zero configuration.

Why TreeBoost?

Existing pure-Rust GBDT libraries cover only the basics. TreeBoost gives you the performance and features you'd expect from LightGBM or XGBoost, built for Rust developers: fully typed and production-ready.

Why Rust GBDT?

  • Zero-copy, type-safe data handling
  • Deploy without runtime overhead
  • Memory safety guarantees
  • Excellent for systems where Python isn't an option

What You Get:

  • 5-5.5× faster on GPU than scalar CPU for large datasets (100K+ rows)
  • Zero configuration — automatic backend selection (GPU → AVX-512 → scalar fallback)
  • Advanced features — entropy regularization, conformal intervals, target encoding
  • Production features — model checkpointing, inference optimization, feature importance

Quick Start

Rust (Native)

use treeboost::{GBDTConfig, GBDTModel};
use treeboost::dataset::DatasetLoader;

let loader = DatasetLoader::new(255);
let dataset = loader.load_parquet("data.parquet", "target", None)?;

let config = GBDTConfig::new()
    .with_num_rounds(100)
    .with_max_depth(6)
    .with_learning_rate(0.1);

let model = GBDTModel::train_binned(&dataset, config)?;
treeboost::serialize::save_model(&model, "model.rkyv")?;

let predictions = model.predict(&dataset)?;

Python (via PyO3)

import numpy as np
from treeboost import GBDTConfig, GBDTModel

X = np.random.randn(10000, 20).astype(np.float32)
y = (X[:, 0] + X[:, 1] * 2 + np.random.randn(10000) * 0.1).astype(np.float32)

config = GBDTConfig()
config.num_rounds = 100
config.max_depth = 6
config.learning_rate = 0.1

model = GBDTModel.train(X, y, config)
predictions = model.predict(X)
model.save("model.rkyv")

How It Works: Automatic Backend Selection

flowchart TD
    A{GPU Available?} -->|YES| B[WGPU Tensor-Tile<br/>Vulkan/Metal/DX12]
    A -->|NO| C{CPU Architecture}

    C -->|x86-64| D{AVX-512?}
    C -->|ARM| E{SVE2?}

    D -->|YES| F[AVX-512 Tensor-Tile<br/>vpconflictd parallel]
    D -->|NO| G[Scalar Backend<br/>AVX2 loads]

    E -->|YES| H[SVE2 Tensor-Tile<br/>HISTCNT direct]
    E -->|NO| I[Scalar Backend<br/>NEON loads]

WebGPU backend: Works on all GPUs (NVIDIA, AMD, Intel, Apple) via Vulkan, Metal, or DX12. Designed for portability - no installation required beyond your system drivers. Uses Hybrid mode (GPU histogram + CPU tree growth) due to WebGPU's higher dispatch overhead.

CUDA backend: Enables Full GPU mode with custom kernels - 2x+ faster than WebGPU on NVIDIA hardware. Low dispatch latency allows the entire tree building pipeline to run on GPU (histogram, partition, level-wise growth). The speedup grows with larger datasets. Optional but recommended for NVIDIA users.

Coming soon: Native Metal and ROCm backends for Apple and AMD GPUs.

CPU backends: AVX-512 (3rd Gen Xeon+), SVE2 (ARM Neoverse), with optimized scalar fallback.

Explicit Backend Selection

By default, TreeBoost auto-detects the best backend. Specify backends explicitly to override:

Rust:

use treeboost::{GBDTConfig, GBDTModel};
use treeboost::backend::BackendType;

let config = GBDTConfig::new()
    .with_num_rounds(100)
    .with_max_depth(6)
    .with_backend(BackendType::Scalar);  // Force CPU (AVX2/NEON)

let model = GBDTModel::train(&features, num_features, &targets, config, None)?;

Available backends:

BackendType::Scalar   // CPU: AVX2 (x86) or NEON (ARM) - no GPU overhead
BackendType::Avx512   // CPU: AVX-512 tensor-tile (x86-64 only)
BackendType::Sve2     // CPU: SVE2 tensor-tile (ARM only)
BackendType::Wgpu     // GPU: All GPUs via Vulkan/Metal/DX12 (portable)
BackendType::Cuda     // GPU: NVIDIA CUDA (2x+ faster than WGPU)
BackendType::Auto     // (Default) Auto-detect: CUDA > WGPU > AVX-512 > SVE2 > Scalar

Python:

from treeboost import GBDTConfig, GBDTModel

config = GBDTConfig()
config.num_rounds = 100
config.max_depth = 6
config.backend = "scalar"  # Force CPU

model = GBDTModel.train(X, y, config)

Performance

Competitive Benchmarks

Inference: Optimized for CPU execution via Rayon parallelism. Fast inference on standard compute eliminates GPU deployment overhead—no need for expensive GPU VMs just to serve predictions.

Training: Automatic backend selection balances speed and cost. CPU training is already fast for datasets <100K rows; GPU acceleration (CUDA/WGPU) provides significant speedup for larger datasets (100K–1B+ rows) where the computational advantage justifies GPU deployment.

Compared to other pure-Rust GBDT implementations:

Inference (per-batch prediction):

Dataset        TreeBoost   gbdt-rs    forust    Speedup
100 samples    47.4 µs     135.5 µs   92.9 µs   2.9x vs gbdt-rs
1K samples     202 µs      1.29 ms    893 µs    6.4x vs gbdt-rs
10K samples    539 µs      11.7 ms    8.9 ms    21.7x vs gbdt-rs

Training:

Dataset                            TreeBoost   gbdt-rs    forust     Speedup
100K rows, 50 rounds               263 ms      3,389 ms   581 ms     12.9x vs gbdt-rs
100K rows, 100 rounds (parallel)   344 ms      6,600 ms   2,020 ms   19.2x vs gbdt-rs

Benchmarks: NVIDIA CUDA (Full GPU mode), raw float32 data, per-iteration time. See benches/competitors.rs for reproducible methodology.

Running Benchmarks:

# CPU-only comparison (fast, ~2 minutes)
cargo bench --bench competitors

# GPU-enabled comparison (with CUDA acceleration)
cargo bench --bench competitors --features gpu,cuda

# Python cross-library comparison
python benchmarks/benchmark.py --mode cross-library-gpu

Core Features

Robustness

  • Shannon Entropy regularization — Prevent drift across time windows
  • Pseudo-Huber loss — Automatic outlier handling (smoother than MSE)
  • Split Conformal Prediction — Distribution-free uncertainty intervals on predictions
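A minimal Rust sketch combining these robustness options. `with_entropy_weight` appears in the training example later in this README; `with_loss` and `with_calibration_ratio` are assumed builder names mirroring the `loss` and `calibration_ratio` entries in the Configuration Reference, not confirmed API:

use treeboost::{GBDTConfig, GBDTModel};
use treeboost::dataset::DatasetLoader;

let loader = DatasetLoader::new(255);
let dataset = loader.load_parquet("train.parquet", "target", None)?;

// Entropy regularization against drift, Pseudo-Huber loss for outliers,
// and a calibration split for conformal intervals.
// `with_loss` / `with_calibration_ratio` are assumed names (see note above).
let config = GBDTConfig::new()
    .with_num_rounds(100)
    .with_entropy_weight(0.1)
    .with_loss("huber")
    .with_calibration_ratio(0.2);

let model = GBDTModel::train_binned(&dataset, config)?;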

Data Handling

  • Ordered Target Encoding — High-cardinality categoricals without target leakage
  • Count-Min Sketch — Automatic rare category compression (memory efficient)

Model Control

  • Monotonic/Interaction constraints — Enforce domain knowledge
  • Feature importance — Understand model decisions

Production

  • Zero-copy serialization — 100MB+ models load in milliseconds via rkyv
  • Streaming inference — Predict on 1M rows in seconds
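A minimal serving-side sketch covering both points, assuming a `load_model` counterpart to the `save_model` function used in the Quick Start:

use treeboost::dataset::DatasetLoader;

// Zero-copy load via rkyv; `load_model` is an assumed mirror of `save_model`.
let model = treeboost::serialize::load_model("model.rkyv")?;

// Batch prediction runs on CPU via Rayon, so no GPU is needed at serving time.
let loader = DatasetLoader::new(255);
let batch = loader.load_parquet("incoming.parquet", "target", None)?;
let predictions = model.predict(&batch)?;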

Installation

Rust Library

cargo add treeboost
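The GPU benchmark commands later in this README use the gpu and cuda Cargo features; assuming the published crate exposes the same feature flags, GPU support can be enabled at install time:

# Optional GPU backends (feature names assumed from the benchmark commands)
cargo add treeboost --features gpu,cuda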

Python Package

# From PyPI
pip install treeboost

# From source (requires Rust toolchain)
git clone https://github.com/your-org/treeboost
cd treeboost
pip install maturin && maturin develop --release

More Examples

Rust: Train and Save

use treeboost::{GBDTConfig, GBDTModel};
use treeboost::dataset::DatasetLoader;

// Load data
let loader = DatasetLoader::new(255);
let dataset = loader.load_parquet("train.parquet", "target", None)?;

// Configure and train
let config = GBDTConfig::new()
    .with_num_rounds(200)
    .with_max_depth(8)
    .with_learning_rate(0.05)
    .with_entropy_weight(0.1);  // Regularize for drift

let model = GBDTModel::train_binned(&dataset, config)?;
treeboost::serialize::save_model(&model, "model.rkyv")?;

// Load and predict
let predictions = model.predict(&dataset)?;
let importances = model.feature_importance();

Python: Conformal Prediction

import numpy as np
from treeboost import GBDTConfig, GBDTModel

X = np.random.randn(10000, 50).astype(np.float32)
y = (np.sum(X[:, :5], axis=1) + np.random.randn(10000) * 0.5).astype(np.float32)
X_test = np.random.randn(100, 50).astype(np.float32)  # held-out rows for interval prediction

config = GBDTConfig()
config.num_rounds = 100
config.max_depth = 6
config.calibration_ratio = 0.2    # Reserve 20% for uncertainty estimation
config.conformal_quantile = 0.9   # 90% prediction intervals

model = GBDTModel.train(X, y, config)
preds, lower, upper = model.predict_with_intervals(X_test)

# Now you have uncertainty bounds on every prediction
print(f"Prediction: {preds[0]:.2f}, [{lower[0]:.2f}, {upper[0]:.2f}]")

Python: Categorical Features

import numpy as np
import pandas as pd
from treeboost import GBDTConfig, GBDTModel

df = pd.read_csv("data.csv")

# Target encoding for high-cardinality categorical
config = GBDTConfig()
config.num_rounds = 100
config.use_target_encoding = True    # Ordered encoding, no leakage
config.cms_threshold = 100           # Rare categories → "Unknown"

feature_cols = [c for c in df.columns if c != 'target']  # all non-target columns
X = df[feature_cols].values.astype(np.float32)
y = df['target'].values.astype(np.float32)

model = GBDTModel.train(X, y, config)

CLI Tool

If you're using the binary distribution:

# Train a model
treeboost train --data data.csv --target price --output model.rkyv \
  --rounds 100 --max-depth 6 --learning-rate 0.1

# Make predictions
treeboost predict --model model.rkyv --data test.csv --output predictions.json

# Inspect the model
treeboost info --model model.rkyv --importances

Run treeboost <command> --help for all available options.

Configuration Reference

Core Hyperparameters

Parameter       Default   Description
num_rounds      100       Number of boosting iterations
max_depth       6         Maximum tree depth (deeper = more expressive but slower)
learning_rate   0.1       Shrinkage per round (lower = more stable but slower training)
max_leaves      31        Maximum leaves per tree
lambda          1.0       L2 leaf regularization
loss            mse       mse or huber (huber for outliers)

Advanced Features

Parameter             Default   Description
entropy_weight        0.0       Shannon entropy penalty (prevents drift)
subsample             1.0       Row sampling ratio per round
colsample             1.0       Feature sampling ratio per tree
calibration_ratio     0.0       Fraction of data reserved for conformal calibration
conformal_quantile    0.9       Quantile for prediction intervals (0.9 = 90% coverage)
use_target_encoding   false     Enable ordered target encoding for categoricals
cms_threshold         0         Rare category threshold (0 = disabled)

Constraints

from treeboost import MonotonicConstraint  # import path assumed

config.monotonic_constraints = [
    MonotonicConstraint.Increasing,   # Feature 0
    MonotonicConstraint.NONE,         # Feature 1: unconstrained ("None" is reserved in Python)
    MonotonicConstraint.Decreasing,   # Feature 2
]

config.interaction_groups = [
    [0, 1, 2],  # These features can interact
    [3, 4],     # Separate interaction group
]
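A hypothetical Rust equivalent of the constraint config above; the MonotonicConstraint import path and the with_monotonic_constraints / with_interaction_groups builder methods are assumptions, not confirmed API:

use treeboost::GBDTConfig;
use treeboost::MonotonicConstraint;  // import path assumed

// Assumed Rust counterparts of the Python fields above.
let config = GBDTConfig::new()
    .with_monotonic_constraints(vec![
        MonotonicConstraint::Increasing,  // feature 0
        MonotonicConstraint::None,        // feature 1: unconstrained
        MonotonicConstraint::Decreasing,  // feature 2
    ])
    .with_interaction_groups(vec![
        vec![0, 1, 2],  // these features can interact
        vec![3, 4],     // separate interaction group
    ]);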

Troubleshooting

Check which backend is being used:

RUST_LOG=treeboost=debug treeboost train ...
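When using TreeBoost as a library rather than the CLI, the same output can be surfaced by installing a logger in your binary (a sketch assuming the crate logs through the standard log facade, e.g. via the env_logger crate):

fn main() {
    // Honors RUST_LOG=treeboost=debug and prints which backend was selected.
    env_logger::init();

    // ... configure and train as in the Quick Start ...
}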

GPU not detected:

  • Verify your GPU drivers are installed (NVIDIA, AMD, Intel, or Apple)
  • WGPU supports Vulkan (Linux), Metal (macOS), DX12 (Windows)
  • For NVIDIA CUDA: Install CUDA 12.x separately

Out of memory during training:

treeboost train ... --subsample 0.8 --colsample 0.8

Model won't load:

  • Ensure you're using the same TreeBoost version for save/load
  • The .rkyv file is tied to the binary layout; recompiling TreeBoost may break compatibility

Acknowledgments

TreeBoost builds on the collective knowledge of the GBDT community. We acknowledge the following projects that shaped our design and implementation:

  • XGBoost — Industry-standard GBDT with GPU support; inspired our histogram-based approach and Full GPU mode architecture.
  • LightGBM — Leaf-wise growth strategy and histogram optimization techniques.
  • CatBoost — Ordered target encoding for categorical features and conformal prediction intervals.
  • Forust — Pure-Rust GBDT implementation; motivated our focus on Rust-first performance.
  • WarpGBM — GPU-accelerated histogram building patterns.

License

Apache License 2.0