# scry-learn
A scikit-learn-style ML toolkit in pure Rust. No Python runtime, no BLAS, no LAPACK — `cargo add scry-learn` and build.
```rust
use scry_learn::prelude::*;

// Quickstart. The CSV path, split arguments, and hyperparameter
// values below are illustrative placeholders.
let data = Dataset::from_csv("data.csv")?;
let (train, test) = train_test_split(&data, 0.2, Some(42));

let mut rf = RandomForestClassifier::new()
    .n_estimators(100)
    .max_depth(10);
rf.fit(&train)?;

let preds = rf.predict(&test)?;
println!("{preds:?}");
```
Status: 0.x, pre-1.0 — breaking changes are possible between minor versions. This started as a learning project (implement each algorithm from scratch to understand it) and grew into something usable, but it isn't a drop-in replacement for an established stack. If you need a settled Rust ML library today, linfa is the safer pick. The benchmarks below let you decide whether scry-learn fits your use case.
## What's in the box
- Pure-Rust dependencies. No Python, no BLAS/LAPACK, no system libraries.
- TreeSHAP and permutation importance built in.
- Cross-library benchmarks vs linfa and smartcore, with a counting allocator, single-thread enforcement, and accuracy parity gates. Run them yourself.
- Column-major data layout — tree models scan features without a transpose (sketched after this list).
- `#![deny(unsafe_code)]` enforced crate-wide.
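To make the layout point concrete, here is a minimal sketch of why column-major storage suits tree training. The types and names are illustrative, not scry-learn's actual internals:

```rust
// Illustrative sketch, not scry-learn's internals: column-major
// storage keeps every value of one feature contiguous, so a tree's
// split search walks a single cache-friendly slice per feature
// instead of striding across rows or transposing first.
struct ColMajorMatrix {
    cols: Vec<Vec<f64>>, // cols[j][i] = feature j of sample i
}

impl ColMajorMatrix {
    fn feature(&self, j: usize) -> &[f64] {
        &self.cols[j]
    }
}

fn main() {
    let x = ColMajorMatrix {
        cols: vec![vec![1.0, 4.0, 2.0], vec![0.5, 0.1, 0.9]],
    };
    // Candidate-split scan over feature 0: one linear pass.
    let max = x.feature(0).iter().cloned().fold(f64::MIN, f64::max);
    println!("feature 0 max = {max}");
}
```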
## Algorithms

### Supervised
| Model | Classification | Regression |
|---|---|---|
| Decision Tree (CART) | ✓ | ✓ |
| Random Forest | ✓ | ✓ |
| Gradient Boosting | ✓ | ✓ |
| Histogram Gradient Boosting | ✓ | ✓ |
| Linear / Logistic Regression | ✓ | ✓ |
| Ridge | — | ✓ |
| Lasso | — | ✓ |
| ElasticNet | — | ✓ |
| Linear SVM | ✓ | ✓ |
| Kernel SVM | ✓* | ✓* |
| K-Nearest Neighbors | ✓ | ✓ |
| Gaussian Naive Bayes | ✓ | — |
| Multinomial Naive Bayes | ✓ | — |
| Bernoulli Naive Bayes | ✓ | — |
| MLP Neural Network | ✓ | ✓ |
\* Kernel SVM requires `features = ["experimental"]`
### Clustering

| Algorithm | Notes |
|---|---|
| K-Means | k-means++ init, configurable max_iter |
| Mini-Batch K-Means | Streaming-friendly variant |
| DBSCAN | Density-based, automatic cluster count |
| HDBSCAN | Hierarchical density-based |
| Agglomerative | Ward / complete / average / single linkage |
### Preprocessing

- Scaling: StandardScaler, MinMaxScaler, RobustScaler, Normalizer (L1/L2)
- Encoding: OneHotEncoder, LabelEncoder
- Imputation: SimpleImputer (mean, median, most-frequent, constant)
- Dimensionality: PCA, VarianceThreshold, SelectKBest (f_classif)
- Transforms: PolynomialFeatures, ColumnTransformer, Pipeline (see the sketch after this list)
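A sketch of the preprocessing flow. The `StandardScaler::new().fit_transform(&mut data)?` shape matches the sklearn mapping table later in this README; the `Pipeline` chaining shown in the comment is an assumption about the builder API, not confirmed usage:

```rust
use scry_learn::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut data = Dataset::from_csv("data.csv")?; // path is illustrative

    // In-place scaling, as in the sklearn mapping table.
    StandardScaler::new().fit_transform(&mut data)?;

    // Chaining steps through Pipeline is sketched here; the exact
    // builder methods are assumptions:
    // let pipe = Pipeline::new().add(StandardScaler::new()).add(PCA::new(2));
    Ok(())
}
```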
### Model selection

- Search: GridSearchCV, RandomizedSearchCV, BayesSearchCV
- Validation: cross_val_score, stratified k-fold, group k-fold, time series split, repeated CV (see the sketch below)
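The `cross_val_score` call shape below is taken from the sklearn mapping table later in this README; the dataset loading and printing around it are illustrative assumptions:

```rust
use scry_learn::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = Dataset::from_csv("data.csv")?; // path is illustrative
    let rf = RandomForestClassifier::new().n_estimators(100);

    // 5-fold cross-validation scored with the accuracy metric.
    let scores = cross_val_score(&rf, &data, 5, accuracy);
    println!("{scores:?}");
    Ok(())
}
```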
### Metrics

- Classification metrics: accuracy, precision, recall, F1, balanced accuracy, Cohen's kappa, confusion matrix, ROC AUC, PR curve, log loss
- Regression metrics: MSE, MAPE, R², explained variance
- Clustering metrics: silhouette score, Calinski-Harabasz, Davies-Bouldin, adjusted Rand index
- Calibration: Platt scaling, isotonic regression (Platt's sigmoid sketched below)
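Platt scaling fits a sigmoid that maps raw classifier scores to probabilities. The sketch below shows only the forward mapping with toy parameters; how scry-learn fits `a` and `b` is not shown here:

```rust
// Platt scaling: p(y = 1 | s) = 1 / (1 + exp(a*s + b)), where a and b
// are fitted on held-out scores. Parameter values here are toy numbers.
fn platt(s: f64, a: f64, b: f64) -> f64 {
    1.0 / (1.0 + (a * s + b).exp())
}

fn main() {
    let (a, b) = (-1.7, 0.0); // hypothetical fitted parameters
    println!("score 2.0 -> p = {:.3}", platt(2.0, a, b));
}
```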
### Explainability

- TreeSHAP — exact Shapley values for tree ensembles in polynomial time (Lundberg & Lee, 2018)
- Permutation importance — model-agnostic feature importance with configurable repeats (Breiman, 2001)
```rust
use scry_learn::prelude::*;

// ensemble_tree_shap's call shape matches the sklearn mapping table
// below; permutation_importance's repeat-count argument is assumed.
let shap_values = ensemble_tree_shap(&rf, &features);
let importance = permutation_importance(&rf, &data, 10);
```
### Text

- CountVectorizer — n-gram term counts, min/max document frequency, sparse CSR output
- TfidfVectorizer — TF-IDF with L1/L2 normalization, sublinear TF, smooth IDF (usage sketched after this list)
- Tokenizer — zero-dependency whitespace/punctuation-aware tokenizer
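A usage sketch for the vectorizers. The builder and `fit_transform` names follow the crate's sklearn-style conventions but are assumptions here:

```rust
use scry_learn::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let docs = ["the quick brown fox", "the lazy dog"];

    // TF-IDF with defaults; fit_transform is assumed to return the
    // sparse CSR matrix mentioned in the feature list above.
    let mut tfidf = TfidfVectorizer::new();
    let _x = tfidf.fit_transform(&docs)?;
    println!("{} documents vectorized", docs.len());
    Ok(())
}
```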
### Anomaly detection

- Isolation Forest — unsupervised anomaly detection via random partitioning (usage sketched below)
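A hypothetical call shape for Isolation Forest, modeled on the crate's other estimators; the method names and score convention are assumptions:

```rust
use scry_learn::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = Dataset::from_csv("data.csv")?; // path is illustrative
    let mut forest = IsolationForest::new().n_estimators(100);
    forest.fit(&data)?;
    // Per-sample anomaly scores (method name assumed).
    let scores = forest.predict(&data)?;
    println!("{scores:?}");
    Ok(())
}
```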
## sklearn → scry-learn
The API closely tracks scikit-learn.
| scikit-learn (Python) | scry-learn (Rust) |
|---|---|
| `from sklearn.ensemble import RandomForestClassifier` | `use scry_learn::prelude::*;` |
| `rf = RandomForestClassifier(n_estimators=100)` | `let mut rf = RandomForestClassifier::new().n_estimators(100);` |
| `rf.fit(X_train, y_train)` | `rf.fit(&train)?;` |
| `rf.predict(X_test)` | `rf.predict(&test)?` |
| `cross_val_score(rf, X, y, cv=5)` | `cross_val_score(&rf, &data, 5, accuracy)` |
| `GridSearchCV(rf, param_grid, cv=5)` | `GridSearchCV::new(rf, param_grid, 5, accuracy)` |
| `shap.TreeExplainer(rf).shap_values(X)` | `ensemble_tree_shap(&rf, &features)` |
| `StandardScaler().fit_transform(X)` | `StandardScaler::new().fit_transform(&mut data)?` |
## Benchmarks
Cross-library benchmarks against linfa and smartcore. The harness enforces:
- Real UCI datasets only — no synthetic data with RNG bias
- Counting allocator — actual heap bytes, not RSS estimates (pattern sketched after this list)
- Single-thread execution, asserted programmatically (not assumed via env var)
- Accuracy parity gates — timing only reported when all libraries converge within ε=3%
- Identical preprocessing across libraries
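The harness's counting allocator isn't reproduced here, but the standard Rust pattern it describes looks like this: wrap the system allocator behind `#[global_allocator]` and count bytes and calls atomically. A minimal sketch (bench code typically lives outside the `#![deny(unsafe_code)]` library crate, since `GlobalAlloc` is unsafe):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Every allocation goes through this wrapper, so heap bytes are
// measured exactly rather than estimated from RSS.
struct CountingAlloc;

static BYTES: AtomicUsize = AtomicUsize::new(0);
static CALLS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        CALLS.fetch_add(1, Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static ALLOC: CountingAlloc = CountingAlloc;

fn main() {
    let before = BYTES.load(Ordering::Relaxed);
    let v: Vec<u64> = (0..1_000).collect();
    println!(
        "{} heap bytes across {} allocations (vec len {})",
        BYTES.load(Ordering::Relaxed) - before,
        CALLS.load(Ordering::Relaxed),
        v.len()
    );
}
```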
Run them yourself:

```sh
cargo bench

# Extended scaling curves (500 / 2K / 10K samples)
cargo bench --bench scaling   # bench target name is illustrative
```
## Install
```toml
[dependencies]
scry-learn = "0.1"
```
### Optional features
| Feature | What it enables |
|---|---|
| `csv` | `Dataset::from_csv()` file loading |
| `serde` | Serialize / deserialize models |
| `polars` | Polars DataFrame interop |
| `mmap` | Memory-mapped dataset loading |
| `experimental` | Kernel SVM (RBF, polynomial kernels) |
```toml
scry-learn = { version = "0.1", features = ["csv", "serde"] }
```
## Examples
```sh
# 5-fold stratified CV across 8 models on 4 UCI datasets
cargo run --release --example model_comparison   # example name is illustrative

# Head-to-head comparison vs linfa and smartcore
cargo run --release --example crosslib           # example name is illustrative
```
## Test suite
843 tests across 24 test files cover correctness, convergence, numerical stability, and cross-library parity.
| Test suite | What it validates |
|---|---|
| `correctness` | Accuracy verified against sklearn reference outputs |
| `convergence` | Monotonic improvement and `max_iter` stability |
| `numerical_stability` | NaN/Inf handling, gradient norm tracking |
| `mathematical_invariants` | SHAP additivity, Σφᵢ = pred − E[f(x)] (sketched below) |
| `golden_regression_test` | Deterministic snapshot tests |
| `statistical_robustness` | Bootstrap confidence interval validity |
| `edge_cases` | Empty datasets, single samples, NaN/Inf inputs |
| `production_bench` | Heap memory, allocation counts, scaling curves |
| `memory_crosslib` | Heap usage comparison across libraries |
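For the `mathematical_invariants` row: SHAP additivity means the base value plus the per-feature attributions must reconstruct the model's prediction. A minimal, self-contained illustration of the check (not the suite's actual test code):

```rust
/// Additivity: base + Σφᵢ must equal the prediction within tolerance.
fn shap_additivity_holds(base: f64, phi: &[f64], pred: f64) -> bool {
    (base + phi.iter().sum::<f64>() - pred).abs() <= 1e-6
}

fn main() {
    // Toy numbers: expected value 0.30, two attributions, prediction 0.40.
    assert!(shap_additivity_holds(0.30, &[0.15, -0.05], 0.40));
    println!("additivity holds");
}
```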
## Contributing
Issues and PRs welcome. Please open an issue before large changes.
## License
MIT OR Apache-2.0