anofox-ml

A scikit-learn-inspired machine learning library for Rust, built on ndarray.

Features

Category	anofox-ml	scikit-learn equivalent
Preprocessing & scaling	`StandardScaler`, `MinMaxScaler`, `MaxAbsScaler`, `RobustScaler`, `Normalizer`, `Binarizer`, `KBinsDiscretizer`, `PolynomialFeatures`, `PowerTransformer`, `QuantileTransformer`, `SimpleImputer`, `OneHotEncoder`, `OrdinalEncoder`, `LabelEncoder`	`sklearn.preprocessing`, `sklearn.impute`
Decomposition	`Pca`, `TruncatedSvd`, `KernelPca`, `Nmf` (NNDSVD), `FastIca`	`sklearn.decomposition`
Cross-decomposition	`PlsRegression` (PLS1), `Cca`	`sklearn.cross_decomposition`
Manifold learning	`ClassicalMds`, `Isomap`, `LocallyLinearEmbedding`, `TSne`	`sklearn.manifold`
Feature selection	`VarianceThreshold`, `MutualInformationSelector`, `SelectKBest`, `SelectFromModel`, `Rfe`, `Rfecv`, `SequentialFeatureSelector`	`sklearn.feature_selection`
Neighbors	`KnnClassifier`, `KnnRegressor` (KD-tree), `LocalOutlierFactor`	`sklearn.neighbors`
Linear models	`OlsRegressor`, `RidgeRegressor` (+CV, +sample_weight), `LassoRegressor` (+CV), `ElasticNetRegressor` (+CV), `HuberRegressor`, `QuantileRegressor`, `IsotonicRegressor`, `WlsRegressor`, `LogisticRegressor`, `BayesianRidge`, `ARDRegression`, `Lars`, `LassoLarsIC`, `OrthogonalMatchingPursuit`, `RansacRegressor`, `TheilSenRegressor`, `KernelRidge`, `TransformedTargetRegressor`, `SgdClassifier`, `SgdRegressor`, `PassiveAggressiveClassifier`, `PassiveAggressiveRegressor`	`sklearn.linear_model`
GLMs	`PoissonRegressor`, `BinomialRegressor`, `TweedieRegressor`, `GammaRegressor`	`sklearn.linear_model` GLM family
Discriminant analysis	`LinearDiscriminantAnalysis` (+`transform`), `QuadraticDiscriminantAnalysis`	`sklearn.discriminant_analysis`
Trees	`DecisionTreeClassifier` (+`predict_proba`), `DecisionTreeRegressor`	`sklearn.tree`
Ensemble	`RandomForest{Classifier,Regressor}`, `ExtraTrees{Classifier,Regressor}`, `GradientBoosting{Classifier,Regressor}`, `HistGradientBoosting{Classifier,Regressor}`, `LgbmClassifier`, `LgbmRegressor`, `Bagging{Classifier,Regressor}`, `AdaBoost{Classifier,Regressor}`, `Voting{Classifier,Regressor}`, `Stacking{Classifier,Regressor}` (+`push_proba`), `CalibratedClassifierCV`, `IsolationForest` (rayon-parallel)	`sklearn.ensemble`, `lightgbm`
Clustering	`KMeans`, `MiniBatchKMeans`, `Dbscan`, `Hdbscan`, `Optics`, `Birch`, `AgglomerativeClustering` (Ward/single/complete/average), `SpectralClustering`, `MeanShift`, `AffinityPropagation`, `GaussianMixture`, `BayesianGaussianMixture`	`sklearn.cluster`, `sklearn.mixture`
Naive Bayes	`GaussianNB`, `MultinomialNB`, `BernoulliNB` (all with `predict_proba`)	`sklearn.naive_bayes`
SVM	`Svc`, `Svr`, `NuSvc`, `NuSvr`, `LinearSvc`, `LinearSvr`, `OneClassSvm`	`sklearn.svm`
Gaussian processes	`GaussianProcessRegressor` (RBF, Matern, RationalQuadratic, White, Constant, sums/products), `GaussianProcessClassifier` (Laplace)	`sklearn.gaussian_process`
Neural networks	`MlpClassifier`, `MlpRegressor`	`sklearn.neural_network`
Multi-output	`MultiOutputRegressor`, `MultiOutputClassifier`, `RegressorChain`, `ClassifierChain`	`sklearn.multioutput`
Text	`CountVectorizer`, `TfidfVectorizer`, `HashingVectorizer`	`sklearn.feature_extraction.text`
Inspection	`permutation_importance` (rayon-parallel)	`sklearn.inspection`
Metrics	`accuracy_score`, `precision`, `recall`, `f1_score`, `confusion_matrix`, `roc_auc_score`, `roc_curve`, `precision_recall_curve`, `log_loss`, `brier_score_loss`, `matthews_corrcoef`, `cohen_kappa_score`, `silhouette_score`, `adjusted_rand_score`, `mse`, `mae`, `r2_score`, `mean_absolute_percentage_error`, `median_absolute_error`, `mean_squared_log_error`, ...	`sklearn.metrics`
Utilities	`train_test_split`, `cross_val_score`, `cross_validate`, `grid_search_cv`, `randomized_search_cv`, `halving_grid_search_cv`, `halving_random_search_cv`, `k_fold` (+ stratified / group / time-series / shuffle / leave-one-out), `learning_curve`, `validation_curve`, `Pipeline`, `ColumnTransformer`, `FunctionTransformer`, `FeatureUnion`	`sklearn.model_selection`, `sklearn.pipeline`, `sklearn.compose`
I/O & persistence	CSV reader with ndarray integration, JSON / bincode serde round-tripping for fitted models	`pandas.read_csv`, `joblib.dump`

Quick Start

Add anofox-ml to your project:

[dependencies]
anofox-ml = "0.1"
ndarray = "0.16"

Train a KNN classifier with standardized features:

use anofox_ml::prelude::*;
use ndarray::array;

fn main() -> anofox_ml::core::Result<()> {
    // Sample data
    let x_train = array![[1.0, 2.0], [2.0, 3.0], [3.0, 4.0],
                          [8.0, 9.0], [9.0, 10.0], [10.0, 11.0]];
    let y_train = array![0.0, 0.0, 0.0, 1.0, 1.0, 1.0];

    // Scale features
    let scaler = StandardScaler::new().fit(&x_train)?;
    let x_scaled = scaler.transform(&x_train)?;

    // Fit KNN classifier
    let knn = KnnClassifier::new(3);
    let model = knn.fit(&x_scaled, &y_train)?;

    // Predict and evaluate
    let x_test = array![[2.0, 3.0], [9.0, 10.0]];
    let x_test_scaled = scaler.transform(&x_test)?;
    let predictions = model.predict(&x_test_scaled)?;

    let acc = accuracy_score(&array![0.0, 1.0], &predictions);
    println!("Accuracy: {acc:.2}");

    Ok(())
}

Architecture

anofox-ml is organized as a Cargo workspace with focused crates. You can depend on the umbrella anofox-ml crate for everything, or pick individual crates for smaller dependency trees.

anofox-ml (facade)
  +-- anofox-ml-core              Core traits, error types, Pipeline, utilities
  +-- anofox-ml-metrics           Classification, regression, clustering metrics
  +-- anofox-ml-preprocessing     Scalers, PCA, KernelPCA, NMF, FastICA, TruncatedSVD,
                               PLS, CCA, feature selection, RFE/RFECV/SFS
  +-- anofox-ml-neighbors         KNN with KD-tree, LocalOutlierFactor
  +-- anofox-ml-trees             CART decision trees with predict_proba
  +-- anofox-ml-ensemble          Random Forest, ExtraTrees, Gradient Boosting,
                               HistGradientBoosting, LightGBM-lite, AdaBoost,
                               Bagging, Voting, Stacking, Calibrated, IsolationForest
  +-- anofox-ml-cluster           KMeans, MiniBatchKMeans, DBSCAN, HDBSCAN, OPTICS,
                               Birch, Agglomerative, Spectral, MeanShift, AP,
                               GaussianMixture, BayesianGaussianMixture
  +-- anofox-ml-naive-bayes       Gaussian/Multinomial/Bernoulli NB
  +-- anofox-ml-discriminant      LDA (with transform) and QDA
  +-- anofox-ml-svm               SVC, SVR, NuSVC, NuSVR, LinearSVC/SVR, OneClassSVM
  +-- anofox-ml-regression        OLS, Ridge (+weighted), Lasso, ElasticNet, GLMs,
                               BayesianRidge, ARD, LARS, OMP, KernelRidge,
                               RANSAC, TheilSen, Tweedie, TransformedTarget
  +-- anofox-ml-linear            SGD, PassiveAggressive
  +-- anofox-ml-gaussian-process  GP regressor (5 kernels + composites) & classifier
  +-- anofox-ml-manifold          ClassicalMDS, Isomap, LLE, t-SNE
  +-- anofox-ml-neural-networks   MLPClassifier, MLPRegressor
  +-- anofox-ml-text              Count/Tfidf/Hashing vectorizers
  +-- anofox-ml-io                CSV loading

Type-state pattern

Estimators use a compile-time type-state pattern to separate unfitted parameters from fitted models. Calling fit() on an unfitted struct returns a distinct Fitted* type that implements Predict or Transform. This makes it a compile error to call predict() on an unfitted estimator.

KnnClassifier --fit()--> FittedKnnClassifier --predict()--> Array1
StandardScaler --fit()--> FittedStandardScaler --transform()--> Array2
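
The flow above can be illustrated with a minimal, std-only sketch. `MeanCenterer`, `FittedMeanCenterer`, and `FitError` are hypothetical toy types invented for this example; they are not the anofox-ml definitions, which operate on ndarray arrays. The point is only that `transform` exists on the fitted type and nowhere else.

```rust
// Hypothetical toy types for illustration; not the anofox-ml definitions.

/// Unfitted estimator: holds only hyperparameters (none in this toy case).
pub struct MeanCenterer;

/// Fitted model: holds the learned state. Only this type has `transform`.
pub struct FittedMeanCenterer {
    mean: f64,
}

#[derive(Debug, PartialEq)]
pub enum FitError {
    EmptyInput,
}

impl MeanCenterer {
    pub fn new() -> Self {
        MeanCenterer
    }

    /// Learning the mean produces a value of a different type.
    pub fn fit(&self, x: &[f64]) -> Result<FittedMeanCenterer, FitError> {
        if x.is_empty() {
            return Err(FitError::EmptyInput);
        }
        let mean = x.iter().sum::<f64>() / x.len() as f64;
        Ok(FittedMeanCenterer { mean })
    }
}

impl FittedMeanCenterer {
    /// Subtracts the learned mean from every value.
    pub fn transform(&self, x: &[f64]) -> Vec<f64> {
        x.iter().map(|v| v - self.mean).collect()
    }
}

fn main() -> Result<(), FitError> {
    let fitted = MeanCenterer::new().fit(&[1.0, 2.0, 3.0])?;
    println!("{:?}", fitted.transform(&[2.0, 4.0])); // prints [0.0, 2.0]

    // The next line would not compile: `MeanCenterer` has no `transform`.
    // MeanCenterer::new().transform(&[2.0]);
    Ok(())
}
```

Because the learned state lives only in the fitted type, there is no "is this fitted?" flag to check at run time.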

Core traits

Trait	Purpose
`Fit<F>`	Supervised fitting: `fit(&self, x, y) -> Fitted`
`FitUnsupervised<F>`	Unsupervised fitting: `fit(&self, x) -> Fitted`
`FitWeighted<F>`	Supervised fitting with per-sample `sample_weight`
`Predict<F>`	Generate predictions from fitted model
`PredictProba<F>`	Class probabilities; rows sum to 1
`PredictLogProba<F>`	Log of `predict_proba` (auto-derived)
`DecisionFunction<F>`	Real-valued per-class decision scores
`RegressorScore<F>` / `ClassifierScore<F>`	`score()` (R² / accuracy)
`Transform<F>`	Transform feature matrix
`InverseTransform<F>`	Reverse a transformation
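
One plausible way to get an "auto-derived" `PredictLogProba` is a blanket implementation over every `PredictProba` type. The sketch below assumes that design; the trait shapes are simplified and hypothetical (no `F` type parameter, plain slices instead of ndarray arrays), and `ToyLogistic` is an invented stand-in for a fitted classifier.

```rust
// Hypothetical, simplified trait shapes; the real traits are generic over a
// float type F and use ndarray arrays.

/// Class probabilities for each input value; every row sums to 1.
pub trait PredictProba {
    fn predict_proba(&self, x: &[f64]) -> Vec<Vec<f64>>;
}

/// Log-probabilities, provided for free to every `PredictProba` type.
pub trait PredictLogProba {
    fn predict_log_proba(&self, x: &[f64]) -> Vec<Vec<f64>>;
}

// Blanket impl: implementors write `predict_proba` once and get the log
// version without any extra code.
impl<T: PredictProba> PredictLogProba for T {
    fn predict_log_proba(&self, x: &[f64]) -> Vec<Vec<f64>> {
        self.predict_proba(x)
            .into_iter()
            .map(|row| row.into_iter().map(f64::ln).collect())
            .collect()
    }
}

/// Toy one-feature logistic model standing in for a fitted classifier.
pub struct ToyLogistic {
    pub weight: f64,
    pub bias: f64,
}

impl PredictProba for ToyLogistic {
    fn predict_proba(&self, x: &[f64]) -> Vec<Vec<f64>> {
        x.iter()
            .map(|v| {
                let p1 = 1.0 / (1.0 + (-(self.weight * v + self.bias)).exp());
                vec![1.0 - p1, p1]
            })
            .collect()
    }
}

fn main() {
    let model = ToyLogistic { weight: 1.0, bias: 0.0 };
    // At x = 0 both classes have probability 0.5, so both logs equal ln(0.5).
    let log_p = model.predict_log_proba(&[0.0]);
    println!("{log_p:?}");
}
```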

sklearn parity

Every estimator in anofox-ml is validated against scikit-learn 1.8.0 via golden fixtures (test_harness/generators/gen_*.py) and corresponding Rust tests in crates/anofox-ml/tests/golden_*.rs. Per-estimator parity notes — including tolerances, sample-weight behaviour, missing options, and asymptotic complexity — live under validation/sklearn_parity/.

The pinned sklearn version (1.8.0) is enforced by test_harness/check_sklearn_version.py. Bumping the pin requires ./test_harness/regenerate_all.sh followed by a full cargo test --workspace to confirm tolerances still hold.
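
A tolerance check of the kind a golden test needs could look roughly like the sketch below. `first_mismatch` is a hypothetical helper, not part of the test harness; it uses the common combined rule |actual - expected| <= atol + rtol * |expected|, and the tolerances shown are placeholders rather than the values recorded in the parity notes.

```rust
/// Returns the index and values of the first element outside the tolerance
/// |actual - expected| <= atol + rtol * |expected|, or None if all match.
/// Hypothetical helper for illustration only.
pub fn first_mismatch(
    actual: &[f64],
    expected: &[f64],
    rtol: f64,
    atol: f64,
) -> Option<(usize, f64, f64)> {
    assert_eq!(actual.len(), expected.len(), "length mismatch");
    actual
        .iter()
        .zip(expected)
        .enumerate()
        .find_map(|(i, (&a, &e))| {
            // Comparisons involving NaN are false, so NaN counts as a mismatch.
            let within = (a - e).abs() <= atol + rtol * e.abs();
            if within { None } else { Some((i, a, e)) }
        })
}

fn main() {
    // Made-up values standing in for a Rust prediction and a stored fixture.
    let ours = [0.1 + 0.2, 1.0];
    let golden = [0.3, 1.0];
    assert_eq!(first_mismatch(&ours, &golden, 1e-9, 1e-12), None);
    println!("fixture matches within tolerance");
}
```

Reporting the first offending index and both values, rather than a bare boolean, makes a failing fixture easier to diagnose after a pin bump.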

Algorithms

See the feature table above for the full list. New since the original release:

Linear models: BayesianRidge / ARDRegression, LARS / LassoLars / LassoLarsIC, OrthogonalMatchingPursuit, KernelRidge (with sample_weight), Tweedie / Gamma GLMs, TransformedTargetRegressor, PassiveAggressive, RANSAC, TheilSen, Ridge with sample_weight.
Cluster: MiniBatchKMeans, AgglomerativeClustering (4 linkages), SpectralClustering, MeanShift, AffinityPropagation, Birch, OPTICS, HDBSCAN, GaussianMixture, BayesianGaussianMixture.
Decomposition / manifold: TruncatedSVD, KernelPCA, NMF (NNDSVD), FastICA, ClassicalMDS, Isomap, LocallyLinearEmbedding, t-SNE.
Cross-decomposition: PLSRegression, CCA.
Discriminant: LinearDiscriminantAnalysis (with transform), QDA.
Gaussian processes: regressor (5 kernels + composites) and Laplace classifier.
Outlier detection: IsolationForest (rayon-parallel), LocalOutlierFactor.
Meta-estimators: MultiOutputRegressor/Classifier, RegressorChain, ClassifierChain, StackingClassifier (with predict_proba), CalibratedClassifierCV.
Feature selection: RFE, RFECV, SequentialFeatureSelector (forward).
Inspection: permutation_importance (rayon-parallel).
Search: halving_grid_search_cv, halving_random_search_cv.
Text: CountVectorizer, TfidfVectorizer, HashingVectorizer.

Metrics

Classification: accuracy_score, precision, recall, f1_score, confusion_matrix, macro/micro/weighted averaging
Regression: mse, mae, r2_score
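
For reference, the three regression metrics reduce to a few lines each. The functions below are a std-only sketch with deliberately different names (`mse_sketch`, `mae_sketch`, `r2_sketch`); they are not the library's signatures, which are not shown here, and they assume non-empty inputs of equal length.

```rust
/// Mean squared error: average of squared residuals.
pub fn mse_sketch(y_true: &[f64], y_pred: &[f64]) -> f64 {
    assert_eq!(y_true.len(), y_pred.len(), "length mismatch");
    let n = y_true.len() as f64;
    y_true.iter().zip(y_pred).map(|(t, p)| (t - p).powi(2)).sum::<f64>() / n
}

/// Mean absolute error: average of absolute residuals.
pub fn mae_sketch(y_true: &[f64], y_pred: &[f64]) -> f64 {
    assert_eq!(y_true.len(), y_pred.len(), "length mismatch");
    let n = y_true.len() as f64;
    y_true.iter().zip(y_pred).map(|(t, p)| (t - p).abs()).sum::<f64>() / n
}

/// Coefficient of determination: 1 - SS_res / SS_tot.
/// A constant `y_true` makes SS_tot zero; that case is not handled here.
pub fn r2_sketch(y_true: &[f64], y_pred: &[f64]) -> f64 {
    assert_eq!(y_true.len(), y_pred.len(), "length mismatch");
    let mean = y_true.iter().sum::<f64>() / y_true.len() as f64;
    let ss_res: f64 = y_true.iter().zip(y_pred).map(|(t, p)| (t - p).powi(2)).sum();
    let ss_tot: f64 = y_true.iter().map(|t| (t - mean).powi(2)).sum();
    1.0 - ss_res / ss_tot
}

fn main() {
    let y_true = [1.0, 2.0, 3.0];
    let y_pred = [1.0, 2.0, 4.0];
    println!("mse = {:.4}", mse_sketch(&y_true, &y_pred));
    println!("mae = {:.4}", mae_sketch(&y_true, &y_pred));
    println!("r2  = {:.4}", r2_sketch(&y_true, &y_pred)); // SS_res = 1, SS_tot = 2
}
```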

Utilities

train_test_split, cross_val_score, Pipeline

Benchmarks

anofox-ml is faster than scikit-learn on every benchmark listed below, with speedups ranging from 1.1x to 21.8x. Measurements were taken on the same machine with identical datasets and parameters.

Algorithm	Operation	sklearn (ms)	anofox-ml (ms)	Speedup
GaussianNB	fit 5000×20	6.34	0.29	21.8x
DecisionTree	predict 5000×20	0.10	0.007	14.6x
KNN	predict 1000×50	6.34	0.73	8.7x
KMeans	fit 5000×20	114.16	20.51	5.6x
StandardScaler	fit+transform 1000×50	0.59	0.15	3.9x
StandardScaler	fit+transform 10000×100	6.78	3.11	2.2x
RandomForest	fit 5000×20	1039.67	511.20	2.0x
RandomForest	predict 5000×20	5.93	3.82	1.6x
DecisionTree	fit 5000×20	78.45	59.95	1.3x
GaussianNB	predict 5000×20	0.31	0.23	1.3x
KNN	fit 1000×50	0.31	0.29	1.1x

Key optimizations: incremental sorted-index split finding for decision trees, BinaryHeap-based KD-tree pruning for KNN, vectorized distance computation with rayon parallelism for KMeans, and batch prediction for Random Forest.
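
To illustrate the heap idea behind the KNN optimization, here is a minimal sketch of a bounded max-heap that keeps the k best candidates seen so far. It scans all points linearly rather than walking a KD-tree, so it is not the library's implementation; `Candidate` and `k_nearest` are hypothetical names. In a KD-tree search, the distance at the top of the heap is the bound used to skip subtrees that cannot contain a closer point.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Candidate neighbor, ordered by distance so the heap top is the farthest.
struct Candidate {
    dist_sq: f64,
    index: usize,
}

impl Ord for Candidate {
    fn cmp(&self, other: &Self) -> Ordering {
        // total_cmp gives f64 a total order, which BinaryHeap requires.
        self.dist_sq
            .total_cmp(&other.dist_sq)
            .then(self.index.cmp(&other.index))
    }
}

impl PartialOrd for Candidate {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for Candidate {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for Candidate {}

fn squared_distance(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Indices of the `k` points nearest to `query`, closest first.
pub fn k_nearest(points: &[Vec<f64>], query: &[f64], k: usize) -> Vec<usize> {
    let mut heap: BinaryHeap<Candidate> = BinaryHeap::with_capacity(k + 1);
    for (index, point) in points.iter().enumerate() {
        let dist_sq = squared_distance(point, query);
        if heap.len() < k {
            heap.push(Candidate { dist_sq, index });
        } else if heap.peek().is_some_and(|worst| dist_sq < worst.dist_sq) {
            // Closer than the current k-th best: evict the farthest candidate.
            heap.pop();
            heap.push(Candidate { dist_sq, index });
        }
    }
    // Ascending order, so the closest neighbor comes first.
    heap.into_sorted_vec().into_iter().map(|c| c.index).collect()
}

fn main() {
    let points = vec![
        vec![0.0, 0.0],
        vec![5.0, 5.0],
        vec![1.0, 1.0],
        vec![0.5, 0.0],
    ];
    let nearest = k_nearest(&points, &[0.0, 0.0], 2);
    println!("{nearest:?}"); // prints [0, 3]
}
```

Each point costs at most O(log k) heap work, and most points cost only a single comparison against the heap top.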

Reproduce with:

cargo bench -p anofox-ml
uv run benchmarks/compare.py

Additional estimator families (not in the perf sweep)

The benchmark table above covers the load-bearing fast paths. The following families ship with correctness-only validation against sklearn and are not part of the head-to-head perf sweep — they target API parity, not throughput records:

Clustering: HDBSCAN, AffinityPropagation, MeanShift, Birch, OPTICS, SpectralClustering, BayesianGaussianMixture
Manifold: Isomap, LocallyLinearEmbedding, t-SNE (exact and Barnes-Hut), ClassicalMDS
Decomposition / FE: FastICA, CCA, KernelPCA + inverse_transform, TruncatedSVD, NMF, PLS, RFECV
Neighbors: LocalOutlierFactor (KD-tree path)
GP: GaussianProcessClassifier (Laplace), kernel zoo (Matern, RQ, White, Constant, sums/products), L-BFGS multi-parameter optimisation

Algorithmic complexity tables for each estimator live in validation/sklearn_parity/*.md.

Documentation

API documentation is published at docs.rs/anofox-ml.

Contributing

Contributions are welcome. Please open an issue to discuss proposed changes before submitting a pull request. All code should include tests and pass cargo clippy and cargo fmt --check.

Releasing

The workspace is split into 17 publishable crates plus an umbrella anofox-ml. Versions are kept in lockstep via [workspace.package].version in the root Cargo.toml — bumping there moves every crate at once.

Releases publish through GitHub Actions, triggered by a published GitHub release. To cut a release:

# 1. Bump workspace.package.version in Cargo.toml (e.g. 0.1.0 → 0.2.0)
# 2. Bump workspace.dependencies.anofox-ml-* `version = "..."` entries to match
# 3. Commit, tag, push
git commit -am "Release v0.2.0"
git tag v0.2.0
git push && git push --tags

# 4. Create a GitHub release pointing at the tag.
#    The `publish` job in .github/workflows/ci.yml fires on
#    `release.types: [published]`, runs after the full test matrix, and
#    invokes scripts/publish.sh --execute.

Required secret: CARGO_REGISTRY_TOKEN (a crates.io API token with publish scope) configured in the GitHub repo settings. The workflow also asserts the release tag matches workspace.package.version before uploading anything, so a misnamed tag fails fast.
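
The tag assertion amounts to a string comparison after dropping the leading `v`. The workflow's actual check is not reproduced here and may well be a shell step; the Rust function below is a hypothetical sketch of the same logic, with `check_tag_matches` being an invented name.

```rust
/// Returns Ok if `tag` (with an optional leading 'v') equals the workspace
/// version. Hypothetical sketch; the real check lives in the CI workflow.
pub fn check_tag_matches(tag: &str, workspace_version: &str) -> Result<(), String> {
    let tag_version = tag.strip_prefix('v').unwrap_or(tag);
    if tag_version == workspace_version {
        Ok(())
    } else {
        Err(format!(
            "release tag {tag} does not match workspace version {workspace_version}"
        ))
    }
}

fn main() {
    // A correctly named tag passes; a mismatched one is rejected before upload.
    assert!(check_tag_matches("v0.2.0", "0.2.0").is_ok());
    if let Err(msg) = check_tag_matches("v0.2.1", "0.2.0") {
        println!("{msg}");
    }
}
```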

anofox-ml-python is marked publish = false — it's a PyO3 extension module distributed via maturin (PyPI), not crates.io.

To dry-run locally before tagging:

scripts/publish.sh           # dry-run, no uploads
scripts/publish.sh --execute # requires `cargo login` against crates.io

License

Licensed under either of

at your option.
