anofox-ml
A scikit-learn-inspired machine learning library for Rust, built on ndarray.
Features
| Category | anofox-ml | scikit-learn equivalent |
|---|---|---|
| Preprocessing & scaling | StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler, Normalizer, Binarizer, KBinsDiscretizer, PolynomialFeatures, PowerTransformer, QuantileTransformer, SimpleImputer, OneHotEncoder, OrdinalEncoder, LabelEncoder | sklearn.preprocessing, sklearn.impute |
| Decomposition | Pca, TruncatedSvd, KernelPca, Nmf (NNDSVD), FastIca | sklearn.decomposition |
| Cross-decomposition | PlsRegression (PLS1), Cca | sklearn.cross_decomposition |
| Manifold learning | ClassicalMds, Isomap, LocallyLinearEmbedding, TSne | sklearn.manifold |
| Feature selection | VarianceThreshold, MutualInformationSelector, SelectKBest, SelectFromModel, Rfe, Rfecv, SequentialFeatureSelector | sklearn.feature_selection |
| Neighbors | KnnClassifier, KnnRegressor (KD-tree), LocalOutlierFactor | sklearn.neighbors |
| Linear models | OlsRegressor, RidgeRegressor (+CV, +sample_weight), LassoRegressor (+CV), ElasticNetRegressor (+CV), HuberRegressor, QuantileRegressor, IsotonicRegressor, WlsRegressor, LogisticRegressor, BayesianRidge, ARDRegression, Lars, LassoLarsIC, OrthogonalMatchingPursuit, RansacRegressor, TheilSenRegressor, KernelRidge, TransformedTargetRegressor, SgdClassifier, SgdRegressor, PassiveAggressiveClassifier, PassiveAggressiveRegressor | sklearn.linear_model |
| GLMs | PoissonRegressor, BinomialRegressor, TweedieRegressor, GammaRegressor | sklearn.linear_model GLM family |
| Discriminant analysis | LinearDiscriminantAnalysis (+transform), QuadraticDiscriminantAnalysis | sklearn.discriminant_analysis |
| Trees | DecisionTreeClassifier (+predict_proba), DecisionTreeRegressor | sklearn.tree |
| Ensemble | RandomForest{Classifier,Regressor}, ExtraTrees{Classifier,Regressor}, GradientBoosting{Classifier,Regressor}, HistGradientBoosting{Classifier,Regressor}, LgbmClassifier, LgbmRegressor, Bagging{Classifier,Regressor}, AdaBoost{Classifier,Regressor}, Voting{Classifier,Regressor}, Stacking{Classifier,Regressor} (+push_proba), CalibratedClassifierCV, IsolationForest (rayon-parallel) | sklearn.ensemble, lightgbm |
| Clustering | KMeans, MiniBatchKMeans, Dbscan, Hdbscan, Optics, Birch, AgglomerativeClustering (Ward/single/complete/average), SpectralClustering, MeanShift, AffinityPropagation, GaussianMixture, BayesianGaussianMixture | sklearn.cluster, sklearn.mixture |
| Naive Bayes | GaussianNB, MultinomialNB, BernoulliNB (all with predict_proba) | sklearn.naive_bayes |
| SVM | Svc, Svr, NuSvc, NuSvr, LinearSvc, LinearSvr, OneClassSvm | sklearn.svm |
| Gaussian processes | GaussianProcessRegressor (RBF, Matern, RationalQuadratic, White, Constant, sums/products), GaussianProcessClassifier (Laplace) | sklearn.gaussian_process |
| Neural networks | MlpClassifier, MlpRegressor | sklearn.neural_network |
| Multi-output | MultiOutputRegressor, MultiOutputClassifier, RegressorChain, ClassifierChain | sklearn.multioutput |
| Text | CountVectorizer, TfidfVectorizer, HashingVectorizer | sklearn.feature_extraction.text |
| Inspection | permutation_importance (rayon-parallel) | sklearn.inspection |
| Metrics | accuracy_score, precision, recall, f1_score, confusion_matrix, roc_auc_score, roc_curve, precision_recall_curve, log_loss, brier_score_loss, matthews_corrcoef, cohen_kappa_score, silhouette_score, adjusted_rand_score, mse, mae, r2_score, mean_absolute_percentage_error, median_absolute_error, mean_squared_log_error, ... | sklearn.metrics |
| Utilities | train_test_split, cross_val_score, cross_validate, grid_search_cv, randomized_search_cv, halving_grid_search_cv, halving_random_search_cv, k_fold (+ stratified / group / time-series / shuffle / leave-one-out), learning_curve, validation_curve, Pipeline, ColumnTransformer, FunctionTransformer, FeatureUnion | sklearn.model_selection, sklearn.pipeline, sklearn.compose |
| I/O & persistence | CSV reader with ndarray integration, JSON / bincode serde round-tripping for fitted models | pandas.read_csv, joblib.dump |
Quick Start
Add anofox-ml to your project:
[dependencies]
anofox-ml = "0.1"
ndarray = "0.16"
Train a KNN classifier with standardized features:
use *;
use ndarray::array;
Architecture
anofox-ml is organized as a Cargo workspace with focused crates. You can depend on
the umbrella anofox-ml crate for everything, or pick individual crates for
smaller dependency trees.
anofox-ml (facade)
+-- anofox-ml-core              Core traits, error types, Pipeline, utilities
+-- anofox-ml-metrics           Classification, regression, clustering metrics
+-- anofox-ml-preprocessing     Scalers, PCA, KernelPCA, NMF, FastICA, TruncatedSVD,
                                PLS, CCA, feature selection, RFE/RFECV/SFS
+-- anofox-ml-neighbors         KNN with KD-tree, LocalOutlierFactor
+-- anofox-ml-trees             CART decision trees with predict_proba
+-- anofox-ml-ensemble          Random Forest, ExtraTrees, Gradient Boosting,
                                HistGradientBoosting, LightGBM-lite, AdaBoost,
                                Bagging, Voting, Stacking, Calibrated, IsolationForest
+-- anofox-ml-cluster           KMeans, MiniBatchKMeans, DBSCAN, HDBSCAN, OPTICS,
                                Birch, Agglomerative, Spectral, MeanShift, AP,
                                GaussianMixture, BayesianGaussianMixture
+-- anofox-ml-naive-bayes       Gaussian/Multinomial/Bernoulli NB
+-- anofox-ml-discriminant      LDA (with transform) and QDA
+-- anofox-ml-svm               SVC, SVR, NuSVC, NuSVR, LinearSVC/SVR, OneClassSVM
+-- anofox-ml-regression        OLS, Ridge (+weighted), Lasso, ElasticNet, GLMs,
                                BayesianRidge, ARD, LARS, OMP, KernelRidge,
                                RANSAC, TheilSen, Tweedie, TransformedTarget
+-- anofox-ml-linear            SGD, PassiveAggressive
+-- anofox-ml-gaussian-process  GP regressor (5 kernels + composites) & classifier
+-- anofox-ml-manifold          ClassicalMDS, Isomap, LLE, t-SNE
+-- anofox-ml-neural-networks   MLPClassifier, MLPRegressor
+-- anofox-ml-text              Count/Tfidf/Hashing vectorizers
+-- anofox-ml-io                CSV loading
Type-state pattern
Estimators use a compile-time type-state pattern to separate unfitted
parameters from fitted models. Calling fit() on an unfitted struct returns a
distinct Fitted* type that implements Predict or Transform. This makes it
a compile error to call predict() on an unfitted estimator.
KnnClassifier --fit()--> FittedKnnClassifier --predict()--> Array1
StandardScaler --fit()--> FittedStandardScaler --transform()--> Array2
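The type-state idea can be illustrated with a toy estimator. This is a minimal sketch, not the library's code: `MeanCenterer` and `FittedMeanCenterer` are hypothetical names, and plain slices stand in for the ndarray types that the real estimators use.

```rust
// Hypothetical, simplified types: a mean-centering transformer over f64 slices.
// The real anofox-ml estimators work on ndarray arrays and have richer APIs.

/// Unfitted estimator: holds only hyperparameters (none in this toy case).
/// It has no `transform` method, so calling one is a compile error.
pub struct MeanCenterer;

/// Fitted model: holds the learned state. Only this type can transform data.
pub struct FittedMeanCenterer {
    mean: f64,
}

impl MeanCenterer {
    /// Learns the mean of `x` and returns a distinct fitted type.
    pub fn fit(&self, x: &[f64]) -> Result<FittedMeanCenterer, String> {
        if x.is_empty() {
            return Err("cannot fit on empty input".to_string());
        }
        let mean = x.iter().sum::<f64>() / x.len() as f64;
        Ok(FittedMeanCenterer { mean })
    }
}

impl FittedMeanCenterer {
    /// Subtracts the learned mean from every value.
    pub fn transform(&self, x: &[f64]) -> Vec<f64> {
        x.iter().map(|v| v - self.mean).collect()
    }
}
```

With these definitions, `MeanCenterer.fit(&data)?.transform(&data)` compiles, while `MeanCenterer.transform(&data)` is rejected by the compiler because the unfitted type has no such method.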
Core traits
| Trait | Purpose |
|---|---|
| Fit<F> | Supervised fitting: fit(&self, x, y) -> Fitted |
| FitUnsupervised<F> | Unsupervised fitting: fit(&self, x) -> Fitted |
| FitWeighted<F> | Supervised fitting with per-sample sample_weight |
| Predict<F> | Generate predictions from fitted model |
| PredictProba<F> | Class probabilities; rows sum to 1 |
| PredictLogProba<F> | Log of predict_proba (auto-derived) |
| DecisionFunction<F> | Real-valued per-class decision scores |
| RegressorScore<F> / ClassifierScore<F> | score() (R² / accuracy) |
| Transform<F> | Transform feature matrix |
| InverseTransform<F> | Reverse a transformation |
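One way a trait such as PredictLogProba can be "auto-derived" is a blanket implementation over every type that implements PredictProba. The sketch below assumes that approach. The trait signatures are simplified stand-ins (nested `Vec<f64>` instead of ndarray, no generic `F`), and `ConstantModel` is a hypothetical toy model.

```rust
/// Simplified stand-in for PredictProba: one row of class probabilities per sample.
pub trait PredictProba {
    fn predict_proba(&self, x: &[Vec<f64>]) -> Vec<Vec<f64>>;
}

/// Simplified stand-in for PredictLogProba.
pub trait PredictLogProba {
    fn predict_log_proba(&self, x: &[Vec<f64>]) -> Vec<Vec<f64>>;
}

/// Blanket impl: any type with `predict_proba` gets `predict_log_proba` for free
/// by taking the element-wise natural logarithm.
impl<T: PredictProba> PredictLogProba for T {
    fn predict_log_proba(&self, x: &[Vec<f64>]) -> Vec<Vec<f64>> {
        self.predict_proba(x)
            .into_iter()
            .map(|row| row.into_iter().map(f64::ln).collect())
            .collect()
    }
}

/// Toy model that ignores its input and always predicts the same distribution.
pub struct ConstantModel {
    pub probs: Vec<f64>,
}

impl PredictProba for ConstantModel {
    fn predict_proba(&self, x: &[Vec<f64>]) -> Vec<Vec<f64>> {
        x.iter().map(|_| self.probs.clone()).collect()
    }
}
```

`ConstantModel` only implements `PredictProba`, yet `predict_log_proba` is callable on it through the blanket impl.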
sklearn parity
Every estimator in anofox-ml is validated against scikit-learn 1.8.0 via golden
fixtures (test_harness/generators/gen_*.py) and corresponding Rust tests in
crates/anofox-ml/tests/golden_*.rs. Per-estimator parity notes — including
tolerances, sample-weight behaviour, missing options, and asymptotic
complexity — live under validation/sklearn_parity/.
The pinned sklearn version (1.8.0) is enforced by
test_harness/check_sklearn_version.py. Bumping the pin requires
./test_harness/regenerate_all.sh followed by a full cargo test --workspace
to confirm tolerances still hold.
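A golden-fixture test ultimately compares Rust output against stored sklearn output within a tolerance. The helper below is a sketch of such a comparison, not the project's actual test utility: `all_close` is a hypothetical name, and the rule `|actual - expected| <= atol + rtol * |expected|` follows the convention used by `numpy.allclose`.

```rust
/// Returns true when every element of `actual` is within tolerance of the
/// matching element of `expected`, using |a - e| <= atol + rtol * |e|.
/// Slices of different lengths never match; NaN values never match.
pub fn all_close(actual: &[f64], expected: &[f64], rtol: f64, atol: f64) -> bool {
    if actual.len() != expected.len() {
        return false;
    }
    actual
        .iter()
        .zip(expected)
        .all(|(a, e)| (a - e).abs() <= atol + rtol * e.abs())
}
```

A golden test would load the expected values from a fixture file and assert `all_close(&predictions, &expected, rtol, atol)` with the per-estimator tolerances.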
Algorithms
See the feature table above for the full list. New since the original release:
- Linear models: BayesianRidge / ARDRegression, LARS / LassoLars / LassoLarsIC, OrthogonalMatchingPursuit, KernelRidge (with sample_weight), Tweedie / Gamma GLMs, TransformedTargetRegressor, PassiveAggressive, RANSAC, TheilSen, Ridge with sample_weight.
- Cluster: MiniBatchKMeans, AgglomerativeClustering (4 linkages), SpectralClustering, MeanShift, AffinityPropagation, Birch, OPTICS, HDBSCAN, GaussianMixture, BayesianGaussianMixture.
- Decomposition / manifold: TruncatedSVD, KernelPCA, NMF (NNDSVD), FastICA, ClassicalMDS, Isomap, LocallyLinearEmbedding, t-SNE.
- Cross-decomposition: PLSRegression, CCA.
- Discriminant: LinearDiscriminantAnalysis (with transform), QDA.
- Gaussian processes: regressor (5 kernels + composites) and Laplace classifier.
- Outlier detection: IsolationForest (rayon-parallel), LocalOutlierFactor.
- Meta-estimators: MultiOutputRegressor/Classifier, RegressorChain, ClassifierChain, StackingClassifier (with predict_proba), CalibratedClassifierCV.
- Feature selection: RFE, RFECV, SequentialFeatureSelector (forward).
- Inspection: permutation_importance (rayon-parallel).
- Search: halving_grid_search_cv, halving_random_search_cv.
- Text: CountVectorizer, TfidfVectorizer, HashingVectorizer.
Metrics
- Classification: accuracy_score, precision, recall, f1_score, confusion_matrix, macro/micro/weighted averaging
- Regression: mse, mae, r2_score
Utilities
train_test_split, cross_val_score, Pipeline
Benchmarks
anofox-ml is faster than scikit-learn on every benchmark in the table below, with speedups ranging from 1.1x to 21.8x. Measurements were taken on the same machine with identical datasets and parameters.
| Algorithm | Operation | sklearn (ms) | anofox-ml (ms) | Speedup |
|---|---|---|---|---|
| GaussianNB | fit 5000×20 | 6.34 | 0.29 | 21.8x |
| DecisionTree | predict 5000×20 | 0.10 | 0.007 | 14.6x |
| KNN | predict 1000×50 | 6.34 | 0.73 | 8.7x |
| KMeans | fit 5000×20 | 114.16 | 20.51 | 5.6x |
| StandardScaler | fit+transform 1000×50 | 0.59 | 0.15 | 3.9x |
| StandardScaler | fit+transform 10000×100 | 6.78 | 3.11 | 2.2x |
| RandomForest | fit 5000×20 | 1039.67 | 511.20 | 2.0x |
| RandomForest | predict 5000×20 | 5.93 | 3.82 | 1.6x |
| DecisionTree | fit 5000×20 | 78.45 | 59.95 | 1.3x |
| GaussianNB | predict 5000×20 | 0.31 | 0.23 | 1.3x |
| KNN | fit 1000×50 | 0.31 | 0.29 | 1.1x |
Key optimizations: incremental sorted-index split finding for decision trees, BinaryHeap-based KD-tree pruning for KNN, vectorized distance computation with rayon parallelism for KMeans, and batch prediction for Random Forest.
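The BinaryHeap idea behind the KNN search can be shown in isolation: keep the k best candidates in a max-heap, so the heap's top is always the current worst kept neighbor and acts as the pruning bound. The sketch below is not the library's implementation. It scans all points by brute force instead of walking a KD-tree, and `Candidate` and `k_nearest` are hypothetical names.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Candidate neighbor, ordered by squared distance so that the max-heap's
/// top element is the farthest neighbor currently kept.
#[derive(Debug, PartialEq)]
struct Candidate {
    dist_sq: f64,
    index: usize,
}

impl Eq for Candidate {}

impl Ord for Candidate {
    fn cmp(&self, other: &Self) -> Ordering {
        // total_cmp gives f64 a total order; ties are broken by index.
        self.dist_sq
            .total_cmp(&other.dist_sq)
            .then_with(|| self.index.cmp(&other.index))
    }
}

impl PartialOrd for Candidate {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

fn squared_distance(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Returns the indices of the `k` points nearest to `query`, closest first.
pub fn k_nearest(points: &[Vec<f64>], query: &[f64], k: usize) -> Vec<usize> {
    let mut heap: BinaryHeap<Candidate> = BinaryHeap::with_capacity(k + 1);
    for (index, p) in points.iter().enumerate() {
        let dist_sq = squared_distance(p, query);
        if heap.len() < k {
            heap.push(Candidate { dist_sq, index });
        } else if let Some(worst) = heap.peek() {
            // Prune: only a candidate closer than the current worst may enter.
            if dist_sq < worst.dist_sq {
                heap.pop();
                heap.push(Candidate { dist_sq, index });
            }
        }
    }
    // into_sorted_vec yields ascending order, i.e. closest neighbor first.
    heap.into_sorted_vec().into_iter().map(|c| c.index).collect()
}
```

In a KD-tree search the same bound (`worst.dist_sq`) would also decide whether a whole subtree can be skipped, which is where most of the savings come from.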
Reproduce with:
Additional estimator families (not in the perf sweep)
The benchmark table above covers the load-bearing fast paths. The following families ship with correctness-only validation against sklearn and are not part of the head-to-head perf sweep — they target API parity, not throughput records:
- Clustering: HDBSCAN, AffinityPropagation, MeanShift, Birch, OPTICS, SpectralClustering, BayesianGaussianMixture
- Manifold: Isomap, LocallyLinearEmbedding, t-SNE (exact and Barnes-Hut), ClassicalMDS
- Decomposition / FE: FastICA, CCA, KernelPCA + inverse_transform, TruncatedSVD, NMF, PLS, RFECV
- Neighbors: LocalOutlierFactor (KD-tree path)
- GP: GaussianProcessClassifier (Laplace), kernel zoo (Matern, RQ, White, Constant, sums/products), L-BFGS multi-parameter optimisation
Algorithmic complexity tables for each estimator live in
validation/sklearn_parity/*.md.
Documentation
API documentation is published at docs.rs/anofox-ml.
Contributing
Contributions are welcome. Please open an issue to discuss proposed changes
before submitting a pull request. All code should include tests and pass
cargo clippy and cargo fmt --check.
Releasing
The workspace is split into 17 publishable crates plus an umbrella anofox-ml.
Versions are kept in lockstep via [workspace.package].version in the root
Cargo.toml — bumping there moves every crate at once.
Releases publish through GitHub Actions, triggered by a published GitHub release. To cut a release:
# 1. Bump workspace.package.version in Cargo.toml (e.g. 0.1.0 → 0.2.0)
# 2. Bump workspace.dependencies.anofox-ml-* `version = "..."` entries to match
# 3. Commit, tag, push
&&
# 4. Create a GitHub release pointing at the tag.
# The `publish` job in .github/workflows/ci.yml fires on
# `release.types: [published]`, runs after the full test matrix, and
# invokes scripts/publish.sh --execute.
Required secret: CARGO_REGISTRY_TOKEN (a crates.io API token with publish
scope) configured in the GitHub repo settings. The workflow also asserts
the release tag matches workspace.package.version before uploading
anything, so a misnamed tag fails fast.
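The tag check amounts to a string comparison between the release tag and the workspace version. The function below is a sketch of that logic only. The actual workflow step is not shown in this README, `tag_matches_version` is a hypothetical name, and the optional leading `v` on tags is an assumption about the tag convention.

```rust
/// Returns true when `tag` names the same version as `workspace_version`.
/// Assumption: a tag may carry a leading `v` (for example `v0.2.0`);
/// the project's real tag convention is not stated in this README.
pub fn tag_matches_version(tag: &str, workspace_version: &str) -> bool {
    let tag_version = tag.strip_prefix('v').unwrap_or(tag);
    // An empty version is treated as a mismatch so a missing value fails fast.
    !workspace_version.is_empty() && tag_version == workspace_version
}
```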
anofox-ml-python is marked publish = false — it's a PyO3 extension
module distributed via maturin (PyPI), not crates.io.
To dry-run locally before tagging:
License
Licensed under either of
at your option.