# umap-rs

Fast, parallel Rust implementation of the core UMAP dimensionality reduction algorithm, focused on performance and correctness.
## What this is
- Core UMAP dimensionality reduction algorithm
- Parallel SGD optimization via Rayon
- Extensible metric system (Euclidean + custom metrics)
- Checkpointing and fault-tolerant training
- Dense arrays only (no sparse matrix support)
- Fit only (transform for new points not yet implemented)
## What this is NOT
- Drop-in replacement for Python umap-learn
- General-purpose ML library
- Production-ready with validation/error handling
See `DIVERGENCES.md` for a detailed comparison to Python umap-learn.
## Usage
```rust
use ndarray::Array2;
use umap_rs::{Umap, UmapConfig}; // type and module names illustrative; see `cargo doc`

// Configure UMAP parameters
let config = UmapConfig::default();

// Create UMAP instance
let umap = Umap::new(config);

// Precompute KNN (use your favorite ANN library: pynndescent, hnswlib, etc.)
let knn_indices: Array2<u32> = /* shape (n_samples, n_neighbors) */;
let knn_dists: Array2<f32> = /* shape (n_samples, n_neighbors) */;

// Provide initialization (see Initialization section below)
// Common choices: random, PCA, or your own custom embedding
let init: Array2<f32> = /* shape (n_samples, n_components) */;

// Fit UMAP to data (argument order illustrative)
let model = umap.fit(&knn_indices, &knn_dists, init);

// Get the embedding
let embedding = model.embedding(); // Returns ArrayView2<f32>

// Or take ownership of the embedding
let embedding = model.into_embedding(); // Returns Array2<f32>
```
## Checkpointing
UMAP training has two phases: learning the manifold (building the graph) and optimizing the embedding (running gradient descent). The first is deterministic and expensive, the second is iterative and can be interrupted.
For long training runs, you can checkpoint the optimization and resume if interrupted:
```rust
use umap_rs::{EuclideanMetric, Optimizer};

// Phase 1: Learn the manifold structure from your data.
// This builds the fuzzy topological graph. It's slow but deterministic -
// same inputs always give the same manifold.
let manifold = umap.learn_manifold(&knn_indices, &knn_dists);

// Phase 2: Optimize the embedding via gradient descent.
// Create an optimizer that will run 500 epochs of SGD.
let metric = EuclideanMetric;
let mut opt = Optimizer::new(manifold, init, metric, 500);

// Train in chunks of 10 epochs at a time
// (method names illustrative; see `cargo doc` for the exact API)
while opt.remaining_epochs() > 0 {
    opt.run_epochs(10);
    /* serialize `opt` to disk here to checkpoint progress */
}

// Training done - convert to final lightweight model
let fitted = opt.into_fitted();
```
If your process is interrupted, load the checkpoint and continue:
```rust
// Deserialize the optimizer state from disk (e.g. via serde)
let mut opt: Optimizer = deserialize(&checkpoint_bytes)?;

// Continue from epoch 250 to 500
while opt.remaining_epochs() > 0 {
    opt.run_epochs(10);
}

let fitted = opt.into_fitted();
```
The checkpoint contains everything: current embedding, epoch counters, and the manifold. When training completes, convert to a final model:
```rust
// Training done - drop the heavy optimization state
let fitted = opt.into_fitted();

// Access the embedding
let embedding = fitted.embedding(); // Zero-copy view

// Or take ownership
let embedding = fitted.into_embedding();

// Save the final model (much smaller than checkpoints)
// (serialization and path are illustrative)
std::fs::write("model.bin", serialize(&fitted)?)?;
```
The serialized `FittedUmap` contains just the manifold and embedding, not the optimization state, making it lightweight for long-term storage.
## Initialization
You must provide your own initialization. This library is designed to be minimal and focused on the core UMAP optimization - initialization is left to the caller.
### Recommended Approaches
Random initialization (simplest):

```rust
use ndarray::Array2;
use rand::Rng;

let mut rng = rand::thread_rng();
let init = Array2::from_shape_fn((n_samples, n_components), |_| {
    rng.gen_range(-10.0f32..10.0)
});
```
PCA initialization (recommended for better convergence):

```rust
// Use any PCA library (e.g., linfa-reduction, ndarray-stats, etc.):
// project the data onto the first n_components principal components,
// then scale to roughly the [-10, 10] range.
```
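If you'd rather not pull in a PCA dependency, the projection is small enough to hand-roll. Here is a minimal sketch using power iteration with deflation on plain `Vec`s; the function name and signature are illustrative, not part of this crate:

```rust
/// Project `data` (n rows, d columns) onto its top `k` principal components
/// via power iteration with deflation, then rescale each output dimension
/// to roughly [-10, 10]. Illustrative sketch only.
fn pca_init(data: &[Vec<f64>], k: usize, iters: usize) -> Vec<Vec<f64>> {
    let (n, d) = (data.len(), data[0].len());

    // Center the data
    let mut mean = vec![0.0; d];
    for row in data {
        for (m, x) in mean.iter_mut().zip(row) { *m += x / n as f64; }
    }
    let centered: Vec<Vec<f64>> = data.iter()
        .map(|row| row.iter().zip(&mean).map(|(x, m)| x - m).collect())
        .collect();

    // Covariance matrix (d x d)
    let mut cov = vec![vec![0.0; d]; d];
    for row in &centered {
        for i in 0..d {
            for j in 0..d { cov[i][j] += row[i] * row[j] / n as f64; }
        }
    }

    // Top-k eigenvectors by power iteration, deflating via orthogonalization
    let mut components: Vec<Vec<f64>> = Vec::new();
    for _ in 0..k {
        let mut v = vec![1.0; d];
        for _ in 0..iters {
            // w = cov * v
            let mut w: Vec<f64> = cov.iter()
                .map(|row| row.iter().zip(&v).map(|(c, x)| c * x).sum())
                .collect();
            // Remove directions already found
            for c in &components {
                let dot: f64 = w.iter().zip(c).map(|(a, b)| a * b).sum();
                for (wi, ci) in w.iter_mut().zip(c) { *wi -= dot * ci; }
            }
            let norm = w.iter().map(|x| x * x).sum::<f64>().sqrt();
            for x in &mut w { *x /= norm; }
            v = w;
        }
        components.push(v);
    }

    // Project, then rescale each output dimension to [-10, 10]
    let mut proj: Vec<Vec<f64>> = centered.iter()
        .map(|row| components.iter()
            .map(|c| row.iter().zip(c).map(|(x, ci)| x * ci).sum())
            .collect())
        .collect();
    for j in 0..k {
        let max_abs = proj.iter().map(|r| r[j].abs()).fold(0.0, f64::max);
        if max_abs > 0.0 {
            for r in &mut proj { r[j] *= 10.0 / max_abs; }
        }
    }
    proj
}
```

For real workloads a dedicated library will be faster and more numerically robust; this is only meant to show that nothing about the initialization contract is magic.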
Custom initialization:
- Spectral embedding (use sparse eigensolvers like arpack-ng for large datasets)
- t-SNE initialization
- Pre-trained neural network embeddings
- Domain-specific embeddings
## Configuration
UMAP parameters are grouped logically:
### Basic

```rust
use umap_rs::UmapConfig;

let config = UmapConfig { /* ... */ };
```
### Manifold parameters

```rust
use umap_rs::ManifoldParams;

// Parameters controlling the learned fuzzy topological structure
let manifold = ManifoldParams { /* ... */ };
```
### Graph construction

```rust
use umap_rs::GraphParams;

// Parameters controlling fuzzy simplicial set construction
let graph = GraphParams { /* ... */ };
```
### Optimization

```rust
use umap_rs::OptimizationParams;

// Parameters controlling the SGD embedding optimization
let optimization = OptimizationParams { /* ... */ };
```
### Complete example

```rust
let config = UmapConfig { /* ... */ };
```
## Custom distance metrics
Implement the `Metric` trait:

```rust
use umap_rs::Metric;
use ndarray::ArrayView1;

struct ManhattanMetric;

// Method name and signature are illustrative; copy the shape of the
// Euclidean implementation in src/distances.rs.
impl Metric for ManhattanMetric {
    fn distance(&self, a: ArrayView1<f32>, b: ArrayView1<f32>) -> f32 {
        a.iter().zip(b.iter()).map(|(x, y)| (x - y).abs()).sum()
    }
}

// Use custom metric
let umap = Umap::with_metrics(config, ManhattanMetric);
```

See `src/distances.rs` for the Euclidean implementation example.
## Build

```shell
cargo build --release
```
## Performance
- Parallel graph construction: Fuzzy simplicial set computation is fully parallelized via Rayon
- Parallel SGD: Enabled by default via Rayon
- Hogwild! algorithm: Lock-free parallel gradient descent
## Documentation
- UMAP.md - How UMAP works (algorithm explanation)
- DIVERGENCES.md - Differences from Python umap-learn
- AGENTS.md - Developer notes
Run `cargo doc --open` to browse the API documentation.
## Design principles
- Minimal - Core algorithm only, no feature creep
- Fast - Parallel by default, zero-copy where possible
- Explicit - Caller provides KNN, initialization, etc.
- Rust-native - Idiomatic patterns, not Python translations
## Limitations
- No input validation (assumes clean data)
- Transform not yet implemented
- Dense arrays only
- Panics on invalid input (not Result-based errors)
- Requires external KNN computation and initialization
This is a specialized tool for the core algorithm. Wrap it in validation/error handling for production use.
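A production wrapper mostly means front-loading the checks the core skips. A sketch of the kind of validation worth doing on the precomputed KNN inputs before fitting; the function, its signature, and the flat-slice representation are illustrative, not part of this crate:

```rust
/// Validate precomputed KNN inputs before handing them to the (panicking)
/// core algorithm. Illustrative sketch; adapt to your own array types.
fn validate_knn(
    knn_indices: &[usize], // flattened (n_samples, n_neighbors), row-major
    knn_dists: &[f32],     // same shape as knn_indices
    n_samples: usize,
    n_neighbors: usize,
) -> Result<(), String> {
    let expected = n_samples * n_neighbors;
    if knn_indices.len() != expected || knn_dists.len() != expected {
        return Err(format!(
            "expected {} entries, got {} indices / {} dists",
            expected,
            knn_indices.len(),
            knn_dists.len()
        ));
    }
    if let Some(&bad) = knn_indices.iter().find(|&&i| i >= n_samples) {
        return Err(format!(
            "neighbor index {} out of range (n_samples = {})",
            bad, n_samples
        ));
    }
    if knn_dists.iter().any(|d| !d.is_finite() || *d < 0.0) {
        return Err("distances must be finite and non-negative".into());
    }
    Ok(())
}
```

Returning `Result` at the boundary lets the wrapper surface bad inputs as errors while the core stays panic-based and fast.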
## License
BSD-3-Clause (see LICENSE file)
## References
- Original paper: McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426
- Hogwild! SGD: Recht, B., et al. (2011). Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. NIPS 2011
- Python umap-learn: https://github.com/lmcinnes/umap