# umap-rs
Fast, parallel Rust implementation of the core UMAP algorithm. Clean, modern Rust implementation focused on performance and correctness.
Scales to 264 million samples in 2 hours on a 126-core machine (see [Performance](#performance)).
## What this is
- Core UMAP dimensionality reduction algorithm
- Fully parallelized via Rayon with memory-efficient sparse matrix construction
- Extensible metric system (Euclidean + custom metrics)
- Checkpointing and fault-tolerant training
- Dense arrays only (no sparse matrix support)
- Fit only (transform for new points not yet implemented)
See [DIVERGENCES.md](DIVERGENCES.md) for detailed comparison to Python umap-learn.
## Usage
```rust
use ndarray::Array2;
use umap::{GraphParams, Umap, UmapConfig};
// Configure UMAP parameters
let config = UmapConfig {
n_components: 2,
graph: GraphParams {
n_neighbors: 15,
..Default::default()
},
..Default::default()
};
// Create UMAP instance
let umap = Umap::new(config);
// Precompute KNN (use your favorite ANN library: pynndescent, hnswlib, etc.)
let knn_indices: Array2<u32> = /* shape (n_samples, n_neighbors) */;
let knn_dists: Array2<f32> = /* shape (n_samples, n_neighbors) */;
// Provide initialization (see Initialization section below)
// Common choices: random, PCA, or your own custom embedding
let init: Array2<f32> = /* shape (n_samples, n_components) */;
// Fit UMAP to data
let model = umap.fit(
data.view(),
knn_indices.view(),
knn_dists.view(),
init.view(),
);
// Get the embedding
let embedding = model.embedding(); // Returns ArrayView2<f32>
// Or take ownership of the embedding
let embedding = model.into_embedding(); // Returns Array2<f32>
```
### Checkpointing
UMAP training has two phases: **learning the manifold** (building the graph) and **optimizing the embedding** (running gradient descent). The first is deterministic and expensive, the second is iterative and can be interrupted.
For long training runs, you can checkpoint the optimization and resume if interrupted:
```rust
use umap_rs::{Metric, Optimizer};
// Phase 1: Learn the manifold structure from your data
// This builds the fuzzy topological graph. It's slow but deterministic -
// same inputs always give same manifold.
let manifold = umap.learn_manifold(data.view(), knn_indices.view(), knn_dists.view());
// Phase 2: Optimize the embedding via gradient descent
// Create an optimizer that will run 500 epochs of SGD
let metric = umap_rs::EuclideanMetric;
let mut opt = Optimizer::new(manifold, init, 500, &config, metric.metric_type());
// Train in chunks of 10 epochs at a time
while opt.remaining_epochs() > 0 {
opt.step_epochs(10, &metric); // Run 10 more epochs
// Periodically save a checkpoint (embedding + all optimization state)
if opt.current_epoch() % 50 == 0 {
std::fs::write(
format!("checkpoint_{}.bin", opt.current_epoch()),
bincode::serialize(&opt)?
)?;
}
}
// Training done - convert to final lightweight model
let fitted = opt.into_fitted(config);
```
If your process is interrupted, load the checkpoint and continue:
```rust
// Deserialize the optimizer state from disk
let mut opt: Optimizer = bincode::deserialize(&std::fs::read("checkpoint_250.bin")?)?;
// Continue from epoch 250 to 500
while opt.remaining_epochs() > 0 {
opt.step_epochs(10, &metric);
}
let fitted = opt.into_fitted(config);
```
The checkpoint contains everything: current embedding, epoch counters, and the manifold. When training completes, convert to a final model:
```rust
// Training done - drop the heavy optimization state
let fitted = opt.into_fitted(config);
// Access the embedding
let embedding = fitted.embedding(); // Zero-copy view
// Or take ownership
let embedding = fitted.into_embedding();
// Save the final model (much smaller than checkpoints)
std::fs::write("model.bin", bincode::serialize(&fitted)?)?;
```
The serialized `FittedUmap` contains just the manifold and embedding, not the optimization state, making it lightweight for long-term storage.
## Initialization
**You must provide your own initialization.** This library is designed to be minimal and focused on the core UMAP optimization - initialization is left to the caller.
### Recommended Approaches
**Random initialization** (simplest):
```rust
use ndarray::Array2;
use rand::Rng;
fn random_init(n_samples: usize, n_components: usize) -> Array2<f32> {
let mut rng = rand::thread_rng();
- Spectral embedding (use sparse eigensolvers like arpack-ng for large datasets)
- t-SNE initialization
- Pre-trained neural network embeddings
- Domain-specific embeddings
## Configuration
UMAP parameters are grouped logically:
### Basic
```rust
use umap::UmapConfig;
let config = UmapConfig {
n_components: 2, // Output dimensions
..Default::default()
};
```
### Manifold parameters
```rust
use umap::config::ManifoldParams;
let manifold = ManifoldParams {
min_dist: 0.1, // Minimum distance in embedding
spread: 1.0, // Scale of embedding
a: None, // Auto-computed from min_dist/spread
b: None, // Auto-computed from min_dist/spread
};
```
### Graph construction
```rust
use umap::config::GraphParams;
let graph = GraphParams {
n_neighbors: 15, // Number of nearest neighbors
local_connectivity: 1.0, // Local neighborhood connectivity
set_op_mix_ratio: 1.0, // Fuzzy union (1.0) vs intersection (0.0)
disconnection_distance: None, // Auto-computed from metric
symmetrize: true, // Symmetrize graph (set false to save memory)
};
```
The `symmetrize` option controls whether the fuzzy graph is symmetrized via fuzzy set union. For very large datasets, setting `symmetrize: false` roughly halves memory usage with minimal impact on 2D visualization quality.
### Optimization
```rust
use umap::config::OptimizationParams;
let optimization = OptimizationParams {
n_epochs: None, // Auto-determined from dataset size
learning_rate: 1.0, // SGD learning rate
negative_sample_rate: 5, // Negative samples per positive
repulsion_strength: 1.0, // Weight for negative samples
};
```
### Complete example
```rust
let config = UmapConfig {
n_components: 3,
manifold: ManifoldParams {
min_dist: 0.05,
..Default::default()
},
graph: GraphParams {
n_neighbors: 30,
..Default::default()
},
optimization: OptimizationParams {
n_epochs: Some(500),
..Default::default()
},
};
```
## Custom distance metrics
Implement the `Metric` trait:
```rust
use umap::Metric;
use ndarray::{Array1, ArrayView1};
#[derive(Debug)]
struct MyMetric;
impl Metric for MyMetric {
fn distance(&self, a: ArrayView1<f32>, b: ArrayView1<f32>) -> (f32, Array1<f32>) {
// Return (distance, gradient)
// gradient = ∂distance/∂a
todo!()
}
// Optional: provide fast squared distance for optimization
fn squared_distance(&self, a: ArrayView1<f32>, b: ArrayView1<f32>) -> Option<f32> {
None // Return Some(dist_sq) if available
}
}
// Use custom metric
let umap = Umap::with_metrics(
config,
Box::new(MyMetric), // Input space metric
Box::new(EuclideanMetric), // Output space metric
);
```
See `src/distances.rs` for the Euclidean implementation example.
## Build
```bash
cargo build --release
```
## Performance
This implementation is designed for large-scale datasets (100M+ samples) on high-core-count machines.
### Real-world benchmark
264 million samples embedded to 2D in 2 hours on a 126-core AMD EPYC 9J45 with 1.4 TB RAM:
- Precomputed KNN (n_neighbors=100)
- Precomputed PCA initialization
- Symmetrization disabled
- ~1 TB peak memory
### Parallelization
Every phase is fully parallelized via Rayon:
- **Graph construction**: Parallel smooth KNN distance, parallel CSR matrix construction
- **Set operations**: Parallel CSC structure building, parallel symmetrization
- **Optimizer initialization**: Parallel edge filtering, parallel epoch scheduling
- **SGD optimization**: Lock-free Hogwild! algorithm for parallel gradient descent
### Memory Efficiency
Optimized for minimal memory footprint at scale:
- **Direct CSR construction**: Builds sparse matrices in-place without intermediate COO/triplet format. Avoids O(nnz) temporary allocations and O(nnz log nnz) global sorting.
- **u32 indices**: Uses 4-byte indices instead of 8-byte, halving index memory for datasets up to ~4B samples.
- **CSC structure-only**: Transpose operations store only structure (indptr + indices), looking up values in original CSR via O(log k) binary search.
- **Sequential array allocation**: Large arrays are allocated one at a time to avoid memory spikes.
- **No cloning**: Avoids sequential `.clone()` on large arrays; uses parallel copies when needed.
### Scaling Guidelines
Memory scales with `n_samples × n_neighbors`:
| 10M | 30 | ~10 GB |
| 100M | 30 | ~100 GB |
| 250M | 30 | ~250 GB |
| 250M | 256 | ~2 TB |
To reduce memory:
- Use smaller `n_neighbors` (15-50 is typical for visualization)
- Disable symmetrization: `config.graph.symmetrize = false`
- Slice KNN arrays to use fewer neighbors than computed
### Configuration for Large Datasets
```rust
let config = UmapConfig {
graph: GraphParams {
n_neighbors: 30, // Lower = less memory
symmetrize: false, // Skip symmetrization to save memory
..Default::default()
},
..Default::default()
};
```
### Timing Logs
The library emits structured logs via the `tracing` crate. Enable a subscriber to see timing for each phase:
```rust
tracing_subscriber::fmt::init(); // or your preferred subscriber
```
Example output:
```
INFO umap_rs::umap::fuzzy_simplicial_set: smooth_knn_dist complete duration_ms=52033
INFO umap_rs::umap::fuzzy_simplicial_set: csr row_counts complete duration_ms=48495
INFO umap_rs::umap::fuzzy_simplicial_set: csr indptr complete duration_ms=560 nnz=62586367074
INFO umap_rs::optimizer: optimizer edge filtering complete duration_ms=725 total_edges=23276942679
```
### Advanced: Accessing the Graph
The fuzzy simplicial set graph is exposed as `SparseMat` (a `CsMatI<f32, u32, usize>`):
```rust
use umap_rs::SparseMat;
let manifold = umap.learn_manifold(data.view(), knn_indices.view(), knn_dists.view());
let graph: &SparseMat = manifold.graph();
// graph uses u32 column indices (memory efficient) and usize row pointers (handles large nnz)
```
## Documentation
- [UMAP.md](UMAP.md) - How UMAP works (algorithm explanation)
- [DIVERGENCES.md](DIVERGENCES.md) - Differences from Python umap-learn
- [AGENTS.md](AGENTS.md) - Developer notes
Run `cargo doc --open` to browse the API documentation.
## Design principles
1. **Minimal** - Core algorithm only, no feature creep
2. **Fast** - Parallel by default, zero-copy where possible
3. **Explicit** - Caller provides KNN, initialization, etc.
4. **Rust-native** - Idiomatic patterns, not Python translations
## Limitations
- **Maximum ~4 billion samples**: Uses `u32` indices internally for memory efficiency
- No input validation (assumes clean data)
- Transform not yet implemented
- Dense arrays only
- Panics on invalid input (not Result-based errors)
- Requires external KNN computation and initialization
### KNN Sentinel Values
If your KNN search couldn't find `k` neighbors for some points (e.g., isolated points), use `u32::MAX` as a sentinel index and any distance value (commonly `f32::INFINITY`). These entries are automatically skipped during graph construction:
```rust
// Point 5 only has 2 real neighbors, rest are sentinels
knn_indices[[5, 0]] = 10; // real neighbor
knn_indices[[5, 1]] = 23; // real neighbor
knn_indices[[5, 2]] = u32::MAX; // sentinel - skipped
knn_indices[[5, 3]] = u32::MAX; // sentinel - skipped
```
This is a specialized tool for the core algorithm. Wrap it in validation/error handling for production use.
## License
BSD-3-Clause (see LICENSE file)
## References
- Original paper: McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426
- Hogwild! SGD: Recht, B., et al. (2011). Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. NIPS 2011
- Python umap-learn: https://github.com/lmcinnes/umap