# fast-umap

GPU-accelerated parametric UMAP (Uniform Manifold Approximation and Projection) in Rust, built on burn + CubeCL.

See docs.rs for the full API reference.
## Highlights

- Up to 4.7× faster than umap-rs on datasets ≥ 10 000 samples (see benchmarks below)
- Parametric — trains a neural network, so you can `transform()` new unseen data instantly
- GPU-accelerated — custom CubeCL kernels for pairwise distance and KNN, compiled for Metal / Vulkan / DX12 via WGPU
- API mirrors umap-rs — drop-in replacement with `Umap::new(config).fit(data)`
- Automatic differentiation — full autograd through custom GPU kernels
- CPU fallback — runs on the NdArray backend (no GPU required for inference or tests)
## Performance — fast-umap vs umap-rs
Benchmarked on Apple M3 Max. Both crates receive identical random data. fast-umap runs 50 epochs (parametric, GPU); umap-rs runs 200 epochs (classical SGD, CPU). Total time includes data prep + KNN + fit + extract.
| Dataset | fast-umap | umap-rs | Speedup |
|---|---|---|---|
| 500 × 50 | 0.84s | 0.08s | 0.10× (umap-rs faster) |
| 1 000 × 50 | 2.19s | 0.12s | 0.05× (umap-rs faster) |
| 2 000 × 100 | 3.65s | 0.44s | 0.12× (umap-rs faster) |
| 5 000 × 100 | 6.75s | 2.31s | 0.34× (umap-rs faster) |
| 10 000 × 100 | 5.93s | 8.68s | 1.5× faster 🚀 |
| 20 000 × 100 | 7.32s | 34.10s | 4.7× faster 🚀 |
Crossover ≈ 10 000 samples. Below that, umap-rs wins on raw CPU efficiency. Above it, fast-umap pulls ahead and the gap widens with dataset size — umap-rs's brute-force KNN scales O(n²), while fast-umap's per-epoch cost is capped.
Why fast-umap wins at scale:
| Technique | Effect |
|---|---|
| Sparse edge-based loss | O(n·k) per epoch instead of O(n²) |
| Edge subsampling | Caps edges/epoch at 50K regardless of n |
| Pre-batched negative samples on GPU | Zero per-epoch CPU→GPU transfers |
| Fused index gather | 2 GPU select() calls instead of 4 |
| Async loss readback | GPU→CPU sync every 5 epochs, not every epoch |
| In-memory checkpointing | No disk I/O during training |
| GPU-accelerated KNN | Full n×n pairwise distance on GPU (one-time cost) |
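The cost capping in the table reduces to simple arithmetic. The sketch below is illustrative only (the helper names are not part of fast-umap's API); the 50 000-edge cap and the `n_pos × rate` negative-sampling formula come from the sections above:

```rust
/// Positive edges processed per epoch: O(n·k), hard-capped at 50 000.
fn edges_per_epoch(n: usize, k: usize) -> usize {
    (n * k).min(50_000)
}

/// Negative (repulsion) samples per epoch: n_pos × neg_sample_rate.
fn negatives_per_epoch(n_pos: usize, neg_sample_rate: usize) -> usize {
    n_pos * neg_sample_rate
}

fn main() {
    // 20 000 samples with k = 15 neighbours would give 300 000 edges,
    // but subsampling caps the per-epoch cost regardless of n:
    let n_pos = edges_per_epoch(20_000, 15);
    assert_eq!(n_pos, 50_000);

    // With the default neg_sample_rate of 5:
    assert_eq!(negatives_per_epoch(n_pos, 5), 250_000);
}
```

This is why the comparison table flattens out for fast-umap while umap-rs grows quadratically.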
Reproduce:

```shell
./bench.sh --only comparison
```

Or run all benchmarks at once (hardware + comparison + MNIST):

```shell
./bench.sh
```
## What's New (v1.2.2)
See CHANGELOG.md for the full release history.
| Area | Change |
|---|---|
| GPU cooldown | New cooldown_ms parameter — sleep N ms between epochs to prevent 100 % GPU utilisation; default 0 (no change to existing behaviour) |
| UMAP kernel | Proper q = 1/(1 + a·d^(2b)) kernel with a, b fitted from min_dist/spread — replaces fixed Student-t 1/(1+d²) for better cluster separation |
| Configurable negative sampling | neg_sample_rate parameter (default 5); formula fixed from n_pos × rate / k → n_pos × rate |
| Verbose logging | All training output gated behind verbose flag; improved structured messages with timings, edge counts, kernel params, stop reasons |
| ManifoldParams | min_dist and spread now actively shape the embedding kernel (previously defined but unused) |
| New API | Umap::new(config).fit(data) returns FittedUmap with .embedding(), .transform(), .into_embedding() — mirrors umap-rs |
| Sparse training | O(n·k) per epoch with edge subsampling + configurable negative sampling (was O(n²)) |
## Features

- Dimensionality reduction — projects high-dimensional data to 2-D or 3-D for visualization
- Parametric model — learned neural network can project new, unseen data via `transform()`
- GPU-accelerated kernels — custom CubeCL kernels for Euclidean pairwise distance and KNN, compiled for WGPU (Metal / Vulkan / DX12)
- Automatic differentiation — full autograd through the custom kernels via burn's autodiff backend
- Sparse training — edge subsampling + negative sampling keeps per-epoch cost constant regardless of dataset size
- Flexible architecture — configurable hidden layers, output dims, distance metric, learning rate, early stopping, timeout
- CPU fallback — all model code runs on NdArray (no GPU required for inference or tests)
- 36 unit tests — covering normalization, tensor conversion, model shape, distance math
- Hardware-tagged benchmarks — CPU and GPU timings saved as Markdown + SVG, including a CPU vs GPU comparison chart
## Installation

```shell
cargo add fast-umap
```

Or in `Cargo.toml`:

```toml
[dependencies]
fast-umap = "1.2.2"
burn = { version = "0.20.1", features = ["wgpu", "autodiff", "autotune"] }
cubecl = { version = "0.9.0", features = ["wgpu"] }
```
## Quick Start

```rust
use burn::backend::wgpu::WgpuRuntime;
use fast_umap::prelude::*;

type MyBackend = CubeBackend<WgpuRuntime, f32, i32, u32>;
type MyAutodiffBackend = Autodiff<MyBackend>;

// 100 samples × 10 features
let data: Vec<Vec<f32>> = generate_test_data(100, 10)
    .chunks(10)
    .map(|chunk| chunk.to_vec())
    .collect();

// Fit UMAP (default: 2-D output, Euclidean distance)
let config = UmapConfig::default();
let umap = Umap::<MyAutodiffBackend>::new(config);
let fitted = umap.fit(data);

// Get embedding
let embedding = fitted.embedding();
println!("{embedding:?}");
```
## Backend Choice: CPU vs GPU with Feature-Based Compilation

fast-umap supports both CPU and GPU backends, selected at compile time via Cargo features:
### GPU Backend (WGPU) — Recommended

For GPU-accelerated execution (requires a WGPU-compatible GPU):

```rust
use burn::backend::wgpu::WgpuRuntime;
use fast_umap::prelude::*;

type MyBackend = CubeBackend<WgpuRuntime, f32, i32, u32>;
type MyAutodiffBackend = Autodiff<MyBackend>;

let config = UmapConfig::default();
let umap = Umap::<MyAutodiffBackend>::new(config);
let fitted = umap.fit(data);
```
Features:
- ✅ Full parametric UMAP with neural network training
- ✅ GPU acceleration via WGPU (Metal/Vulkan/DX12)
- ✅ Transform new data through trained model
- ✅ Custom GPU kernels for efficient computation
- ✅ Automatic differentiation for training
### CPU Backend — Full Functionality with umap-rs Fallback
For CPU-only execution or when GPU is not available, fast-umap provides a complete CPU backend that uses classical UMAP computation:
Features:
- ✅ Complete UMAP functionality using classical UMAP algorithm
- ✅ No GPU required
- ✅ Same API as GPU backend for consistency
- ✅ Full configuration support (n_components, n_neighbors, etc.)
- ✅ Excellent for environments without GPU access
- ❌ Cannot transform new data (classical UMAP limitation)
When to use CPU backend:
- Development and testing environments without GPU
- Cloud environments with CPU-only instances
- Edge devices without GPU acceleration
- Fallback when GPU drivers are unavailable
See the cpu_training_demo example for complete CPU usage.
### Feature-Based Compilation

fast-umap uses Cargo features to enable only the backends you need:

```shell
# GPU backend (default, includes gpu + verbose features)
cargo build --release

# CPU-only backend
cargo build --release --features cpu

# Minimal build (no backends, library mode only)
cargo build --release --no-default-features

# All features (development)
cargo build --release --features all
```

See the feature_demo example for the complete feature matrix.
### Runtime Backend Selection

The backend_choice example demonstrates runtime backend selection:

```shell
# Run with GPU backend
cargo run --release --example backend_choice gpu

# Run with CPU backend
cargo run --release --features cpu --example backend_choice cpu
```
### CPU Testing Examples

For working CPU examples, see the test suite:

```shell
# Run CPU-based tests
cargo test --lib
```

The tests in tests/tests.rs demonstrate NdArray backend usage for CPU computation.
## API Overview

The public API mirrors umap-rs:

| Type | Description |
|---|---|
| `Umap<B>` | Main algorithm struct — `Umap::new(config)` |
| `FittedUmap<B>` | Fitted model — `.embedding()`, `.transform()`, `.into_embedding()`, `.config()` |
| `UmapConfig` | Configuration with nested `GraphParams` + `OptimizationParams` |
| `ManifoldParams` | `min_dist`, `spread` — control cluster tightness and separation |
| `GraphParams` | `n_neighbors`, `metric`, `normalized`, `minkowski_p` |
| `OptimizationParams` | `n_epochs`, `learning_rate`, `patience`, `timeout`, `verbose`, `neg_sample_rate`, … |
| `Metric` | `Euclidean`, `EuclideanKNN`, `Manhattan`, `Cosine` |
## Configuration

```rust
// Start from defaults and override the fields you need (see the table below)
let config = UmapConfig::default();
```
### Key parameters

| Parameter | Default | Description |
|---|---|---|
| `n_components` | 2 | Output dimensionality (2-D or 3-D) |
| `hidden_sizes` | `[100, 100, 100]` | Neural network hidden layer sizes |
| `min_dist` | 0.1 | Min distance in embedding — smaller = tighter clusters |
| `spread` | 1.0 | Effective scale of embedded points |
| `n_neighbors` | 15 | KNN graph neighbours |
| `n_epochs` | 200 | Training epochs |
| `learning_rate` | 1e-3 | Adam step size |
| `batch_size` | 1 000 | Samples per batch |
| `penalty` | 0.0 | L2 weight decay |
| `metric` | `Euclidean` | Distance metric |
| `repulsion_strength` | 1.0 | Repulsion term weight |
| `neg_sample_rate` | 5 | Negative (repulsion) samples per positive edge per epoch |
| `cooldown_ms` | 0 | Sleep N ms between epochs to reduce GPU utilisation (0 = disabled) |
| `patience` | `None` | Early-stop epochs without improvement |
| `min_desired_loss` | `None` | Stop when loss ≤ threshold |
| `timeout` | `None` | Hard time limit (seconds) |
| `verbose` | `true` | Progress bar + loss plots |
## Transform new data

Because fast-umap is parametric (a neural network), it can project new unseen data — something classical UMAP cannot do:

```rust
let fitted = umap.fit(data);

// Project new data through the trained model
let new_embedding = fitted.transform(new_data);
```
## Examples

### CPU Capabilities Demo

Demonstrates CPU backend capabilities for utility functions and tensor operations:

```shell
cargo run --release --features cpu --example cpu_training_demo
```

### Backend Choice — CPU vs GPU selection

Demonstrates how to choose between CPU (NdArray) and GPU (WGPU) backends at runtime:

```shell
# Run with GPU backend (default)
cargo run --release --example backend_choice gpu

# Run with CPU backend
cargo run --release --example backend_choice cpu
```
Note about CPU backend: The CPU option shows backend selection structure. For utility functions, see cpu_training_demo. The GPU backend provides full parametric UMAP with neural network training and the ability to transform new data.
### Simple — random data, 2-D embedding

```shell
cargo run --release --example simple
```
### Advanced — 1 000 samples, custom config

```shell
cargo run --release --example advanced
```
### MNIST — 10 000 hand-written digits projected to 2-D

```shell
cargo run --release --example mnist   # quick run (no figures)
./bench.sh --only mnist               # generate figures
```

Downloads the MNIST dataset on first run (~12 MB) and trains UMAP on 10K digits (784 features → 2-D).

| 2-D digit embedding (coloured by class) | Loss curve |
|---|---|
| ![MNIST 2-D embedding](figures/mnist.png) | ![Loss curve](figures/losses_model.png) |

```rust
let config = UmapConfig::default();
let umap = Umap::new(config);
let fitted = umap.fit(data);
```
## Generating Figures

All figures are generated by the benchmark suite, not by the examples. Examples are lightweight smoke tests that verify correctness without writing any files.

```shell
# Run all benchmarks and generate all figures
./bench.sh

# Run only specific benchmarks
./bench.sh --only comparison        # fast-umap vs umap-rs chart
./bench.sh --only mnist             # MNIST embedding + loss curve
./bench.sh --only hardware          # CPU + GPU micro-benchmarks
./bench.sh --only mnist comparison  # combine multiple

# Skip MNIST (saves ~70s)
./bench.sh --skip-mnist

# Include Criterion statistical suite
./bench.sh --criterion
```
### What generates what

| Benchmark | Output |
|---|---|
| `comparison` | `figures/crate_comparison.{svg,json,md}` |
| `mnist` | `figures/mnist.png`, `figures/losses_model.png` |
| `hardware` | `benches/results/{cpu,gpu,comparison}_*.{svg,md}` |
| `criterion` | `target/criterion/` (HTML reports) |
### Run examples (no figures)

```shell
./run_all.sh               # all examples
./run_all.sh --skip-mnist  # skip MNIST download
```
## Developer Experience Improvements

The feature-based compilation system provides:

### ✅ Faster Compilation

- Compile only what you need: `cargo build --release --features cpu`
- Minimal builds: `cargo build --release --no-default-features`
- Feature-specific examples and documentation

### ✅ Smaller Binaries

- GPU-only: exclude CPU dependencies (save ~5 MB)
- CPU-only: exclude GPU dependencies (save ~10 MB)
- Library mode: minimal footprint for integration
### ✅ Clear Feature Matrix

| Feature | Description | When to Use |
|---|---|---|
| `gpu` | GPU backend (WGPU) | Production, large datasets |
| `cpu` | CPU backend (umap-rs) | CPU-only environments |
| `verbose` | Progress output | Development, debugging |
| `plotters` | Visualization | Exploration, analysis |
| `all` | Everything | Development, testing |
### ✅ Better Error Messages
- Clear feature requirements at compile time
- Graceful runtime fallback handling
- Comprehensive validation
See the feature_demo example for the complete feature guide.
## Architecture

fast-umap uses a parametric approach — a small feed-forward neural network is trained with the UMAP cross-entropy loss:

```
attraction = mean_{k-NN edges} [ −log q_ij ]
repulsion  = mean_{negative samples} [ −log (1 − q_ij) ]
loss       = attraction + repulsion_strength × repulsion
```

where q_ij = 1 / (1 + a · d_ij^(2b)) is the UMAP kernel applied to embedding distances (a and b are fitted from min_dist / spread).
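As a concrete sketch, the kernel and the two per-edge loss terms can be written in a few lines of plain Rust. This is illustrative only — the crate computes these on GPU tensors, and the fitting of a and b from min_dist / spread is not shown:

```rust
/// UMAP low-dimensional kernel: q = 1 / (1 + a · d^(2b)).
/// With a = 1 and b = 1 this reduces to the Student-t kernel 1/(1 + d²).
fn umap_kernel(d: f32, a: f32, b: f32) -> f32 {
    1.0 / (1.0 + a * d.powf(2.0 * b))
}

/// Attraction term for one k-NN edge: −log q (clamped for numerical safety).
fn attraction(q: f32) -> f32 {
    -q.max(1e-12).ln()
}

/// Repulsion term for one negative sample: −log(1 − q).
fn repulsion(q: f32) -> f32 {
    -(1.0 - q).max(1e-12).ln()
}

fn main() {
    // Points at distance 0 are maximally similar: q = 1, zero attraction loss.
    let q_near = umap_kernel(0.0, 1.0, 1.0);
    assert!((q_near - 1.0).abs() < 1e-6);
    assert!(attraction(q_near) < 1e-6);

    // Distant points have q → 0, so their repulsion loss is also near zero.
    let q_far = umap_kernel(100.0, 1.0, 1.0);
    assert!(repulsion(q_far) < 1e-3);
}
```

Minimising attraction pulls neighbours together; minimising repulsion pushes negative samples apart, weighted by `repulsion_strength`.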
### Training pipeline

```
Input data [n, features]
          │
          ▼
GPU pairwise distance → KNN graph (one-time O(n²) cost)
          │
          ▼
┌─── Per epoch (cost: O(min(n·k, 50K))) ─────┐
│ Forward pass: data → neural net → [n, 2]   │
│ Edge subsampling from KNN graph            │
│ Negative sampling (random non-neighbors)   │
│ UMAP cross-entropy loss                    │
│ Backward pass + Adam optimizer step        │
└────────────────────────────────────────────┘
          │
          ▼
FittedUmap with .embedding() and .transform()
```
### Modules

| Module | Description |
|---|---|
| [model] | UMAPModel neural network and config builder |
| [train] | Training loop, UmapConfig, sparse training, loss computation |
| [chart] | 2-D scatter plots and loss curves (plotters) |
| [utils] | Data generation, tensor conversion, normalisation |
| [kernels] | Custom CubeCL GPU kernels (Euclidean distance, k-NN) |
| [backend] | Backend trait extension for custom kernel dispatch |
| [distances] | CPU-side distance functions (Euclidean, cosine, Minkowski…) |
| [prelude] | Re-exports of the most commonly used items |
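As a hedged illustration of the CPU-side metrics listed above (this is a sketch, not the crate's actual code), the Euclidean and Manhattan distances reduce to:

```rust
/// Euclidean (L2) distance between two feature vectors.
fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// Manhattan (L1) distance.
fn manhattan(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
}

fn main() {
    // 3-4-5 right triangle; self-distance is zero.
    assert_eq!(euclidean(&[0.0, 0.0], &[3.0, 4.0]), 5.0);
    assert_eq!(euclidean(&[1.0, 2.0], &[1.0, 2.0]), 0.0);
    assert_eq!(manhattan(&[0.0, 0.0], &[3.0, 4.0]), 7.0);
}
```

The GPU kernels compute the same quantity, but for the full n×n matrix in one dispatch.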
## Legacy API

The original UMAP struct and train() function are still available for backward compatibility:

```rust
use fast_umap::prelude::*;

// Legacy one-liner
let umap: UMAP<_> = umap(data);
let embedding = umap.transform(data);

// Legacy manual training
let model = train(config, data);
```

Note: the legacy API uses the dense O(n²) training path. Use the new `Umap::new(config).fit(data)` API for the optimized sparse training path.
## Testing

All tests run on CPU (burn::backend::NdArray) — no GPU required.

```shell
cargo test
```

| Category | What is covered |
|---|---|
| `normalize_data` | correctness, zero-mean/unit-std, constant columns |
| `format_duration` | zero, seconds, minutes, hours |
| `generate_test_data` | shape, bounds [0, 1) |
| `tensor_convert` | round-trip Vec → Tensor → Vec, NaN → 0 |
| `normalize_tensor` | output in [0, 1], constant-input safety |
| `layer_normalize` | no NaN, shape preserved |
| `UMAPModelConfigBuilder` | defaults and custom values |
| `TrainingConfig` | builder, Metric::from(&str), invalid-metric panic |
| `UMAPModel` | 2-D / 3-D output, deep network, determinism, no NaN |
| Distance math | Euclidean self=0, 3-4-5 triangle, symmetry; Manhattan |
## Micro-benchmarks

Reproduce: `./bench.sh --only hardware`
Full detail files: cpu_apple_m3_max.md · gpu_apple_m3_max.md · comparison_apple_m3_max.md · cpu_apple_silicon_aarch64.md
### CPU — Apple M3 Max (NdArray backend)

| Benchmark | Input | Min | Mean | Max |
|---|---|---|---|---|
| `normalize_data` | 100×10 | 345 µs | 517 µs | 986 µs |
| `normalize_data` | 500×30 | 1.92 ms | 2.31 ms | 2.77 ms |
| `normalize_data` | 1 000×50 | 4.35 ms | 4.78 ms | 5.79 ms |
| `normalize_data` | 5 000×100 | 16.4 ms | 18.2 ms | 19.4 ms |
| `generate_test_data` | 100×10 | 3.62 µs | 4.03 µs | 5.92 µs |
| `generate_test_data` | 500×30 | 56.3 µs | 57.8 µs | 61.9 µs |
| `generate_test_data` | 1 000×50 | 246 µs | 258 µs | 288 µs |
| `generate_test_data` | 5 000×100 | 2.46 ms | 2.48 ms | 2.54 ms |
| `tensor_convert` | 100×10 | 5.08 µs | 5.16 µs | 5.33 µs |
| `tensor_convert` | 500×30 | 32.4 µs | 33.0 µs | 35.8 µs |
| `tensor_convert` | 1 000×50 | 74.6 µs | 78.1 µs | 89.5 µs |
| `model_forward` | 16s×10f [32]→2 | 21.4 µs | 34.5 µs | 55.8 µs |
| `model_forward` | 64s×50f [64]→2 | 26.7 µs | 34.2 µs | 51.2 µs |
| `model_forward` | 128s×50f [128]→2 | 55.9 µs | 58.6 µs | 64.2 µs |
| `model_forward` | 64s×100f [128,64]→3 | 70.0 µs | 80.3 µs | 106 µs |
| `model_forward` | 256s×100f [256,128]→2 | 279 µs | 293 µs | 310 µs |
| `normalize_tensor` | n=64 | 1.88 µs | 1.95 µs | 2.29 µs |
| `normalize_tensor` | n=512 | 2.79 µs | 2.86 µs | 3.00 µs |
| `normalize_tensor` | n=4 096 | 9.50 µs | 9.59 µs | 10.3 µs |
| `normalize_tensor` | n=32 768 | 70.0 µs | 70.3 µs | 70.7 µs |
| `layer_normalize` | 32×16 | 3.50 µs | 3.62 µs | 3.96 µs |
| `layer_normalize` | 128×64 | 19.7 µs | 20.0 µs | 20.9 µs |
| `layer_normalize` | 512×128 | 115 µs | 117 µs | 130 µs |
| `layer_normalize` | 1 000×256 | 412 µs | 420 µs | 454 µs |
### GPU — Apple M3 Max (WGPU / Metal)

| Benchmark | Input | Min | Mean | Max |
|---|---|---|---|---|
| `model_forward` | 16s×10f [32]→2 | 408 µs | 617 µs | 894 µs |
| `model_forward` | 64s×50f [64]→2 | 430 µs | 481 µs | 776 µs |
| `model_forward` | 128s×50f [128]→2 | 432 µs | 475 µs | 576 µs |
| `model_forward` | 64s×100f [128,64]→3 | 549 µs | 688 µs | 1.82 ms |
| `model_forward` | 256s×100f [256,128]→2 | 631 µs | 691 µs | 828 µs |
| `model_forward` | 512s×100f [256,128]→2 | 926 µs | 1.08 ms | 1.42 ms |
| `normalize_tensor` | n=512 | 572 µs | 695 µs | 1.28 ms |
| `normalize_tensor` | n=4 096 | 590 µs | 662 µs | 821 µs |
| `normalize_tensor` | n=32 768 | 629 µs | 712 µs | 883 µs |
| `normalize_tensor` | n=262 144 | 1.08 ms | 1.12 ms | 1.22 ms |
| `layer_normalize` | 128×64 | 437 µs | 471 µs | 609 µs |
| `layer_normalize` | 512×128 | 467 µs | 500 µs | 648 µs |
| `layer_normalize` | 1 000×256 | 617 µs | 662 µs | 777 µs |
| `layer_normalize` | 4 000×512 | 1.81 ms | 1.93 ms | 2.15 ms |
### CPU vs GPU — Apple M3 Max

| Benchmark | Input | CPU | GPU | Speedup |
|---|---|---|---|---|
| `model_forward` | 16s×10f [32]→2 | 34.5 µs | 617 µs | 0.06× (CPU faster) |
| `model_forward` | 64s×50f [64]→2 | 34.2 µs | 481 µs | 0.07× (CPU faster) |
| `model_forward` | 128s×50f [128]→2 | 58.6 µs | 475 µs | 0.12× (CPU faster) |
| `model_forward` | 64s×100f [128,64]→3 | 80.3 µs | 688 µs | 0.12× (CPU faster) |
| `model_forward` | 256s×100f [256,128]→2 | 293 µs | 691 µs | 0.42× (CPU faster) |
| `normalize_tensor` | n=512 | 2.86 µs | 695 µs | 0.00× (CPU faster) |
| `normalize_tensor` | n=4 096 | 9.59 µs | 662 µs | 0.01× (CPU faster) |
| `normalize_tensor` | n=32 768 | 70.3 µs | 712 µs | 0.10× (CPU faster) |
| `layer_normalize` | 128×64 | 20.0 µs | 471 µs | 0.04× (CPU faster) |
| `layer_normalize` | 512×128 | 117 µs | 500 µs | 0.23× (CPU faster) |
| `layer_normalize` | 1 000×256 | 420 µs | 662 µs | 0.63× (CPU faster) |
Note: WGPU/Metal has a fixed dispatch overhead of ~400–700 µs per kernel call. For the small model sizes above, that overhead dominates. The GPU wins in full UMAP training loops over thousands of samples where operations are chained without intermediate CPU readbacks.
## Running benchmarks

```shell
# All benchmarks (hardware + comparison + MNIST)
./bench.sh

# + Criterion statistics (~5 min)
./bench.sh --criterion

# Just hardware micro-benchmarks
./bench.sh --only hardware
```
## Roadmap

- MNIST dataset example with intermediate plots
- Charting behind a feature flag
- Labels in plots
- Batching + accumulated gradient
- CubeCL kernels for distance computation
- Hyperparameter testbench (`patience` vs `n_features` vs `epochs` …)
- Unit tests (36) and hardware benchmarks
- New API mirroring umap-rs (`Umap`, `FittedUmap`, `UmapConfig`)
- Sparse training with edge subsampling + negative sampling
- Crate comparison benchmark (fast-umap vs umap-rs)
- PCA warm-start for initial embedding
- Approximate KNN (NN-descent) for datasets > 50K
## License

MIT — see LICENSE.

## Copyright

2024-2026, Eugene Hauptmann
## Citations

If you use fast-umap in research or a project, please cite the original UMAP paper, this repository, and acknowledge the Burn and CubeCL frameworks:

**fast-umap**
Hauptmann, E. (2024). fast-umap: GPU-Accelerated UMAP in Rust (v1.2.2). https://github.com/eugenehp/fast-umap

**UMAP algorithm**
McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426. https://arxiv.org/abs/1802.03426

**Burn deep-learning framework**

**CubeCL GPU compute framework**

## Thank you

Inspired by the original UMAP paper.

