# fast-umap
UMAP (Uniform Manifold Approximation and Projection) in Rust — GPU-accelerated via burn + CubeCL.
See docs.rs for the full API reference.
## What's New (v1.0.0)

| Area | Change |
|---|---|
| burn upgrade | 0.18 → 0.20.1; cubecl 0.6 → 0.9 |
| API fixes | `Backend::seed` now takes `(&device, seed)`; `as_tensor_arg` no longer takes a generic type param; `NodeID` renamed `NodeId` |
| Kernel fixes | `ABSOLUTE_POS_X`/`Y` cast to `usize` for shape indexing; all `for i in 0..k` loops use `k as usize` |
| GPU fix | `normalize_tensor` used `.to_vec::<bool>()`, which panics on WGPU (booleans are stored as u32); replaced with an f32 arithmetic comparison |
| Warning fixes | all 4 unused-`Result` warnings from kernel launches resolved with `.expect()` |
| Tests | 36 unit tests added (`cargo test`), all CPU-only via the NdArray backend |
| Benchmarks | `cargo run --release --bin bench_report` auto-detects CPU & GPU and writes hardware-tagged `.md` + `.svg` files plus a CPU vs GPU comparison |
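The `normalize_tensor` fix boils down to replacing a boolean device readback with arithmetic that stays in f32. A minimal CPU sketch of the pattern (not the crate's actual kernel code):

```rust
/// CPU sketch of the WGPU-safe normalization: the constant-input check is
/// done with an f32 epsilon guard rather than materializing a boolean
/// tensor and reading it back with `.to_vec::<bool>()`, which panics on
/// WGPU because booleans are stored as u32 on-device.
fn normalize_slice(data: &mut [f32]) {
    let min = data.iter().copied().fold(f32::INFINITY, f32::min);
    let max = data.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let range = max - min;
    // Pure-f32 comparison: a degenerate range collapses to a safe divisor.
    let safe = if range.abs() < f32::EPSILON { 1.0 } else { range };
    for v in data.iter_mut() {
        *v = (*v - min) / safe;
    }
}
```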
## Features

- Dimensionality Reduction — projects high-dimensional data to 2-D or 3-D for visualization
- GPU-accelerated kernels — custom CubeCL kernels for Euclidean pairwise distance and KNN, compiled for WGPU (Metal / Vulkan / DX12); a CPU reference sketch of the distance computation follows this list
- Automatic differentiation — full autograd through the custom kernels via burn's autodiff backend
- Flexible architecture — configurable hidden layers, output dims, distance metric, batch size, learning rate, early stopping, timeout
- CPU fallback — all model code runs on NdArray (no GPU required for inference or tests)
- 36 unit tests — covering normalization, tensor conversion, model shape, distance math
- Hardware-tagged benchmarks — CPU and GPU timings saved as Markdown + SVG, including a CPU vs GPU comparison chart
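For reference, this is what the Euclidean pairwise-distance kernel computes, written as a plain CPU loop nest. The shipped version is a CubeCL compute shader, so this is a readability sketch only:

```rust
/// Full N×N Euclidean pairwise-distance matrix over row-major data
/// (`data[i * d + k]` is feature k of sample i). CPU reference of the
/// GPU kernel's output, not the kernel itself.
fn pairwise_euclidean(data: &[f32], n: usize, d: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; n * n];
    for i in 0..n {
        for j in (i + 1)..n {
            let mut acc = 0.0f32;
            for k in 0..d {
                let diff = data[i * d + k] - data[j * d + k];
                acc += diff * diff;
            }
            let dist = acc.sqrt();
            out[i * n + j] = dist;
            out[j * n + i] = dist; // symmetry (checked by the unit tests)
        }
    }
    out
}
```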
## Roadmap

- MNIST dataset example with intermediate plots
- Charting behind a feature flag
- Labels in plots
- Batching + accumulated gradient
- CubeCL kernels for distance computation
- Hyperparameter testbench (`patience` vs `n_features` vs `epochs` …)
- Unit tests (36) and hardware benchmarks
- PCA warm-start for initial embedding (minimal sketch below)
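The planned PCA warm-start would seed the initial embedding from the data's top principal component(s) instead of random weights. A minimal power-iteration sketch of that computation (illustrative only, not crate code):

```rust
/// Top principal component via power iteration on the implicit covariance
/// matrix: v <- Cov·v, renormalized each step. Scaling by 1/n is absorbed
/// by the normalization.
fn top_component(data: &[Vec<f32>]) -> Vec<f32> {
    let d = data[0].len();
    let n = data.len() as f32;
    // Column means, for centering.
    let mut mean = vec![0.0f32; d];
    for row in data {
        for (m, v) in mean.iter_mut().zip(row) { *m += v / n; }
    }
    let mut v = vec![1.0f32; d];
    for _ in 0..100 {
        let mut next = vec![0.0f32; d];
        for row in data {
            // dot((row - mean), v), then accumulate (row - mean) * dot
            let dot: f32 = row.iter().zip(&mean).zip(&v)
                .map(|((x, m), vi)| (x - m) * vi)
                .sum();
            for ((nx, x), m) in next.iter_mut().zip(row).zip(&mean) {
                *nx += (x - m) * dot;
            }
        }
        let norm = next.iter().map(|x| x * x).sum::<f32>().sqrt().max(1e-12);
        for x in next.iter_mut() { *x /= norm; }
        v = next;
    }
    v
}
```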
## Testing

All tests run on CPU (`burn::backend::NdArray`) — no GPU required.

```sh
cargo test
```
| Category | What is covered |
|---|---|
| `normalize_data` | correctness, zero-mean/unit-std, constant columns |
| `format_duration` | zero, seconds, minutes, hours |
| `generate_test_data` | shape, bounds [0, 1) |
| `tensor_convert` | round-trip Vec → Tensor → Vec, NaN → 0 |
| `normalize_tensor` | output in [0, 1], constant-input safety |
| `layer_normalize` | no NaN, shape preserved |
| `UMAPModelConfigBuilder` | defaults and custom values |
| `TrainingConfig` | builder, `Metric::from(&str)`, invalid-metric panic |
| `UMAPModel` | 2-D / 3-D output, deep network, determinism, no NaN |
| Distance math | Euclidean self=0, 3-4-5 triangle, symmetry; Manhattan |
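As a flavour of the suite, a test along these lines covers the `generate_test_data` rows above (the module path and exact signature are assumptions here):

```rust
use fast_umap::utils::generate_test_data; // module path assumed

#[test]
fn generated_data_has_expected_shape_and_bounds() {
    let (n, d) = (100, 10);
    // Flat row-major Vec of n×d samples; return type assumed to be Vec<f64>.
    let data = generate_test_data(n, d);
    assert_eq!(data.len(), n * d);
    assert!(data.iter().all(|v| (0.0..1.0).contains(v))); // bounds [0, 1)
}
```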
## Benchmarks

Reproduce with: `cargo run --release --bin bench_report`

Full detail files: `cpu_apple_m3_max.md` · `gpu_apple_m3_max.md` · `comparison_apple_m3_max.md` · `cpu_apple_silicon_aarch64.md`
### CPU — Apple M3 Max (NdArray backend)

| Benchmark | Input | Min | Mean | Max |
|---|---|---|---|---|
| `normalize_data` | 100×10 | 329 µs | 450 µs | 797 µs |
| `normalize_data` | 500×30 | 1.88 ms | 2.22 ms | 2.98 ms |
| `normalize_data` | 1 000×50 | 3.68 ms | 4.64 ms | 5.71 ms |
| `normalize_data` | 5 000×100 | 15.8 ms | 17.9 ms | 19.4 ms |
| `generate_test_data` | 100×10 | 3.58 µs | 4.05 µs | 14.5 µs |
| `generate_test_data` | 500×30 | 56.0 µs | 57.9 µs | 72.5 µs |
| `generate_test_data` | 1 000×50 | 189 µs | 192 µs | 203 µs |
| `generate_test_data` | 5 000×100 | 1.89 ms | 1.91 ms | 1.97 ms |
| `tensor_convert` | 100×10 | 4.96 µs | 5.02 µs | 5.17 µs |
| `tensor_convert` | 500×30 | 29.7 µs | 31.2 µs | 40.6 µs |
| `tensor_convert` | 1 000×50 | 70.8 µs | 75.7 µs | 90.4 µs |
| `model_forward` | 16s×10f [32]→2 | 21.8 µs | 36.0 µs | 58.6 µs |
| `model_forward` | 64s×50f [64]→2 | 27.9 µs | 34.4 µs | 53.5 µs |
| `model_forward` | 128s×50f [128]→2 | 54.3 µs | 57.5 µs | 62.9 µs |
| `model_forward` | 64s×100f [128,64]→3 | 71.0 µs | 74.7 µs | 85.0 µs |
| `model_forward` | 256s×100f [256,128]→2 | 243 µs | 248 µs | 256 µs |
| `normalize_tensor` | n=64 | 1.96 µs | 2.03 µs | 2.17 µs |
| `normalize_tensor` | n=512 | 2.79 µs | 2.92 µs | 3.17 µs |
| `normalize_tensor` | n=4 096 | 9.50 µs | 9.70 µs | 14.6 µs |
| `normalize_tensor` | n=32 768 | 69.7 µs | 70.3 µs | 81.1 µs |
| `layer_normalize` | 32×16 | 3.50 µs | 3.61 µs | 3.75 µs |
| `layer_normalize` | 128×64 | 17.9 µs | 18.1 µs | 19.1 µs |
| `layer_normalize` | 512×128 | 115 µs | 117 µs | 126 µs |
| `layer_normalize` | 1 000×256 | 413 µs | 416 µs | 427 µs |
### CPU — Apple Silicon aarch64 (NdArray backend)

| Benchmark | Input | Min | Mean | Max |
|---|---|---|---|---|
| `normalize_data` | 100×10 | 8.63 ms | 9.85 ms | 11.4 ms |
| `normalize_data` | 500×30 | 36.8 ms | 41.5 ms | 51.4 ms |
| `normalize_data` | 1 000×50 | 67.2 ms | 72.8 ms | 78.9 ms |
| `normalize_data` | 5 000×100 | 173 ms | 178 ms | 183 ms |
| `generate_test_data` | 100×10 | 5.08 µs | 5.82 µs | 10.4 µs |
| `generate_test_data` | 500×30 | 60.8 µs | 62.2 µs | 75.8 µs |
| `generate_test_data` | 1 000×50 | 198 µs | 204 µs | 269 µs |
| `generate_test_data` | 5 000×100 | 2.26 ms | 2.59 ms | 3.20 ms |
| `tensor_convert` | 100×10 | 20.3 µs | 22.6 µs | 41.5 µs |
| `tensor_convert` | 500×30 | 129 µs | 138 µs | 159 µs |
| `tensor_convert` | 1 000×50 | 328 µs | 354 µs | 404 µs |
| `model_forward` | 16s×10f [32]→2 | 34.7 µs | 121 µs | 2.40 ms |
| `model_forward` | 64s×50f [64]→2 | 45.7 µs | 111 µs | 1.35 ms |
| `model_forward` | 128s×50f [128]→2 | 162 µs | 263 µs | 934 µs |
| `model_forward` | 64s×100f [128,64]→3 | 291 µs | 315 µs | 362 µs |
| `model_forward` | 256s×100f [256,128]→2 | 790 µs | 857 µs | 1.03 ms |
| `normalize_tensor` | n=64 | 4.46 µs | 6.10 µs | 21.6 µs |
| `normalize_tensor` | n=512 | 3.54 µs | 4.16 µs | 32.1 µs |
| `normalize_tensor` | n=4 096 | 12.2 µs | 13.8 µs | 27.8 µs |
| `normalize_tensor` | n=32 768 | 176 µs | 195 µs | 227 µs |
| `layer_normalize` | 32×16 | 4.08 µs | 4.95 µs | 21.0 µs |
| `layer_normalize` | 128×64 | 64.8 µs | 75.8 µs | 101 µs |
| `layer_normalize` | 512×128 | 370 µs | 410 µs | 455 µs |
| `layer_normalize` | 1 000×256 | 1.44 ms | 1.56 ms | 1.90 ms |
### GPU — Apple M3 Max (WGPU / Metal)

| Benchmark | Input | Min | Mean | Max |
|---|---|---|---|---|
| `model_forward` | 16s×10f [32]→2 | 419 µs | 565 µs | 770 µs |
| `model_forward` | 64s×50f [64]→2 | 424 µs | 466 µs | 585 µs |
| `model_forward` | 128s×50f [128]→2 | 424 µs | 474 µs | 551 µs |
| `model_forward` | 64s×100f [128,64]→3 | 565 µs | 637 µs | 825 µs |
| `model_forward` | 256s×100f [256,128]→2 | 638 µs | 675 µs | 724 µs |
| `model_forward` | 512s×100f [256,128]→2 | 921 µs | 1.10 ms | 1.32 ms |
| `normalize_tensor` | n=512 | 560 µs | 660 µs | 957 µs |
| `normalize_tensor` | n=4 096 | 555 µs | 650 µs | 828 µs |
| `normalize_tensor` | n=32 768 | 642 µs | 711 µs | 844 µs |
| `normalize_tensor` | n=262 144 | 1.06 ms | 1.13 ms | 1.23 ms |
| `layer_normalize` | 128×64 | 365 µs | 419 µs | 648 µs |
| `layer_normalize` | 512×128 | 407 µs | 503 µs | 1.46 ms |
| `layer_normalize` | 1 000×256 | 568 µs | 715 µs | 1.67 ms |
| `layer_normalize` | 4 000×512 | 1.75 ms | 2.12 ms | 3.01 ms |
### CPU vs GPU — Apple M3 Max

| Benchmark | Input | CPU | GPU | Speedup |
|---|---|---|---|---|
| `model_forward` | 16s×10f [32]→2 | 36.0 µs | 565 µs | 0.06× (CPU faster) |
| `model_forward` | 64s×50f [64]→2 | 34.4 µs | 466 µs | 0.07× (CPU faster) |
| `model_forward` | 128s×50f [128]→2 | 57.5 µs | 474 µs | 0.12× (CPU faster) |
| `model_forward` | 64s×100f [128,64]→3 | 74.7 µs | 637 µs | 0.12× (CPU faster) |
| `model_forward` | 256s×100f [256,128]→2 | 248 µs | 675 µs | 0.37× (CPU faster) |
| `normalize_tensor` | n=512 | 2.92 µs | 660 µs | 0.00× (CPU faster) |
| `normalize_tensor` | n=4 096 | 9.70 µs | 650 µs | 0.01× (CPU faster) |
| `normalize_tensor` | n=32 768 | 70.3 µs | 711 µs | 0.10× (CPU faster) |
| `layer_normalize` | 128×64 | 18.1 µs | 419 µs | 0.04× (CPU faster) |
| `layer_normalize` | 512×128 | 117 µs | 503 µs | 0.23× (CPU faster) |
| `layer_normalize` | 1 000×256 | 416 µs | 715 µs | 0.58× (CPU faster) |
### Why is CPU faster for these sizes?

WGPU/Metal has a fixed dispatch overhead of ~400–700 µs per kernel call. For the small model sizes benchmarked above, that overhead dominates the actual compute. The GPU wins when:

- Batch sizes are large (≥ 10 000 samples per forward pass)
- Models are wide/deep (≥ 1 024 hidden units)
- Operations are chained without intermediate CPU readbacks
- The custom `CubeBackend` kernels run on large N×N pairwise distance matrices (cost grows as O(N²), quickly justifying the dispatch overhead)

In a full UMAP training loop over thousands of samples the GPU path is significantly faster; the numbers above reflect isolated micro-benchmarks with a forced CPU↔GPU sync after every operation.
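A back-of-envelope check of the break-even point, using the ~500 µs midpoint of the dispatch range above and an assumed effective CPU throughput (all constants here are illustrative, not measurements from this crate):

```rust
/// Rough break-even estimate: a GPU kernel pays off once the equivalent
/// CPU compute time exceeds the fixed dispatch overhead.
fn main() {
    let dispatch_us = 500.0_f64; // midpoint of the ~400–700 µs range above
    let d = 100_u64;             // feature dimension (assumed)
    let ns_per_op = 0.1_f64;     // ~10 GFLOP/s effective CPU throughput (assumed)

    for n in [100_u64, 1_000, 10_000] {
        let ops = n * n * d; // O(N²·d) multiply-adds for pairwise distances
        let cpu_us = ops as f64 * ns_per_op / 1_000.0;
        let verdict = if cpu_us > dispatch_us { "GPU wins" } else { "CPU wins" };
        println!("N = {n:>6}: ~{cpu_us:.0} µs CPU compute vs {dispatch_us} µs dispatch -> {verdict}");
    }
}
```

With these assumptions the crossover lands between N = 100 and N = 1 000 samples, which matches the observation that large pairwise matrices justify the dispatch cost while micro-benchmarks do not.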
### Running benchmarks on your Mac

```sh
# 1. Prerequisites
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh   # Rust
xcode-select --install                                           # Metal / Xcode CLT

# 2. Run — writes hardware-tagged files to benches/results/
cargo run --release --bin bench_report

# 3. Open the comparison chart
open benches/results/comparison.svg
```

Or use the convenience script:

```sh
./bench.sh             # hardware report only (~1 min)
./bench.sh --criterion # + Criterion stats (~5 min)
```
Output files written automatically:

```text
benches/results/
  cpu_<chip>.md / .svg          CPU timings
  gpu_<chip>.md / .svg          GPU timings (when GPU available)
  comparison_<chip>.md / .svg   CPU vs GPU, speedup ratios
  benchmark_results.md / .svg   copy of latest CPU run
```
## Installation

```sh
cargo add fast-umap
```

Cargo.toml:

```toml
[dependencies]
fast-umap = "1.0.0"
burn = { version = "0.20.1", features = ["wgpu", "autodiff", "autotune"] }
cubecl = { version = "0.9.0", features = ["wgpu"] }
```
## Quick start

```rust
use cubecl::wgpu::WgpuRuntime; // import path assumed
use fast_umap::prelude::*;     // import path assumed
```

### Model Configuration

```rust
// Argument values are illustrative.
let model_config = UMAPModelConfigBuilder::default()
    .input_size(100)             // input dimensionality
    .hidden_sizes(vec![128, 64]) // one or more hidden layers
    .output_size(2)              // 2-D or 3-D embedding
    .build()
    .unwrap();
```
## Training

```rust
use cubecl::wgpu::WgpuRuntime; // import path assumed
use fast_umap::prelude::*;     // import path assumed

type MyBackend = CubeBackend<WgpuRuntime, f32, i32, u32>; // generic params assumed
type MyAutodiffBackend = Autodiff<MyBackend>;

let device = Default::default();
MyBackend::seed(&device, 9999); // v1.0.0 API: seed takes (&device, seed)

let num_samples = 1_000;
let num_features = 100;
let train_data: Vec<f64> = generate_test_data(num_samples, num_features);

let model: UMAPModel<MyAutodiffBackend> = UMAPModel::new(&model_config, &device); // constructor args assumed

// All values below are illustrative; defaults are listed in the parameter table further down.
let config = TrainingConfig::builder()
    .with_epochs(100)
    .with_batch_size(1_000)
    .with_learning_rate(1e-3)
    .with_beta1(0.9)
    .with_beta2(0.999)
    .with_k_neighbors(10)
    .with_metric(Metric::from("euclidean")) // metric name string assumed
    .with_verbose(true)
    .with_patience(10)
    .with_min_desired_loss(0.001)
    .with_timeout(60)
    .build()
    .unwrap();

// Channel + Ctrl-C handler so training can be interrupted cleanly.
let (tx, rx) = unbounded();
ctrlc::set_handler(move || { let _ = tx.send(()); }).unwrap();

let (model, _losses) = train(model, num_samples, num_features, train_data, &config, rx); // signature assumed

// Inference
let model = model.valid(); // drop autodiff: burn's AutodiffModule::valid()
let input = convert_vector_to_tensor(
    generate_test_data(num_samples, num_features), // fresh data for inference
    num_samples, num_features, &device,            // signature assumed
);
let embedding = model.forward(input); // shape [num_samples, 2]
```
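The `unbounded` channel and `set_handler` calls suggest the original example uses the `crossbeam-channel` and `ctrlc` crates (an assumption from the identifiers) to stop training gracefully on Ctrl-C; add those crates as dependencies if you reproduce the snippet outside the repository.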
### Key training parameters

| Parameter | Default | Description |
|---|---|---|
| `epochs` | 100 | Training epochs |
| `batch_size` | 1 000 | Samples per batch |
| `learning_rate` | 1e-3 | Adam step size |
| `beta1` / `beta2` | 0.9 / 0.999 | Adam moment decay |
| `penalty` | 0.0 | L2 weight decay |
| `k_neighbors` | 10 | KNN graph neighbours |
| `metric` | `Euclidean` | `Euclidean` · `EuclideanKNN` · `Manhattan` · `Cosine` |
| `patience` | `None` | Early-stop epochs without improvement |
| `min_desired_loss` | `None` | Stop when loss ≤ threshold |
| `timeout` | `None` | Hard time limit (seconds) |
| `verbose` | `true` | Progress bar |
## Examples

### Simple — random data, 2-D embedding

```sh
cargo run --example simple
```

Full source in `examples/simple.rs`.

### Advanced — 1 000 samples, 100 features, custom config

```sh
cargo run --example advanced
```

Full source in `examples/advanced.rs`.
### MNIST — 70 000 hand-written digits projected to 2-D

```sh
cargo run --example mnist
```

Downloads the MNIST dataset on first run (~12 MB), trains UMAP, and saves:

| Output | File |
|---|---|
| 2-D digit embedding (coloured by class) | `mnist.png` |
| MNIST loss curve | `losses_mnist.png` |
### Regenerating example images

The images above are generated by the examples themselves (GPU required). To refresh them on your Mac:

```sh
# Regenerate all example output images
./generate_images.sh
```

Or individually:

```sh
cargo run --release --example simple   # → plot.png
cargo run --release --example advanced # → plot.png + losses.png
cargo run --release --example mnist    # → mnist.png + losses_mnist.png
```
## License

MIT — see LICENSE.

## Copyright

2024-2026, Eugene Hauptmann
## Citations

If you use fast-umap in research or a project, please cite the original UMAP paper, this repository, and acknowledge the Burn and CubeCL frameworks:

fast-umap
Hauptmann, E. (2024). fast-umap: GPU-Accelerated UMAP in Rust (v1.0.0). https://github.com/eugenehp/fast-umap

UMAP algorithm
McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426. https://arxiv.org/abs/1802.03426

Burn deep-learning framework
https://github.com/tracel-ai/burn

CubeCL GPU compute framework
https://github.com/tracel-ai/cubecl

## Thank you

Inspired by the original UMAP paper.