# fast-umap

GPU-accelerated parametric UMAP (Uniform Manifold Approximation and Projection) in Rust, built on burn + CubeCL.

See docs.rs for the full API reference.
## Highlights

- Up to 4.7× faster than umap-rs on datasets ≥ 10 000 samples (see benchmarks below)
- Parametric — trains a neural network, so you can `transform()` new unseen data instantly
- GPU-accelerated — custom CubeCL kernels for pairwise distance and KNN, compiled for Metal / Vulkan / DX12 via WGPU
- API mirrors umap-rs — drop-in replacement with `Umap::new(config).fit(data)`
- Automatic differentiation — full autograd through custom GPU kernels
- CPU fallback — runs on the NdArray backend (no GPU required for inference or tests)
## Performance — fast-umap vs umap-rs
Benchmarked on Apple M3 Max. Both crates receive identical random data. fast-umap runs 50 epochs (parametric, GPU); umap-rs runs 200 epochs (classical SGD, CPU). Total time includes data prep + KNN + fit + extract.
| Dataset | fast-umap | umap-rs | Speedup |
|---|---|---|---|
| 500 × 50 | 0.84s | 0.08s | 0.10× (umap-rs faster) |
| 1 000 × 50 | 2.19s | 0.12s | 0.05× (umap-rs faster) |
| 2 000 × 100 | 3.65s | 0.44s | 0.12× (umap-rs faster) |
| 5 000 × 100 | 6.75s | 2.31s | 0.34× (umap-rs faster) |
| 10 000 × 100 | 5.93s | 8.68s | 1.5× faster 🚀 |
| 20 000 × 100 | 7.32s | 34.10s | 4.7× faster 🚀 |
Crossover ≈ 10 000 samples. Below that, umap-rs wins on raw CPU efficiency. Above it, fast-umap pulls ahead and the gap widens with dataset size — umap-rs's brute-force KNN scales O(n²), while fast-umap's per-epoch cost is capped.
Why fast-umap wins at scale:
| Technique | Effect |
|---|---|
| Sparse edge-based loss | O(n·k) per epoch instead of O(n²) |
| Edge subsampling | Caps edges/epoch at 50K regardless of n |
| Pre-batched negative samples on GPU | Zero per-epoch CPU→GPU transfers |
| Fused index gather | 2 GPU select() calls instead of 4 |
| Async loss readback | GPU→CPU sync every 5 epochs, not every epoch |
| In-memory checkpointing | No disk I/O during training |
| GPU-accelerated KNN | Full n×n pairwise distance on GPU (one-time cost) |
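The cost capping in the table reduces to simple arithmetic. The sketch below is illustrative only (the helper names are not part of fast-umap's API); the 50 000-edge cap and the `n_pos × rate` negative-sampling formula come from the sections above:

```rust
/// Positive edges processed per epoch: O(n·k), hard-capped at 50 000.
fn edges_per_epoch(n: usize, k: usize) -> usize {
    (n * k).min(50_000)
}

/// Negative (repulsion) samples per epoch: n_pos × neg_sample_rate.
fn negatives_per_epoch(n_pos: usize, neg_sample_rate: usize) -> usize {
    n_pos * neg_sample_rate
}

fn main() {
    // 20 000 samples with k = 15 neighbours would give 300 000 edges,
    // but subsampling caps the per-epoch cost regardless of n:
    let n_pos = edges_per_epoch(20_000, 15);
    assert_eq!(n_pos, 50_000);

    // With the default neg_sample_rate of 5:
    assert_eq!(negatives_per_epoch(n_pos, 5), 250_000);
}
```

This is why the comparison table flattens out for fast-umap while umap-rs grows quadratically.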
Reproduce:

```shell
./bench.sh --only comparison
```

Or run all benchmarks at once (hardware + comparison + MNIST):

```shell
./bench.sh
```
## What's New (v1.2.2)
See CHANGELOG.md for the full release history.
| Area | Change |
|---|---|
| GPU cooldown | New cooldown_ms parameter — sleep N ms between epochs to prevent 100 % GPU utilisation; default 0 (no change to existing behaviour) |
| UMAP kernel | Proper q = 1/(1 + a·d^(2b)) kernel with a, b fitted from min_dist/spread — replaces fixed Student-t 1/(1+d²) for better cluster separation |
| Configurable negative sampling | neg_sample_rate parameter (default 5); formula fixed from n_pos × rate / k → n_pos × rate |
| Verbose logging | All training output gated behind verbose flag; improved structured messages with timings, edge counts, kernel params, stop reasons |
| ManifoldParams | min_dist and spread now actively shape the embedding kernel (previously defined but unused) |
| New API | Umap::new(config).fit(data) returns FittedUmap with .embedding(), .transform(), .into_embedding() — mirrors umap-rs |
| Sparse training | O(n·k) per epoch with edge subsampling + configurable negative sampling (was O(n²)) |
## Features

- Dimensionality reduction — projects high-dimensional data to 2-D or 3-D for visualization
- Parametric model — learned neural network can project new, unseen data via `transform()`
- GPU-accelerated kernels — custom CubeCL kernels for Euclidean pairwise distance and KNN, compiled for WGPU (Metal / Vulkan / DX12)
- Automatic differentiation — full autograd through the custom kernels via burn's autodiff backend
- Sparse training — edge subsampling + negative sampling keeps per-epoch cost constant regardless of dataset size
- Flexible architecture — configurable hidden layers, output dims, distance metric, learning rate, early stopping, timeout
- CPU fallback — all model code runs on NdArray (no GPU required for inference or tests)
- 36 unit tests — covering normalization, tensor conversion, model shape, distance math
- Hardware-tagged benchmarks — CPU and GPU timings saved as Markdown + SVG, including a CPU vs GPU comparison chart
## Installation

```shell
cargo add fast-umap
```

Or in `Cargo.toml`:

```toml
[dependencies]
fast-umap = "1.2.2"
burn = { version = "0.20.1", features = ["wgpu", "autodiff", "autotune"] }
cubecl = { version = "0.9.0", features = ["wgpu"] }
```
## Quick Start

```rust
use burn::backend::wgpu::WgpuRuntime;
use fast_umap::prelude::*;

type MyBackend = CubeBackend<WgpuRuntime, f32, i32, u32>;
type MyAutodiffBackend = Autodiff<MyBackend>;

// 100 samples × 10 features
let data: Vec<Vec<f32>> = generate_test_data(100, 10)
    .chunks(10)
    .map(|chunk| chunk.to_vec())
    .collect();

// Fit UMAP (default: 2-D output, Euclidean distance)
let config = UmapConfig::default();
let umap = Umap::<MyAutodiffBackend>::new(config);
let fitted = umap.fit(data);

// Get embedding
let embedding = fitted.embedding();
println!("{embedding:?}");
```
## Backend Choice: CPU vs GPU with Feature-Based Compilation

fast-umap supports both CPU and GPU backends, selected at compile time via Cargo features:
### GPU Backend (WGPU) — Recommended

For GPU-accelerated execution (requires a WGPU-compatible GPU):

```rust
use burn::backend::wgpu::WgpuRuntime;
use fast_umap::prelude::*;

type MyBackend = CubeBackend<WgpuRuntime, f32, i32, u32>;
type MyAutodiffBackend = Autodiff<MyBackend>;

let config = UmapConfig::default();
let umap = Umap::<MyAutodiffBackend>::new(config);
let fitted = umap.fit(data);
```
Features:
- ✅ Full parametric UMAP with neural network training
- ✅ GPU acceleration via WGPU (Metal/Vulkan/DX12)
- ✅ Transform new data through trained model
- ✅ Custom GPU kernels for efficient computation
- ✅ Automatic differentiation for training
### CPU Backend — Full Functionality with umap-rs Fallback
For CPU-only execution or when GPU is not available, fast-umap provides a complete CPU backend that uses classical UMAP computation:
Features:
- ✅ Complete UMAP functionality using classical UMAP algorithm
- ✅ No GPU required
- ✅ Same API as GPU backend for consistency
- ✅ Full configuration support (n_components, n_neighbors, etc.)
- ✅ Excellent for environments without GPU access
- ❌ Cannot transform new data (classical UMAP limitation)
When to use CPU backend:
- Development and testing environments without GPU
- Cloud environments with CPU-only instances
- Edge devices without GPU acceleration
- Fallback when GPU drivers are unavailable
See the cpu_training_demo example for complete CPU usage.
### Feature-Based Compilation

fast-umap uses Cargo features to enable only the backends you need:

```shell
# GPU backend (default, includes gpu + verbose features)
cargo build --release

# CPU-only backend
cargo build --release --features cpu

# Minimal build (no backends, library mode only)
cargo build --release --no-default-features

# All features (development)
cargo build --release --features all
```

See the feature_demo example for the complete feature matrix.
### Runtime Backend Selection

The backend_choice example demonstrates runtime backend selection:

```shell
# Run with GPU backend
cargo run --release --example backend_choice gpu

# Run with CPU backend
cargo run --release --features cpu --example backend_choice cpu
```
### CPU Testing Examples

For working CPU examples, see the test suite:

```shell
# Run CPU-based tests
cargo test --lib
```

The tests in tests/tests.rs demonstrate NdArray backend usage for CPU computation.
## API Overview

The public API mirrors umap-rs:

| Type | Description |
|---|---|
| `Umap<B>` | Main algorithm struct — `Umap::new(config)` |
| `FittedUmap<B>` | Fitted model — `.embedding()`, `.transform()`, `.into_embedding()`, `.config()` |
| `UmapConfig` | Configuration with nested `GraphParams` + `OptimizationParams` |
| `ManifoldParams` | `min_dist`, `spread` — control cluster tightness and separation |
| `GraphParams` | `n_neighbors`, `metric`, `normalized`, `minkowski_p` |
| `OptimizationParams` | `n_epochs`, `learning_rate`, `patience`, `timeout`, `verbose`, `neg_sample_rate`, … |
| `Metric` | `Euclidean`, `EuclideanKNN`, `Manhattan`, `Cosine` |
## Configuration

```rust
// Start from defaults and override the fields you need (see the table below)
let config = UmapConfig::default();
```
### Key parameters

| Parameter | Default | Description |
|---|---|---|
| `n_components` | 2 | Output dimensionality (2-D or 3-D) |
| `hidden_sizes` | `[100, 100, 100]` | Neural network hidden layer sizes |
| `min_dist` | 0.1 | Min distance in embedding — smaller = tighter clusters |
| `spread` | 1.0 | Effective scale of embedded points |
| `n_neighbors` | 15 | KNN graph neighbours |
| `n_epochs` | 200 | Training epochs |
| `learning_rate` | 1e-3 | Adam step size |
| `batch_size` | 1 000 | Samples per batch |
| `penalty` | 0.0 | L2 weight decay |
| `metric` | `Euclidean` | Distance metric |
| `repulsion_strength` | 1.0 | Repulsion term weight |
| `neg_sample_rate` | 5 | Negative (repulsion) samples per positive edge per epoch |
| `cooldown_ms` | 0 | Sleep N ms between epochs to reduce GPU utilisation (0 = disabled) |
| `patience` | `None` | Early-stop epochs without improvement |
| `min_desired_loss` | `None` | Stop when loss ≤ threshold |
| `timeout` | `None` | Hard time limit (seconds) |
| `verbose` | `true` | Progress bar + loss plots |
## Transform new data

Because fast-umap is parametric (a neural network), it can project new unseen data — something classical UMAP cannot do:

```rust
let fitted = umap.fit(data);

// Project new data through the trained model
let new_embedding = fitted.transform(new_data);
```
## Examples

### CPU Capabilities Demo

Demonstrates CPU backend capabilities for utility functions and tensor operations:

```shell
cargo run --release --features cpu --example cpu_training_demo
```

### Backend Choice — CPU vs GPU selection

Demonstrates how to choose between CPU (NdArray) and GPU (WGPU) backends at runtime:

```shell
# Run with GPU backend (default)
cargo run --release --example backend_choice gpu

# Run with CPU backend
cargo run --release --example backend_choice cpu
```
Note about CPU backend: The CPU option shows backend selection structure. For utility functions, see cpu_training_demo. The GPU backend provides full parametric UMAP with neural network training and the ability to transform new data.
### Simple — random data, 2-D embedding

```shell
cargo run --release --example simple
```
### Advanced — 1 000 samples, custom config

```shell
cargo run --release --example advanced
```
### MNIST — 10 000 hand-written digits projected to 2-D

```shell
cargo run --release --example mnist   # quick run (no figures)
./bench.sh --only mnist               # generate figures
```

Downloads the MNIST dataset on first run (~12 MB) and trains UMAP on 10K digits (784 features → 2-D).

| 2-D digit embedding (coloured by class) | Loss curve |
|---|---|
| ![MNIST 2-D embedding](figures/mnist.png) | ![Loss curve](figures/losses_model.png) |

```rust
let config = UmapConfig::default();
let umap = Umap::new(config);
let fitted = umap.fit(data);
```
## Generating Figures

All figures are generated by the benchmark suite, not by the examples. Examples are lightweight smoke tests that verify correctness without writing any files.

```shell
# Run all benchmarks and generate all figures
./bench.sh

# Run only specific benchmarks
./bench.sh --only comparison        # fast-umap vs umap-rs chart
./bench.sh --only mnist             # MNIST embedding + loss curve
./bench.sh --only hardware          # CPU + GPU micro-benchmarks
./bench.sh --only mnist comparison  # combine multiple

# Skip MNIST (saves ~70s)
./bench.sh --skip-mnist

# Include Criterion statistical suite
./bench.sh --criterion
```
### What generates what

| Benchmark | Output |
|---|---|
| `comparison` | `figures/crate_comparison.{svg,json,md}` |
| `mnist` | `figures/mnist.png`, `figures/losses_model.png` |
| `hardware` | `benches/results/{cpu,gpu,comparison}_*.{svg,md}` |
| `criterion` | `target/criterion/` (HTML reports) |
### Run examples (no figures)

```shell
./run_all.sh               # all examples
./run_all.sh --skip-mnist  # skip MNIST download
```
## Developer Experience Improvements

The feature-based compilation system provides:

### ✅ Faster Compilation

- Compile only what you need: `cargo build --release --features cpu`
- Minimal builds: `cargo build --release --no-default-features`
- Feature-specific examples and documentation

### ✅ Smaller Binaries

- GPU-only: exclude CPU dependencies (save ~5 MB)
- CPU-only: exclude GPU dependencies (save ~10 MB)
- Library mode: minimal footprint for integration
### ✅ Clear Feature Matrix

| Feature | Description | When to Use |
|---|---|---|
| `gpu` | GPU backend (WGPU) | Production, large datasets |
| `cpu` | CPU backend (umap-rs) | CPU-only environments |
| `verbose` | Progress output | Development, debugging |
| `plotters` | Visualization | Exploration, analysis |
| `all` | Everything | Development, testing |
### ✅ Better Error Messages
- Clear feature requirements at compile time
- Graceful runtime fallback handling
- Comprehensive validation
See the feature_demo example for the complete feature guide.
## Architecture

fast-umap uses a parametric approach — a small feed-forward neural network is trained with the UMAP cross-entropy loss:

```
attraction = mean_{k-NN edges} [ −log q_ij ]
repulsion  = mean_{negative samples} [ −log (1 − q_ij) ]
loss       = attraction + repulsion_strength × repulsion
```

where q_ij = 1 / (1 + a · d_ij^(2b)) is the UMAP kernel applied to embedding distances (a and b are fitted from min_dist / spread).
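As a concrete sketch, the kernel and the two per-edge loss terms can be written in a few lines of plain Rust. This is illustrative only — the crate computes these on GPU tensors, and the fitting of a and b from min_dist / spread is not shown:

```rust
/// UMAP low-dimensional kernel: q = 1 / (1 + a · d^(2b)).
/// With a = 1 and b = 1 this reduces to the Student-t kernel 1/(1 + d²).
fn umap_kernel(d: f32, a: f32, b: f32) -> f32 {
    1.0 / (1.0 + a * d.powf(2.0 * b))
}

/// Attraction term for one k-NN edge: −log q (clamped for numerical safety).
fn attraction(q: f32) -> f32 {
    -q.max(1e-12).ln()
}

/// Repulsion term for one negative sample: −log(1 − q).
fn repulsion(q: f32) -> f32 {
    -(1.0 - q).max(1e-12).ln()
}

fn main() {
    // Points at distance 0 are maximally similar: q = 1, zero attraction loss.
    let q_near = umap_kernel(0.0, 1.0, 1.0);
    assert!((q_near - 1.0).abs() < 1e-6);
    assert!(attraction(q_near) < 1e-6);

    // Distant points have q → 0, so their repulsion loss is also near zero.
    let q_far = umap_kernel(100.0, 1.0, 1.0);
    assert!(repulsion(q_far) < 1e-3);
}
```

Minimising attraction pulls neighbours together; minimising repulsion pushes negative samples apart, weighted by `repulsion_strength`.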
### Training pipeline

```
Input data [n, features]
          │
          ▼
GPU pairwise distance → KNN graph (one-time O(n²) cost)
          │
          ▼
┌─── Per epoch (cost: O(min(n·k, 50K))) ─────┐
│ Forward pass: data → neural net → [n, 2]   │
│ Edge subsampling from KNN graph            │
│ Negative sampling (random non-neighbors)   │
│ UMAP cross-entropy loss                    │
│ Backward pass + Adam optimizer step        │
└────────────────────────────────────────────┘
          │
          ▼
FittedUmap with .embedding() and .transform()
```
### Modules

| Module | Description |
|---|---|
| [model] | UMAPModel neural network and config builder |
| [train] | Training loop, UmapConfig, sparse training, loss computation |
| [chart] | 2-D scatter plots and loss curves (plotters) |
| [utils] | Data generation, tensor conversion, normalisation |
| [kernels] | Custom CubeCL GPU kernels (Euclidean distance, k-NN) |
| [backend] | Backend trait extension for custom kernel dispatch |
| [distances] | CPU-side distance functions (Euclidean, cosine, Minkowski…) |
| [prelude] | Re-exports of the most commonly used items |
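As a hedged illustration of the CPU-side metrics listed above (this is a sketch, not the crate's actual code), the Euclidean and Manhattan distances reduce to:

```rust
/// Euclidean (L2) distance between two feature vectors.
fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// Manhattan (L1) distance.
fn manhattan(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
}

fn main() {
    // 3-4-5 right triangle; self-distance is zero.
    assert_eq!(euclidean(&[0.0, 0.0], &[3.0, 4.0]), 5.0);
    assert_eq!(euclidean(&[1.0, 2.0], &[1.0, 2.0]), 0.0);
    assert_eq!(manhattan(&[0.0, 0.0], &[3.0, 4.0]), 7.0);
}
```

The GPU kernels compute the same quantity, but for the full n×n matrix in one dispatch.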
## Legacy API

The original UMAP struct and train() function are still available for backward compatibility:

```rust
use fast_umap::prelude::*;

// Legacy one-liner
let umap: UMAP<_> = umap(data);
let embedding = umap.transform(data);

// Legacy manual training
let model = train(config, data);
```

Note: the legacy API uses the dense O(n²) training path. Use the new `Umap::new(config).fit(data)` API for the optimized sparse training path.
## Testing

All tests run on CPU (burn::backend::NdArray) — no GPU required.

```shell
cargo test
```

| Category | What is covered |
|---|---|
| `normalize_data` | correctness, zero-mean/unit-std, constant columns |
| `format_duration` | zero, seconds, minutes, hours |
| `generate_test_data` | shape, bounds [0, 1) |
| `tensor_convert` | round-trip Vec → Tensor → Vec, NaN → 0 |
| `normalize_tensor` | output in [0, 1], constant-input safety |
| `layer_normalize` | no NaN, shape preserved |
| `UMAPModelConfigBuilder` | defaults and custom values |
| `TrainingConfig` | builder, Metric::from(&str), invalid-metric panic |
| `UMAPModel` | 2-D / 3-D output, deep network, determinism, no NaN |
| Distance math | Euclidean self=0, 3-4-5 triangle, symmetry; Manhattan |
## Micro-benchmarks

Reproduce: `./bench.sh --only hardware`
Full detail files: cpu_apple_m3_max.md · gpu_apple_m3_max.md · comparison_apple_m3_max.md · cpu_apple_silicon_aarch64.md
### CPU — Apple M3 Max (NdArray backend)

| Benchmark | Input | Min | Mean | Max |
|---|---|---|---|---|
| `normalize_data` | 100×10 | 345 µs | 517 µs | 986 µs |
| `normalize_data` | 500×30 | 1.92 ms | 2.31 ms | 2.77 ms |
| `normalize_data` | 1 000×50 | 4.35 ms | 4.78 ms | 5.79 ms |
| `normalize_data` | 5 000×100 | 16.4 ms | 18.2 ms | 19.4 ms |
| `generate_test_data` | 100×10 | 3.62 µs | 4.03 µs | 5.92 µs |
| `generate_test_data` | 500×30 | 56.3 µs | 57.8 µs | 61.9 µs |
| `generate_test_data` | 1 000×50 | 246 µs | 258 µs | 288 µs |
| `generate_test_data` | 5 000×100 | 2.46 ms | 2.48 ms | 2.54 ms |
| `tensor_convert` | 100×10 | 5.08 µs | 5.16 µs | 5.33 µs |
| `tensor_convert` | 500×30 | 32.4 µs | 33.0 µs | 35.8 µs |
| `tensor_convert` | 1 000×50 | 74.6 µs | 78.1 µs | 89.5 µs |
| `model_forward` | 16s×10f [32]→2 | 21.4 µs | 34.5 µs | 55.8 µs |
| `model_forward` | 64s×50f [64]→2 | 26.7 µs | 34.2 µs | 51.2 µs |
| `model_forward` | 128s×50f [128]→2 | 55.9 µs | 58.6 µs | 64.2 µs |
| `model_forward` | 64s×100f [128,64]→3 | 70.0 µs | 80.3 µs | 106 µs |
| `model_forward` | 256s×100f [256,128]→2 | 279 µs | 293 µs | 310 µs |
| `normalize_tensor` | n=64 | 1.88 µs | 1.95 µs | 2.29 µs |
| `normalize_tensor` | n=512 | 2.79 µs | 2.86 µs | 3.00 µs |
| `normalize_tensor` | n=4 096 | 9.50 µs | 9.59 µs | 10.3 µs |
| `normalize_tensor` | n=32 768 | 70.0 µs | 70.3 µs | 70.7 µs |
| `layer_normalize` | 32×16 | 3.50 µs | 3.62 µs | 3.96 µs |
| `layer_normalize` | 128×64 | 19.7 µs | 20.0 µs | 20.9 µs |
| `layer_normalize` | 512×128 | 115 µs | 117 µs | 130 µs |
| `layer_normalize` | 1 000×256 | 412 µs | 420 µs | 454 µs |
### GPU — Apple M3 Max (WGPU / Metal)

| Benchmark | Input | Min | Mean | Max |
|---|---|---|---|---|
| `model_forward` | 16s×10f [32]→2 | 408 µs | 617 µs | 894 µs |
| `model_forward` | 64s×50f [64]→2 | 430 µs | 481 µs | 776 µs |
| `model_forward` | 128s×50f [128]→2 | 432 µs | 475 µs | 576 µs |
| `model_forward` | 64s×100f [128,64]→3 | 549 µs | 688 µs | 1.82 ms |
| `model_forward` | 256s×100f [256,128]→2 | 631 µs | 691 µs | 828 µs |
| `model_forward` | 512s×100f [256,128]→2 | 926 µs | 1.08 ms | 1.42 ms |
| `normalize_tensor` | n=512 | 572 µs | 695 µs | 1.28 ms |
| `normalize_tensor` | n=4 096 | 590 µs | 662 µs | 821 µs |
| `normalize_tensor` | n=32 768 | 629 µs | 712 µs | 883 µs |
| `normalize_tensor` | n=262 144 | 1.08 ms | 1.12 ms | 1.22 ms |
| `layer_normalize` | 128×64 | 437 µs | 471 µs | 609 µs |
| `layer_normalize` | 512×128 | 467 µs | 500 µs | 648 µs |
| `layer_normalize` | 1 000×256 | 617 µs | 662 µs | 777 µs |
| `layer_normalize` | 4 000×512 | 1.81 ms | 1.93 ms | 2.15 ms |
### CPU vs GPU — Apple M3 Max

| Benchmark | Input | CPU | GPU | Speedup |
|---|---|---|---|---|
| `model_forward` | 16s×10f [32]→2 | 34.5 µs | 617 µs | 0.06× (CPU faster) |
| `model_forward` | 64s×50f [64]→2 | 34.2 µs | 481 µs | 0.07× (CPU faster) |
| `model_forward` | 128s×50f [128]→2 | 58.6 µs | 475 µs | 0.12× (CPU faster) |
| `model_forward` | 64s×100f [128,64]→3 | 80.3 µs | 688 µs | 0.12× (CPU faster) |
| `model_forward` | 256s×100f [256,128]→2 | 293 µs | 691 µs | 0.42× (CPU faster) |
| `normalize_tensor` | n=512 | 2.86 µs | 695 µs | 0.00× (CPU faster) |
| `normalize_tensor` | n=4 096 | 9.59 µs | 662 µs | 0.01× (CPU faster) |
| `normalize_tensor` | n=32 768 | 70.3 µs | 712 µs | 0.10× (CPU faster) |
| `layer_normalize` | 128×64 | 20.0 µs | 471 µs | 0.04× (CPU faster) |
| `layer_normalize` | 512×128 | 117 µs | 500 µs | 0.23× (CPU faster) |
| `layer_normalize` | 1 000×256 | 420 µs | 662 µs | 0.63× (CPU faster) |
Note: WGPU/Metal has a fixed dispatch overhead of ~400–700 µs per kernel call. For the small model sizes above, that overhead dominates. The GPU wins in full UMAP training loops over thousands of samples where operations are chained without intermediate CPU readbacks.
## Running benchmarks

```shell
# All benchmarks (hardware + comparison + MNIST)
./bench.sh

# + Criterion statistics (~5 min)
./bench.sh --criterion

# Just hardware micro-benchmarks
./bench.sh --only hardware
```
## Roadmap

- MNIST dataset example with intermediate plots
- Charting behind a feature flag
- Labels in plots
- Batching + accumulated gradient
- CubeCL kernels for distance computation
- Hyperparameter testbench (`patience` vs `n_features` vs `epochs` …)
- Unit tests (36) and hardware benchmarks
- New API mirroring umap-rs (`Umap`, `FittedUmap`, `UmapConfig`)
- Sparse training with edge subsampling + negative sampling
- Crate comparison benchmark (fast-umap vs umap-rs)
- PCA warm-start for initial embedding
- Approximate KNN (NN-descent) for datasets > 50K
## License

MIT — see LICENSE.

## Copyright

2024-2026, Eugene Hauptmann
## Citations

If you use fast-umap in research or a project, please cite the original UMAP paper, this repository, and acknowledge the Burn and CubeCL frameworks:

**fast-umap**
Hauptmann, E. (2024). fast-umap: GPU-Accelerated UMAP in Rust (v1.2.2). https://github.com/eugenehp/fast-umap

**UMAP algorithm**
McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426. https://arxiv.org/abs/1802.03426

**Burn deep-learning framework**

**CubeCL GPU compute framework**

## Thank you

Inspired by the original UMAP paper.

