# fast-umap
UMAP (Uniform Manifold Approximation and Projection) in Rust — GPU-accelerated via burn + CubeCL.
See docs.rs for the full API reference.
## What's New (v1.0.0)

| Area | Change |
|---|---|
| burn upgrade | 0.18 → 0.20.1; cubecl 0.6 → 0.9 |
| API fixes | `Backend::seed` now takes `(&device, seed)`; `as_tensor_arg` no longer takes a generic type param; `NodeID` renamed `NodeId` |
| Kernel fixes | `ABSOLUTE_POS_X`/`Y` cast to `usize` for shape indexing; all `for i in 0..k` loops use `k as usize` |
| GPU fix | `normalize_tensor` used `.to_vec::<bool>()`, which panics on WGPU (booleans are stored as u32); replaced with an f32 arithmetic comparison |
| Warning fixes | all 4 unused-`Result` warnings from kernel launches resolved with `.expect()` |
| Tests | 36 unit tests added (`cargo test`), all CPU-only via the NdArray backend |
| Benchmarks | `cargo run --release --bin bench_report` auto-detects CPU & GPU and writes hardware-tagged `.md` + `.svg` files plus a CPU vs GPU comparison |
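The `normalize_tensor` fix boils down to replacing a boolean device readback with arithmetic that stays in f32. A minimal CPU sketch of the pattern (not the crate's actual kernel code):

```rust
/// CPU sketch of the WGPU-safe normalization: the constant-input check is
/// done with an f32 epsilon guard rather than materializing a boolean
/// tensor and reading it back with `.to_vec::<bool>()`, which panics on
/// WGPU because booleans are stored as u32 on-device.
fn normalize_slice(data: &mut [f32]) {
    let min = data.iter().copied().fold(f32::INFINITY, f32::min);
    let max = data.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let range = max - min;
    // Pure-f32 comparison: a degenerate range collapses to a safe divisor.
    let safe = if range.abs() < f32::EPSILON { 1.0 } else { range };
    for v in data.iter_mut() {
        *v = (*v - min) / safe;
    }
}
```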
## Features

- Dimensionality Reduction — projects high-dimensional data to 2-D or 3-D for visualization
- GPU-accelerated kernels — custom CubeCL kernels for Euclidean pairwise distance and KNN, compiled for WGPU (Metal / Vulkan / DX12); a CPU reference sketch of the distance computation follows this list
- Automatic differentiation — full autograd through the custom kernels via burn's autodiff backend
- Flexible architecture — configurable hidden layers, output dims, distance metric, batch size, learning rate, early stopping, timeout
- CPU fallback — all model code runs on NdArray (no GPU required for inference or tests)
- 36 unit tests — covering normalization, tensor conversion, model shape, distance math
- Hardware-tagged benchmarks — CPU and GPU timings saved as Markdown + SVG, including a CPU vs GPU comparison chart
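For reference, this is what the Euclidean pairwise-distance kernel computes, written as a plain CPU loop nest. The shipped version is a CubeCL compute shader, so this is a readability sketch only:

```rust
/// Full N×N Euclidean pairwise-distance matrix over row-major data
/// (`data[i * d + k]` is feature k of sample i). CPU reference of the
/// GPU kernel's output, not the kernel itself.
fn pairwise_euclidean(data: &[f32], n: usize, d: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; n * n];
    for i in 0..n {
        for j in (i + 1)..n {
            let mut acc = 0.0f32;
            for k in 0..d {
                let diff = data[i * d + k] - data[j * d + k];
                acc += diff * diff;
            }
            let dist = acc.sqrt();
            out[i * n + j] = dist;
            out[j * n + i] = dist; // symmetry (checked by the unit tests)
        }
    }
    out
}
```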
## Roadmap

- MNIST dataset example with intermediate plots
- Charting behind a feature flag
- Labels in plots
- Batching + accumulated gradient
- CubeCL kernels for distance computation
- Hyperparameter testbench (`patience` vs `n_features` vs `epochs` …)
- Unit tests (36) and hardware benchmarks
- PCA warm-start for initial embedding (minimal sketch below)
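The planned PCA warm-start would seed the initial embedding from the data's top principal component(s) instead of random weights. A minimal power-iteration sketch of that computation (illustrative only, not crate code):

```rust
/// Top principal component via power iteration on the implicit covariance
/// matrix: v <- Cov·v, renormalized each step. Scaling by 1/n is absorbed
/// by the normalization.
fn top_component(data: &[Vec<f32>]) -> Vec<f32> {
    let d = data[0].len();
    let n = data.len() as f32;
    // Column means, for centering.
    let mut mean = vec![0.0f32; d];
    for row in data {
        for (m, v) in mean.iter_mut().zip(row) { *m += v / n; }
    }
    let mut v = vec![1.0f32; d];
    for _ in 0..100 {
        let mut next = vec![0.0f32; d];
        for row in data {
            // dot((row - mean), v), then accumulate (row - mean) * dot
            let dot: f32 = row.iter().zip(&mean).zip(&v)
                .map(|((x, m), vi)| (x - m) * vi)
                .sum();
            for ((nx, x), m) in next.iter_mut().zip(row).zip(&mean) {
                *nx += (x - m) * dot;
            }
        }
        let norm = next.iter().map(|x| x * x).sum::<f32>().sqrt().max(1e-12);
        for x in next.iter_mut() { *x /= norm; }
        v = next;
    }
    v
}
```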
## Testing

All tests run on CPU (`burn::backend::NdArray`) — no GPU required.

```sh
cargo test
```
| Category | What is covered |
|---|---|
| `normalize_data` | correctness, zero-mean/unit-std, constant columns |
| `format_duration` | zero, seconds, minutes, hours |
| `generate_test_data` | shape, bounds [0, 1) |
| `tensor_convert` | round-trip Vec → Tensor → Vec, NaN → 0 |
| `normalize_tensor` | output in [0, 1], constant-input safety |
| `layer_normalize` | no NaN, shape preserved |
| `UMAPModelConfigBuilder` | defaults and custom values |
| `TrainingConfig` | builder, `Metric::from(&str)`, invalid-metric panic |
| `UMAPModel` | 2-D / 3-D output, deep network, determinism, no NaN |
| Distance math | Euclidean self=0, 3-4-5 triangle, symmetry; Manhattan |
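As a flavour of the suite, a test along these lines covers the `generate_test_data` rows above (the module path and exact signature are assumptions here):

```rust
use fast_umap::utils::generate_test_data; // module path assumed

#[test]
fn generated_data_has_expected_shape_and_bounds() {
    let (n, d) = (100, 10);
    // Flat row-major Vec of n×d samples; return type assumed to be Vec<f64>.
    let data = generate_test_data(n, d);
    assert_eq!(data.len(), n * d);
    assert!(data.iter().all(|v| (0.0..1.0).contains(v))); // bounds [0, 1)
}
```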
## Benchmarks

Reproduce with: `cargo run --release --bin bench_report`

Full detail files: `cpu_apple_m3_max.md` · `gpu_apple_m3_max.md` · `comparison_apple_m3_max.md` · `cpu_apple_silicon_aarch64.md`
### CPU — Apple M3 Max (NdArray backend)

| Benchmark | Input | Min | Mean | Max |
|---|---|---|---|---|
| `normalize_data` | 100×10 | 329 µs | 450 µs | 797 µs |
| `normalize_data` | 500×30 | 1.88 ms | 2.22 ms | 2.98 ms |
| `normalize_data` | 1 000×50 | 3.68 ms | 4.64 ms | 5.71 ms |
| `normalize_data` | 5 000×100 | 15.8 ms | 17.9 ms | 19.4 ms |
| `generate_test_data` | 100×10 | 3.58 µs | 4.05 µs | 14.5 µs |
| `generate_test_data` | 500×30 | 56.0 µs | 57.9 µs | 72.5 µs |
| `generate_test_data` | 1 000×50 | 189 µs | 192 µs | 203 µs |
| `generate_test_data` | 5 000×100 | 1.89 ms | 1.91 ms | 1.97 ms |
| `tensor_convert` | 100×10 | 4.96 µs | 5.02 µs | 5.17 µs |
| `tensor_convert` | 500×30 | 29.7 µs | 31.2 µs | 40.6 µs |
| `tensor_convert` | 1 000×50 | 70.8 µs | 75.7 µs | 90.4 µs |
| `model_forward` | 16s×10f [32]→2 | 21.8 µs | 36.0 µs | 58.6 µs |
| `model_forward` | 64s×50f [64]→2 | 27.9 µs | 34.4 µs | 53.5 µs |
| `model_forward` | 128s×50f [128]→2 | 54.3 µs | 57.5 µs | 62.9 µs |
| `model_forward` | 64s×100f [128,64]→3 | 71.0 µs | 74.7 µs | 85.0 µs |
| `model_forward` | 256s×100f [256,128]→2 | 243 µs | 248 µs | 256 µs |
| `normalize_tensor` | n=64 | 1.96 µs | 2.03 µs | 2.17 µs |
| `normalize_tensor` | n=512 | 2.79 µs | 2.92 µs | 3.17 µs |
| `normalize_tensor` | n=4 096 | 9.50 µs | 9.70 µs | 14.6 µs |
| `normalize_tensor` | n=32 768 | 69.7 µs | 70.3 µs | 81.1 µs |
| `layer_normalize` | 32×16 | 3.50 µs | 3.61 µs | 3.75 µs |
| `layer_normalize` | 128×64 | 17.9 µs | 18.1 µs | 19.1 µs |
| `layer_normalize` | 512×128 | 115 µs | 117 µs | 126 µs |
| `layer_normalize` | 1 000×256 | 413 µs | 416 µs | 427 µs |
### CPU — Apple Silicon aarch64 (NdArray backend)

| Benchmark | Input | Min | Mean | Max |
|---|---|---|---|---|
| `normalize_data` | 100×10 | 8.63 ms | 9.85 ms | 11.4 ms |
| `normalize_data` | 500×30 | 36.8 ms | 41.5 ms | 51.4 ms |
| `normalize_data` | 1 000×50 | 67.2 ms | 72.8 ms | 78.9 ms |
| `normalize_data` | 5 000×100 | 173 ms | 178 ms | 183 ms |
| `generate_test_data` | 100×10 | 5.08 µs | 5.82 µs | 10.4 µs |
| `generate_test_data` | 500×30 | 60.8 µs | 62.2 µs | 75.8 µs |
| `generate_test_data` | 1 000×50 | 198 µs | 204 µs | 269 µs |
| `generate_test_data` | 5 000×100 | 2.26 ms | 2.59 ms | 3.20 ms |
| `tensor_convert` | 100×10 | 20.3 µs | 22.6 µs | 41.5 µs |
| `tensor_convert` | 500×30 | 129 µs | 138 µs | 159 µs |
| `tensor_convert` | 1 000×50 | 328 µs | 354 µs | 404 µs |
| `model_forward` | 16s×10f [32]→2 | 34.7 µs | 121 µs | 2.40 ms |
| `model_forward` | 64s×50f [64]→2 | 45.7 µs | 111 µs | 1.35 ms |
| `model_forward` | 128s×50f [128]→2 | 162 µs | 263 µs | 934 µs |
| `model_forward` | 64s×100f [128,64]→3 | 291 µs | 315 µs | 362 µs |
| `model_forward` | 256s×100f [256,128]→2 | 790 µs | 857 µs | 1.03 ms |
| `normalize_tensor` | n=64 | 4.46 µs | 6.10 µs | 21.6 µs |
| `normalize_tensor` | n=512 | 3.54 µs | 4.16 µs | 32.1 µs |
| `normalize_tensor` | n=4 096 | 12.2 µs | 13.8 µs | 27.8 µs |
| `normalize_tensor` | n=32 768 | 176 µs | 195 µs | 227 µs |
| `layer_normalize` | 32×16 | 4.08 µs | 4.95 µs | 21.0 µs |
| `layer_normalize` | 128×64 | 64.8 µs | 75.8 µs | 101 µs |
| `layer_normalize` | 512×128 | 370 µs | 410 µs | 455 µs |
| `layer_normalize` | 1 000×256 | 1.44 ms | 1.56 ms | 1.90 ms |
### GPU — Apple M3 Max (WGPU / Metal)

| Benchmark | Input | Min | Mean | Max |
|---|---|---|---|---|
| `model_forward` | 16s×10f [32]→2 | 419 µs | 565 µs | 770 µs |
| `model_forward` | 64s×50f [64]→2 | 424 µs | 466 µs | 585 µs |
| `model_forward` | 128s×50f [128]→2 | 424 µs | 474 µs | 551 µs |
| `model_forward` | 64s×100f [128,64]→3 | 565 µs | 637 µs | 825 µs |
| `model_forward` | 256s×100f [256,128]→2 | 638 µs | 675 µs | 724 µs |
| `model_forward` | 512s×100f [256,128]→2 | 921 µs | 1.10 ms | 1.32 ms |
| `normalize_tensor` | n=512 | 560 µs | 660 µs | 957 µs |
| `normalize_tensor` | n=4 096 | 555 µs | 650 µs | 828 µs |
| `normalize_tensor` | n=32 768 | 642 µs | 711 µs | 844 µs |
| `normalize_tensor` | n=262 144 | 1.06 ms | 1.13 ms | 1.23 ms |
| `layer_normalize` | 128×64 | 365 µs | 419 µs | 648 µs |
| `layer_normalize` | 512×128 | 407 µs | 503 µs | 1.46 ms |
| `layer_normalize` | 1 000×256 | 568 µs | 715 µs | 1.67 ms |
| `layer_normalize` | 4 000×512 | 1.75 ms | 2.12 ms | 3.01 ms |
### CPU vs GPU — Apple M3 Max

| Benchmark | Input | CPU | GPU | Speedup |
|---|---|---|---|---|
| `model_forward` | 16s×10f [32]→2 | 36.0 µs | 565 µs | 0.06× (CPU faster) |
| `model_forward` | 64s×50f [64]→2 | 34.4 µs | 466 µs | 0.07× (CPU faster) |
| `model_forward` | 128s×50f [128]→2 | 57.5 µs | 474 µs | 0.12× (CPU faster) |
| `model_forward` | 64s×100f [128,64]→3 | 74.7 µs | 637 µs | 0.12× (CPU faster) |
| `model_forward` | 256s×100f [256,128]→2 | 248 µs | 675 µs | 0.37× (CPU faster) |
| `normalize_tensor` | n=512 | 2.92 µs | 660 µs | 0.00× (CPU faster) |
| `normalize_tensor` | n=4 096 | 9.70 µs | 650 µs | 0.01× (CPU faster) |
| `normalize_tensor` | n=32 768 | 70.3 µs | 711 µs | 0.10× (CPU faster) |
| `layer_normalize` | 128×64 | 18.1 µs | 419 µs | 0.04× (CPU faster) |
| `layer_normalize` | 512×128 | 117 µs | 503 µs | 0.23× (CPU faster) |
| `layer_normalize` | 1 000×256 | 416 µs | 715 µs | 0.58× (CPU faster) |
### Why is CPU faster for these sizes?

WGPU/Metal has a fixed dispatch overhead of ~400–700 µs per kernel call. For the small model sizes benchmarked above, that overhead dominates the actual compute. The GPU wins when:

- Batch sizes are large (≥ 10 000 samples per forward pass)
- Models are wide/deep (≥ 1 024 hidden units)
- Operations are chained without intermediate CPU readbacks
- The custom `CubeBackend` kernels run on large N×N pairwise distance matrices (cost grows as O(N²), quickly justifying the dispatch overhead)

In a full UMAP training loop over thousands of samples the GPU path is significantly faster; the numbers above reflect isolated micro-benchmarks with a forced CPU↔GPU sync after every operation.
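A back-of-envelope check of the break-even point, using the ~500 µs midpoint of the dispatch range above and an assumed effective CPU throughput (all constants here are illustrative, not measurements from this crate):

```rust
/// Rough break-even estimate: a GPU kernel pays off once the equivalent
/// CPU compute time exceeds the fixed dispatch overhead.
fn main() {
    let dispatch_us = 500.0_f64; // midpoint of the ~400–700 µs range above
    let d = 100_u64;             // feature dimension (assumed)
    let ns_per_op = 0.1_f64;     // ~10 GFLOP/s effective CPU throughput (assumed)

    for n in [100_u64, 1_000, 10_000] {
        let ops = n * n * d; // O(N²·d) multiply-adds for pairwise distances
        let cpu_us = ops as f64 * ns_per_op / 1_000.0;
        let verdict = if cpu_us > dispatch_us { "GPU wins" } else { "CPU wins" };
        println!("N = {n:>6}: ~{cpu_us:.0} µs CPU compute vs {dispatch_us} µs dispatch -> {verdict}");
    }
}
```

With these assumptions the crossover lands between N = 100 and N = 1 000 samples, which matches the observation that large pairwise matrices justify the dispatch cost while micro-benchmarks do not.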
### Running benchmarks on your Mac

```sh
# 1. Prerequisites
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh   # Rust
xcode-select --install                                           # Metal / Xcode CLT

# 2. Run — writes hardware-tagged files to benches/results/
cargo run --release --bin bench_report

# 3. Open the comparison chart
open benches/results/comparison.svg
```

Or use the convenience script:

```sh
./bench.sh             # hardware report only (~1 min)
./bench.sh --criterion # + Criterion stats (~5 min)
```
Output files written automatically:

```text
benches/results/
  cpu_<chip>.md / .svg          CPU timings
  gpu_<chip>.md / .svg          GPU timings (when GPU available)
  comparison_<chip>.md / .svg   CPU vs GPU, speedup ratios
  benchmark_results.md / .svg   copy of latest CPU run
```
## Installation

```sh
cargo add fast-umap
```

Cargo.toml:

```toml
[dependencies]
fast-umap = "1.0.0"
burn = { version = "0.20.1", features = ["wgpu", "autodiff", "autotune"] }
cubecl = { version = "0.9.0", features = ["wgpu"] }
```
## Quick start

```rust
use cubecl::wgpu::WgpuRuntime; // import path assumed
use fast_umap::prelude::*;     // import path assumed
```

### Model Configuration

```rust
// Argument values are illustrative.
let model_config = UMAPModelConfigBuilder::default()
    .input_size(100)             // input dimensionality
    .hidden_sizes(vec![128, 64]) // one or more hidden layers
    .output_size(2)              // 2-D or 3-D embedding
    .build()
    .unwrap();
```
## Training

```rust
use cubecl::wgpu::WgpuRuntime; // import path assumed
use fast_umap::prelude::*;     // import path assumed

type MyBackend = CubeBackend<WgpuRuntime, f32, i32, u32>; // generic params assumed
type MyAutodiffBackend = Autodiff<MyBackend>;

let device = Default::default();
MyBackend::seed(&device, 9999); // v1.0.0 API: seed takes (&device, seed)

let num_samples = 1_000;
let num_features = 100;
let train_data: Vec<f64> = generate_test_data(num_samples, num_features);

let model: UMAPModel<MyAutodiffBackend> = UMAPModel::new(&model_config, &device); // constructor args assumed

// All values below are illustrative; defaults are listed in the parameter table further down.
let config = TrainingConfig::builder()
    .with_epochs(100)
    .with_batch_size(1_000)
    .with_learning_rate(1e-3)
    .with_beta1(0.9)
    .with_beta2(0.999)
    .with_k_neighbors(10)
    .with_metric(Metric::from("euclidean")) // metric name string assumed
    .with_verbose(true)
    .with_patience(10)
    .with_min_desired_loss(0.001)
    .with_timeout(60)
    .build()
    .unwrap();

// Channel + Ctrl-C handler so training can be interrupted cleanly.
let (tx, rx) = unbounded();
ctrlc::set_handler(move || { let _ = tx.send(()); }).unwrap();

let (model, _losses) = train(model, num_samples, num_features, train_data, &config, rx); // signature assumed

// Inference
let model = model.valid(); // drop autodiff: burn's AutodiffModule::valid()
let input = convert_vector_to_tensor(
    generate_test_data(num_samples, num_features), // fresh data for inference
    num_samples, num_features, &device,            // signature assumed
);
let embedding = model.forward(input); // shape [num_samples, 2]
```
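The `unbounded` channel and `set_handler` calls suggest the original example uses the `crossbeam-channel` and `ctrlc` crates (an assumption from the identifiers) to stop training gracefully on Ctrl-C; add those crates as dependencies if you reproduce the snippet outside the repository.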
### Key training parameters

| Parameter | Default | Description |
|---|---|---|
| `epochs` | 100 | Training epochs |
| `batch_size` | 1 000 | Samples per batch |
| `learning_rate` | 1e-3 | Adam step size |
| `beta1` / `beta2` | 0.9 / 0.999 | Adam moment decay |
| `penalty` | 0.0 | L2 weight decay |
| `k_neighbors` | 10 | KNN graph neighbours |
| `metric` | `Euclidean` | `Euclidean` · `EuclideanKNN` · `Manhattan` · `Cosine` |
| `patience` | `None` | Early-stop epochs without improvement |
| `min_desired_loss` | `None` | Stop when loss ≤ threshold |
| `timeout` | `None` | Hard time limit (seconds) |
| `verbose` | `true` | Progress bar |
## Examples

### Simple — random data, 2-D embedding

```sh
cargo run --example simple
```

Full source in `examples/simple.rs`.

### Advanced — 1 000 samples, 100 features, custom config

```sh
cargo run --example advanced
```

Full source in `examples/advanced.rs`.
### MNIST — 70 000 hand-written digits projected to 2-D

```sh
cargo run --example mnist
```

Downloads the MNIST dataset on first run (~12 MB), trains UMAP, and saves:

| Output | File |
|---|---|
| 2-D digit embedding (coloured by class) | `mnist.png` |
| MNIST loss curve | `losses_mnist.png` |
### Regenerating example images

The images above are generated by the examples themselves (GPU required). To refresh them on your Mac:

```sh
# Regenerate all example output images
./generate_images.sh
```

Or individually:

```sh
cargo run --release --example simple   # → plot.png
cargo run --release --example advanced # → plot.png + losses.png
cargo run --release --example mnist    # → mnist.png + losses_mnist.png
```
## License

MIT — see LICENSE.

## Copyright

2024-2026, Eugene Hauptmann
## Citations

If you use fast-umap in research or a project, please cite the original UMAP paper, this repository, and acknowledge the Burn and CubeCL frameworks:

fast-umap
Hauptmann, E. (2024). fast-umap: GPU-Accelerated UMAP in Rust (v1.0.0). https://github.com/eugenehp/fast-umap

UMAP algorithm
McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426. https://arxiv.org/abs/1802.03426

Burn deep-learning framework
https://github.com/tracel-ai/burn

CubeCL GPU compute framework
https://github.com/tracel-ai/cubecl

## Thank you

Inspired by the original UMAP paper.