
fast-umap

GPU-accelerated parametric UMAP (Uniform Manifold Approximation and Projection) in Rust, built on burn + CubeCL.

See docs.rs for the full API reference.


Highlights

  • Up to 4.7× faster than umap-rs on datasets ≥ 10 000 samples (see benchmarks below)
  • Parametric — trains a neural network, so you can transform() new unseen data instantly
  • GPU-accelerated — custom CubeCL kernels for pairwise distance and KNN, compiled for Metal / Vulkan / DX12 via WGPU
  • API mirrors umap-rs — drop-in replacement with Umap::new(config).fit(data)
  • Automatic differentiation — full autograd through custom GPU kernels
  • CPU fallback — runs on NdArray backend (no GPU required for inference or tests)

Performance — fast-umap vs umap-rs

Benchmarked on Apple M3 Max. Both crates receive identical random data. fast-umap runs 50 epochs (parametric, GPU); umap-rs runs 200 epochs (classical SGD, CPU). Total time includes data prep + KNN + fit + extract.

Crate comparison chart

Dataset fast-umap umap-rs Speedup
500 × 50 0.84s 0.08s 0.10× (umap-rs faster)
1 000 × 50 2.19s 0.12s 0.05× (umap-rs faster)
2 000 × 100 3.65s 0.44s 0.12× (umap-rs faster)
5 000 × 100 6.75s 2.31s 0.34× (umap-rs faster)
10 000 × 100 5.93s 8.68s 1.5× faster 🚀
20 000 × 100 7.32s 34.10s 4.7× faster 🚀

Crossover ≈ 10 000 samples. Below that, umap-rs wins on raw CPU efficiency. Above it, fast-umap pulls ahead and the gap widens with dataset size — umap-rs's brute-force KNN scales O(n²), while fast-umap's per-epoch cost is capped.

Why fast-umap wins at scale:

Technique Effect
Sparse edge-based loss O(n·k) per epoch instead of O(n²)
Edge subsampling Caps edges/epoch at 50K regardless of n
Pre-batched negative samples on GPU Zero per-epoch CPU→GPU transfers
Fused index gather 2 GPU select() calls instead of 4
Async loss readback GPU→CPU sync every 5 epochs, not every epoch
In-memory checkpointing No disk I/O during training
GPU-accelerated KNN Full n×n pairwise distance on GPU (one-time cost)
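The first two techniques carry the scaling argument, and a tiny cost model makes it concrete. This is an illustration only, not the crate's internals; the k = 15 default and the 50 K cap are taken from the text above:

```rust
// Illustrative per-epoch cost model, not fast-umap's internals:
// dense training touches all n^2 pairs, sparse training touches
// at most min(n * k, cap) sampled k-NN edges.
fn dense_pairs(n: u64) -> u64 {
    n * n
}

fn sparse_edges(n: u64, k: u64, cap: u64) -> u64 {
    (n * k).min(cap)
}

fn main() {
    let k = 15; // n_neighbors default
    let cap = 50_000; // edge subsampling cap from the table above
    for n in [1_000u64, 10_000, 20_000] {
        println!(
            "n = {:>6}: dense pairs = {:>11}, sparse edges = {:>6}",
            n,
            dense_pairs(n),
            sparse_edges(n, k, cap)
        );
    }
}
```

At n = 20 000 the dense path already needs 400 million pair evaluations per epoch, while the sparse path stays at the 50 000-edge cap.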

Reproduce:

./bench.sh --only comparison

Or run all benchmarks at once (hardware + comparison + MNIST):

./bench.sh

What's New (v1.2.2)

See CHANGELOG.md for the full release history.

Area Change
GPU cooldown New cooldown_ms parameter — sleep N ms between epochs to prevent 100 % GPU utilisation; default 0 (no change to existing behaviour)
UMAP kernel Proper q = 1/(1 + a·d^(2b)) kernel with a, b fitted from min_dist/spread — replaces fixed Student-t 1/(1+d²) for better cluster separation
Configurable negative sampling neg_sample_rate parameter (default 5); negative-sample count fixed from n_pos × rate / k to n_pos × rate
Verbose logging All training output gated behind verbose flag; improved structured messages with timings, edge counts, kernel params, stop reasons
ManifoldParams min_dist and spread now actively shape the embedding kernel (previously defined but unused)
New API Umap::new(config).fit(data) returns FittedUmap with .embedding(), .transform(), .into_embedding() — mirrors umap-rs
Sparse training O(n·k) per epoch with edge subsampling + configurable negative sampling (was O(n²))
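The a, b fit mentioned above can be illustrated with a crude grid search: pick a and b so that the kernel 1/(1 + a·d^(2b)) tracks the target decay exp(−(d − min_dist)/spread) for d > min_dist. This mirrors how the reference umap-learn implementation fits them by least squares; it is a sketch of the idea, not fast-umap's actual fitter:

```rust
// Sketch: fit a, b from min_dist / spread by minimising squared error
// against the target decay curve on d in (0, 3 * spread].
// Illustration of the idea only — not fast-umap's fitter.
fn target(d: f64, min_dist: f64, spread: f64) -> f64 {
    if d <= min_dist { 1.0 } else { (-(d - min_dist) / spread).exp() }
}

fn kernel(d: f64, a: f64, b: f64) -> f64 {
    1.0 / (1.0 + a * d.powf(2.0 * b))
}

fn fit_ab(min_dist: f64, spread: f64) -> (f64, f64) {
    // 300 sample distances up to 3 * spread, as in the reference fit.
    let ds: Vec<f64> = (1..=300).map(|i| i as f64 * 0.01 * spread).collect();
    let (mut best_a, mut best_b, mut best_err) = (1.0, 1.0, f64::INFINITY);
    let mut a = 0.05;
    while a < 5.0 {
        let mut b = 0.05;
        while b < 2.0 {
            let err: f64 = ds
                .iter()
                .map(|&d| (kernel(d, a, b) - target(d, min_dist, spread)).powi(2))
                .sum();
            if err < best_err {
                best_a = a;
                best_b = b;
                best_err = err;
            }
            b += 0.01;
        }
        a += 0.01;
    }
    (best_a, best_b)
}

fn main() {
    // For the defaults min_dist = 0.1, spread = 1.0 the reference fit
    // is roughly a ≈ 1.58, b ≈ 0.90.
    let (a, b) = fit_ab(0.1, 1.0);
    println!("a ≈ {a:.2}, b ≈ {b:.2}");
}
```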

Features

  • Dimensionality Reduction — projects high-dimensional data to 2-D or 3-D for visualization
  • Parametric model — learned neural network can project new, unseen data via transform()
  • GPU-accelerated kernels — custom CubeCL kernels for Euclidean pairwise distance and KNN, compiled for WGPU (Metal / Vulkan / DX12)
  • Automatic differentiation — full autograd through the custom kernels via burn's autodiff backend
  • Sparse training — edge subsampling + negative sampling keeps per-epoch cost constant regardless of dataset size
  • Flexible architecture — configurable hidden layers, output dims, distance metric, learning rate, early stopping, timeout
  • CPU fallback — all model code runs on NdArray (no GPU required for inference or tests)
  • 36 unit tests — covering normalization, tensor conversion, model shape, distance math
  • Hardware-tagged benchmarks — CPU and GPU timings saved as Markdown + SVG, including a CPU vs GPU comparison chart

Installation

cargo add fast-umap

Cargo.toml:

[dependencies]
fast-umap  = "1.2.2"
burn       = { version = "0.20.1", features = ["wgpu", "autodiff", "autotune"] }
cubecl     = { version = "0.9.0",  features = ["wgpu"] }

Quick Start

use cubecl::wgpu::WgpuRuntime;
use fast_umap::prelude::*;

type MyBackend = burn::backend::wgpu::CubeBackend<WgpuRuntime, f32, i32, u32>;
type MyAutodiffBackend = burn::backend::Autodiff<MyBackend>;

// 100 samples × 10 features
let data: Vec<Vec<f64>> = generate_test_data(100, 10)
    .chunks(10)
    .map(|c| c.iter().map(|&x: &f32| x as f64).collect())
    .collect();

// Fit UMAP (default: 2-D output, Euclidean distance)
let config = UmapConfig::default();
let umap = fast_umap::Umap::<MyAutodiffBackend>::new(config);
let fitted = umap.fit(data.clone(), None);

// Get embedding
let embedding = fitted.embedding();
println!("Shape: {} × {}", embedding.len(), embedding[0].len());

Backend Choice: CPU vs GPU with Feature-Based Compilation

fast-umap supports both CPU and GPU backends; Cargo features select which ones are compiled in:

GPU Backend (WGPU) - Primary Recommended Backend

For GPU-accelerated execution (requires WGPU-compatible GPU):

use cubecl::wgpu::WgpuRuntime;
use fast_umap::prelude::*;

type MyBackend = burn::backend::wgpu::CubeBackend<WgpuRuntime, f32, i32, u32>;
type MyAutodiffBackend = burn::backend::Autodiff<MyBackend>;

let config = UmapConfig::default();
let umap = fast_umap::Umap::<MyAutodiffBackend>::new(config);
let fitted = umap.fit(data, None);

Features:

  • ✅ Full parametric UMAP with neural network training
  • ✅ GPU acceleration via WGPU (Metal/Vulkan/DX12)
  • ✅ Transform new data through trained model
  • ✅ Custom GPU kernels for efficient computation
  • ✅ Automatic differentiation for training

CPU Backend - Full Functionality with umap-rs Fallback

For CPU-only execution or when GPU is not available, fast-umap provides a complete CPU backend that uses classical UMAP computation:

#[cfg(feature = "cpu")]
{
    use fast_umap::cpu_backend::api as cpu_api;
    
    // Full CPU UMAP with umap-rs backend
    let config = UmapConfig::default();
    let fitted = cpu_api::fit_cpu(config, data, None);
    let embedding = fitted.embedding();
}

Features:

  • ✅ Complete UMAP functionality using classical UMAP algorithm
  • ✅ No GPU required
  • ✅ Same API as GPU backend for consistency
  • ✅ Full configuration support (n_components, n_neighbors, etc.)
  • ✅ Excellent for environments without GPU access
  • ❌ Cannot transform new data (classical UMAP limitation)

When to use CPU backend:

  • Development and testing environments without GPU
  • Cloud environments with CPU-only instances
  • Edge devices without GPU acceleration
  • Fallback when GPU drivers are unavailable

See cpu_training_demo example for complete CPU usage.

Feature-Based Compilation

fast-umap uses Cargo features to enable only the backends you need:

# GPU backend (default, includes gpu + verbose features)
cargo build --release

# CPU-only backend
cargo build --release --features cpu

# Minimal build (no backends, library mode only)
cargo build --release --no-default-features

# All features (development)
cargo build --release --features all

See feature_demo example for complete feature matrix.

Runtime Backend Selection

The backend_choice example demonstrates runtime backend selection:

# Run with GPU backend
cargo run --release --example backend_choice gpu

# Run with CPU backend  
cargo run --release --features cpu --example backend_choice cpu

CPU Testing Examples

For working CPU examples, see the test suite:

# Run CPU-based tests
cargo test --lib

The tests in tests/tests.rs demonstrate NdArray backend usage for CPU computation.


API Overview

The public API mirrors umap-rs:

Type Description
Umap<B> Main algorithm struct — Umap::new(config)
FittedUmap<B> Fitted model — .embedding(), .transform(), .into_embedding(), .config()
UmapConfig Configuration with nested GraphParams + OptimizationParams
ManifoldParams min_dist, spread — control cluster tightness and separation
GraphParams n_neighbors, metric, normalized, minkowski_p
OptimizationParams n_epochs, learning_rate, patience, timeout, verbose, neg_sample_rate, …
Metric Euclidean, EuclideanKNN, Manhattan, Cosine

Configuration

let config = UmapConfig {
    n_components: 2,
    hidden_sizes: vec![128],
    graph: GraphParams {
        n_neighbors: 15,
        metric: Metric::Euclidean,
        ..Default::default()
    },
    optimization: OptimizationParams {
        n_epochs: 200,
        learning_rate: 1e-3,
        patience: Some(50),
        verbose: true,
        ..Default::default()
    },
    ..Default::default()
};

Key parameters

Parameter Default Description
n_components 2 Output dimensionality (2-D or 3-D)
hidden_sizes [100, 100, 100] Neural network hidden layer sizes
min_dist 0.1 Min distance in embedding — smaller = tighter clusters
spread 1.0 Effective scale of embedded points
n_neighbors 15 KNN graph neighbours
n_epochs 200 Training epochs
learning_rate 1e-3 Adam step size
batch_size 1 000 Samples per batch
penalty 0.0 L2 weight decay
metric Euclidean Distance metric
repulsion_strength 1.0 Repulsion term weight
neg_sample_rate 5 Negative (repulsion) samples per positive edge per epoch
cooldown_ms 0 Sleep N ms between epochs to reduce GPU utilisation (0 = disabled)
patience None Early-stop epochs without improvement
min_desired_loss None Stop when loss ≤ threshold
timeout None Hard time limit (seconds)
verbose true Progress bar + loss plots
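How patience, min_desired_loss, and timeout interact can be sketched as follows. This is a schematic of the stop conditions in the table above, not the crate's training loop, and the names StopTracker / check are illustrative:

```rust
// Schematic early-stopping logic combining the three stop conditions
// from the parameter table. Illustrative only — not fast-umap's code.
struct StopTracker {
    best_loss: f64,
    epochs_since_best: usize,
}

impl StopTracker {
    fn new() -> Self {
        Self { best_loss: f64::INFINITY, epochs_since_best: 0 }
    }

    fn check(
        &mut self,
        loss: f64,
        patience: Option<usize>,
        min_desired_loss: Option<f64>,
        elapsed_secs: u64,
        timeout_secs: Option<u64>,
    ) -> Option<&'static str> {
        if loss < self.best_loss {
            self.best_loss = loss;
            self.epochs_since_best = 0;
        } else {
            self.epochs_since_best += 1;
        }
        if min_desired_loss.is_some_and(|t| loss <= t) {
            return Some("reached min_desired_loss");
        }
        if patience.is_some_and(|p| self.epochs_since_best >= p) {
            return Some("patience exhausted");
        }
        if timeout_secs.is_some_and(|t| elapsed_secs >= t) {
            return Some("timeout");
        }
        None
    }
}

fn main() {
    let mut tracker = StopTracker::new();
    // Loss stops improving after epoch 1; patience = 3 fires at epoch 4.
    let losses = [1.0, 0.5, 0.6, 0.6, 0.6];
    for (epoch, &loss) in losses.iter().enumerate() {
        if let Some(reason) = tracker.check(loss, Some(3), None, epoch as u64, Some(60)) {
            println!("stopped at epoch {epoch}: {reason}");
            break;
        }
    }
}
```

With all three options set to None, training simply runs for the full n_epochs.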

Transform new data

Because fast-umap is parametric (neural network), it can project new unseen data — something classical UMAP cannot do:

let fitted = umap.fit(training_data, None);

// Project new data through the trained model
let new_embedding = fitted.transform(new_data);

Examples

CPU Capabilities Demo

Demonstrates CPU backend capabilities for utility functions and tensor operations:

cargo run --release --features cpu --example cpu_training_demo

Backend Choice — CPU vs GPU selection

Demonstrates how to choose between CPU (NdArray) and GPU (WGPU) backends at runtime:

# Run with GPU backend (default)
cargo run --release --example backend_choice gpu

# Run with CPU backend  
cargo run --release --example backend_choice cpu

Note about CPU backend: The CPU option shows backend selection structure. For utility functions, see cpu_training_demo. The GPU backend provides full parametric UMAP with neural network training and the ability to transform new data.

Simple — random data, 2-D embedding

cargo run --release --example simple

use cubecl::wgpu::WgpuRuntime;
use fast_umap::prelude::*;
use rand::Rng;

fn main() {
    type MyBackend = burn::backend::wgpu::CubeBackend<WgpuRuntime, f32, i32, u32>;
    type MyAutodiffBackend = burn::backend::Autodiff<MyBackend>;

    let data: Vec<Vec<f64>> = (0..100 * 3)
        .map(|_| rand::rng().random::<f64>())
        .collect::<Vec<_>>()
        .chunks_exact(3)
        .map(|c| c.to_vec())
        .collect();

    let config = UmapConfig::default();
    let umap = fast_umap::Umap::<MyAutodiffBackend>::new(config);
    let fitted = umap.fit(data.clone(), None);

    let embedding = fitted.embedding();
    println!("Embedding shape: {} × {}", embedding.len(), embedding[0].len());

    // Transform new unseen data through the trained model
    let _new_embedding = fitted.transform(data);
}

Advanced — 1 000 samples, custom config

cargo run --release --example advanced

use cubecl::wgpu::WgpuRuntime;
use fast_umap::prelude::*;

fn main() {
    type MyBackend = burn::backend::wgpu::CubeBackend<WgpuRuntime, f32, i32, u32>;
    type MyAutodiffBackend = burn::backend::Autodiff<MyBackend>;

    let train_data: Vec<f32> = generate_test_data(1000, 100);
    let data: Vec<Vec<f32>> = train_data.chunks(100).map(|c| c.to_vec()).collect();

    let config = UmapConfig {
        n_components: 2,
        hidden_sizes: vec![100, 100, 100],
        graph: GraphParams {
            n_neighbors: 10,
            metric: Metric::Euclidean,
            ..Default::default()
        },
        optimization: OptimizationParams {
            n_epochs: 100,
            batch_size: 1000,
            learning_rate: 0.001,
            verbose: true,
            min_desired_loss: Some(0.001),
            timeout: Some(60),
            ..Default::default()
        },
        ..Default::default()
    };

    let umap = fast_umap::Umap::<MyAutodiffBackend>::new(config);
    let fitted = umap.fit(data, None);

    println!("Embedding: {} samples × {} dims",
        fitted.embedding().len(), fitted.embedding()[0].len());
}

MNIST — 10 000 hand-written digits projected to 2-D

cargo run --release --example mnist         # quick run (no figures)
./bench.sh --only mnist                     # generate figures

Downloads the MNIST dataset on first run (~12 MB), trains UMAP on 10K digits (784 features → 2-D).

Figures: 2-D digit embedding (coloured by class) and the training loss curve.

let config = UmapConfig {
    n_components: 2,
    hidden_sizes: vec![256],
    graph: GraphParams {
        n_neighbors: 15,
        metric: Metric::Euclidean,
        ..Default::default()
    },
    optimization: OptimizationParams {
        n_epochs: 1_000,
        learning_rate: 1e-3,
        patience: Some(100),
        repulsion_strength: 3.0,
        penalty: 1e-6,
        verbose: true,
        ..Default::default()
    },
    ..Default::default()
};

let umap = fast_umap::Umap::<MyAutodiffBackend>::new(config);
let fitted = umap.fit(data, Some(labels));

Generating Figures

All figures are generated by the benchmark suite, not by the examples. Examples are lightweight smoke tests that verify correctness without writing any files.

# Run all benchmarks and generate all figures
./bench.sh

# Run only specific benchmarks
./bench.sh --only comparison          # fast-umap vs umap-rs chart
./bench.sh --only mnist               # MNIST embedding + loss curve
./bench.sh --only hardware            # CPU + GPU micro-benchmarks
./bench.sh --only mnist comparison    # combine multiple

# Skip MNIST (saves ~70s)
./bench.sh --skip-mnist

# Include Criterion statistical suite
./bench.sh --criterion

What generates what

Benchmark Output
comparison figures/crate_comparison.{svg,json,md}
mnist figures/mnist.png, figures/losses_model.png
hardware benches/results/{cpu,gpu,comparison}_*.{svg,md}
criterion target/criterion/ (HTML reports)

Run examples (no figures)

./run_all.sh                  # all examples
./run_all.sh --skip-mnist     # skip MNIST download


Developer Experience Improvements

The feature-based compilation system provides:

✅ Faster Compilation

  • Compile only what you need: cargo build --release --features cpu
  • Minimal builds: cargo build --release --no-default-features
  • Feature-specific examples and documentation

✅ Smaller Binaries

  • GPU-only: exclude CPU dependencies (save ~5MB)
  • CPU-only: exclude GPU dependencies (save ~10MB)
  • Library mode: minimal footprint for integration

✅ Clear Feature Matrix

Feature Description When to Use
gpu GPU backend (WGPU) Production, large datasets
cpu CPU backend (umap-rs) CPU-only environments
verbose Progress output Development, debugging
plotters Visualization Exploration, analysis
all Everything Development, testing

✅ Better Error Messages

  • Clear feature requirements at compile time
  • Graceful runtime fallback handling
  • Comprehensive validation

See feature_demo example for complete feature guide.


Architecture

fast-umap uses a parametric approach — a small feed-forward neural network is trained with the UMAP cross-entropy loss:

attraction  =  mean_{k-NN edges}      [ −log q_ij ]
repulsion   =  mean_{negative samples} [ −log (1 − q_ij) ]
loss        =  attraction  +  repulsion_strength × repulsion

where q_ij = 1 / (1 + a · d_ij^(2b)) is the UMAP kernel applied to embedding distances (a and b are fitted from min_dist / spread).
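Plugging numbers into these formulas shows how the two terms behave (illustrative only; a ≈ 1.58, b ≈ 0.90 is the approximate fit for the defaults min_dist = 0.1, spread = 1.0):

```rust
// The UMAP kernel and the two loss terms from the formulas above.
// Illustrative values only — not fast-umap's code.
fn q(d: f64, a: f64, b: f64) -> f64 {
    1.0 / (1.0 + a * d.powf(2.0 * b))
}

fn main() {
    let (a, b) = (1.58, 0.90); // approximate fit for min_dist = 0.1, spread = 1.0

    // A k-NN edge embedded close by: q near 1, so attraction -ln q is small.
    let d_edge = 0.1;
    let attraction = -q(d_edge, a, b).ln();

    // A negative sample embedded far away: q near 0, so repulsion -ln(1 - q) is small.
    let d_neg = 5.0;
    let repulsion = -(1.0 - q(d_neg, a, b)).ln();

    println!("q(edge) = {:.3}, attraction = {:.3}", q(d_edge, a, b), attraction);
    println!("q(neg)  = {:.3}, repulsion  = {:.3}", q(d_neg, a, b), repulsion);
}
```

Both terms blow up when the geometry is wrong — attraction when a neighbour is embedded far away, repulsion when a negative lands close — and that gradient is what shapes the layout.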

Training pipeline

Input data [n, features]
    │
    ▼
GPU pairwise distance → KNN graph (one-time O(n²) cost)
    │
    ▼
┌─── Per epoch (cost: O(min(n·k, 50K))) ───┐
│  Forward pass: data → neural net → [n, 2] │
│  Edge subsampling from KNN graph           │
│  Negative sampling (random non-neighbors)  │
│  UMAP cross-entropy loss                   │
│  Backward pass + Adam optimizer step       │
└────────────────────────────────────────────┘
    │
    ▼
FittedUmap with .embedding() and .transform()
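The sampling in the middle box — edge subsampling plus negative sampling — can be sketched as below. This is a CPU schematic with a toy PRNG, not the crate's pre-batched GPU implementation; the 50 K cap and neg_sample_rate = 5 come from the text above:

```rust
// Schematic of one epoch's sampling: cap the positive edges, then draw
// neg_sample_rate random negatives per positive edge. Illustrative only.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        // Knuth's 64-bit LCG constants; good enough for a sketch.
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }

    fn below(&mut self, n: u64) -> u64 {
        self.next_u64() % n
    }
}

fn main() {
    let (n, k) = (10_000u64, 15u64); // samples, n_neighbors
    let cap = 50_000; // edge subsampling cap
    let neg_sample_rate = 5;

    let n_pos = (n * k).min(cap); // positive edges this epoch
    let n_neg = n_pos * neg_sample_rate; // negatives: n_pos × rate

    // Draw negative endpoints (random points treated as non-neighbours).
    let mut rng = Lcg(42);
    let negatives: Vec<u64> = (0..n_neg).map(|_| rng.below(n)).collect();

    println!("positive edges/epoch:   {n_pos}");
    println!("negative samples/epoch: {}", negatives.len());
}
```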

Modules

Module Description
[model] UMAPModel neural network and config builder
[train] Training loop, UmapConfig, sparse training, loss computation
[chart] 2-D scatter plots and loss curves (plotters)
[utils] Data generation, tensor conversion, normalisation
[kernels] Custom CubeCL GPU kernels (Euclidean distance, k-NN)
[backend] Backend trait extension for custom kernel dispatch
[distances] CPU-side distance functions (Euclidean, cosine, Minkowski…)
[prelude] Re-exports of the most commonly used items

Legacy API

The original UMAP struct and train() function are still available for backward compatibility:

use fast_umap::prelude::*;

// Legacy one-liner
let umap: fast_umap::UMAP<MyAutodiffBackend> = umap(data.clone());
let embedding = umap.transform(data);

// Legacy manual training
let (model, losses, best_loss) = fast_umap::train::train(
    "my_run", model, num_samples, num_features,
    train_data, &config, device, exit_rx, None,
);

Note: The legacy API uses the dense O(n²) training path. Use the new Umap::new(config).fit(data) API for the optimized sparse training path.


Testing

All tests run on CPU (burn::backend::NdArray) — no GPU required.

cargo test

Category What is covered
normalize_data correctness, zero-mean/unit-std, constant columns
format_duration zero, seconds, minutes, hours
generate_test_data shape, bounds [0, 1)
tensor_convert round-trip Vec → Tensor → Vec, NaN → 0
normalize_tensor output in [0, 1], constant-input safety
layer_normalize no NaN, shape preserved
UMAPModelConfigBuilder defaults and custom values
TrainingConfig builder, Metric::from(&str), invalid-metric panic
UMAPModel 2-D / 3-D output, deep network, determinism, no NaN
Distance math Euclidean self=0, 3-4-5 triangle, symmetry; Manhattan

Micro-benchmarks

Reproduce: ./bench.sh --only hardware

Full detail files: cpu_apple_m3_max.md · gpu_apple_m3_max.md · comparison_apple_m3_max.md · cpu_apple_silicon_aarch64.md

CPU — Apple M3 Max (NdArray backend)

CPU benchmark chart

Benchmark Input Min Mean Max
normalize_data 100×10 345 µs 517 µs 986 µs
normalize_data 500×30 1.92 ms 2.31 ms 2.77 ms
normalize_data 1 000×50 4.35 ms 4.78 ms 5.79 ms
normalize_data 5 000×100 16.4 ms 18.2 ms 19.4 ms
generate_test_data 100×10 3.62 µs 4.03 µs 5.92 µs
generate_test_data 500×30 56.3 µs 57.8 µs 61.9 µs
generate_test_data 1 000×50 246 µs 258 µs 288 µs
generate_test_data 5 000×100 2.46 ms 2.48 ms 2.54 ms
tensor_convert 100×10 5.08 µs 5.16 µs 5.33 µs
tensor_convert 500×30 32.4 µs 33.0 µs 35.8 µs
tensor_convert 1 000×50 74.6 µs 78.1 µs 89.5 µs
model_forward 16s×10f [32]→2 21.4 µs 34.5 µs 55.8 µs
model_forward 64s×50f [64]→2 26.7 µs 34.2 µs 51.2 µs
model_forward 128s×50f [128]→2 55.9 µs 58.6 µs 64.2 µs
model_forward 64s×100f [128,64]→3 70.0 µs 80.3 µs 106 µs
model_forward 256s×100f [256,128]→2 279 µs 293 µs 310 µs
normalize_tensor n=64 1.88 µs 1.95 µs 2.29 µs
normalize_tensor n=512 2.79 µs 2.86 µs 3.00 µs
normalize_tensor n=4 096 9.50 µs 9.59 µs 10.3 µs
normalize_tensor n=32 768 70.0 µs 70.3 µs 70.7 µs
layer_normalize 32×16 3.50 µs 3.62 µs 3.96 µs
layer_normalize 128×64 19.7 µs 20.0 µs 20.9 µs
layer_normalize 512×128 115 µs 117 µs 130 µs
layer_normalize 1 000×256 412 µs 420 µs 454 µs

GPU — Apple M3 Max (WGPU / Metal)

GPU benchmark chart

Benchmark Input Min Mean Max
model_forward 16s×10f [32]→2 408 µs 617 µs 894 µs
model_forward 64s×50f [64]→2 430 µs 481 µs 776 µs
model_forward 128s×50f [128]→2 432 µs 475 µs 576 µs
model_forward 64s×100f [128,64]→3 549 µs 688 µs 1.82 ms
model_forward 256s×100f [256,128]→2 631 µs 691 µs 828 µs
model_forward 512s×100f [256,128]→2 926 µs 1.08 ms 1.42 ms
normalize_tensor n=512 572 µs 695 µs 1.28 ms
normalize_tensor n=4 096 590 µs 662 µs 821 µs
normalize_tensor n=32 768 629 µs 712 µs 883 µs
normalize_tensor n=262 144 1.08 ms 1.12 ms 1.22 ms
layer_normalize 128×64 437 µs 471 µs 609 µs
layer_normalize 512×128 467 µs 500 µs 648 µs
layer_normalize 1 000×256 617 µs 662 µs 777 µs
layer_normalize 4 000×512 1.81 ms 1.93 ms 2.15 ms

CPU vs GPU — Apple M3 Max

CPU vs GPU comparison

Benchmark Input CPU GPU Speedup
model_forward 16s×10f [32]→2 34.5 µs 617 µs 0.06× (CPU faster)
model_forward 64s×50f [64]→2 34.2 µs 481 µs 0.07× (CPU faster)
model_forward 128s×50f [128]→2 58.6 µs 475 µs 0.12× (CPU faster)
model_forward 64s×100f [128,64]→3 80.3 µs 688 µs 0.12× (CPU faster)
model_forward 256s×100f [256,128]→2 293 µs 691 µs 0.42× (CPU faster)
normalize_tensor n=512 2.86 µs 695 µs 0.00× (CPU faster)
normalize_tensor n=4 096 9.59 µs 662 µs 0.01× (CPU faster)
normalize_tensor n=32 768 70.3 µs 712 µs 0.10× (CPU faster)
layer_normalize 128×64 20.0 µs 471 µs 0.04× (CPU faster)
layer_normalize 512×128 117 µs 500 µs 0.23× (CPU faster)
layer_normalize 1 000×256 420 µs 662 µs 0.63× (CPU faster)

Note: WGPU/Metal has a fixed dispatch overhead of ~400–700 µs per kernel call. For the small model sizes above, that overhead dominates. The GPU wins in full UMAP training loops over thousands of samples where operations are chained without intermediate CPU readbacks.

Running benchmarks

# All benchmarks (hardware + comparison + MNIST)
./bench.sh

# + Criterion statistics (~5 min)
./bench.sh --criterion

# Just hardware micro-benchmarks
./bench.sh --only hardware

Roadmap

  • MNIST dataset example with intermediate plots
  • Charting behind a feature flag
  • Labels in plots
  • Batching + accumulated gradient
  • CubeCL kernels for distance computation
  • Hyperparameter testbench (patience vs n_features vs epochs …)
  • Unit tests (36) and hardware benchmarks
  • New API mirroring umap-rs (Umap, FittedUmap, UmapConfig)
  • Sparse training with edge subsampling + negative sampling
  • Crate comparison benchmark (fast-umap vs umap-rs)
  • PCA warm-start for initial embedding
  • Approximate KNN (NN-descent) for datasets > 50K

License

MIT — see LICENSE.

Copyright

2024-2026, Eugene Hauptmann

Citations

If you use fast-umap in research or a project, please cite the original UMAP paper, this repository, and acknowledge the Burn and CubeCL frameworks:

fast-umap

@software{hauptmann2024fastumap,
  title   = {fast-umap: GPU-Accelerated UMAP in Rust},
  author  = {Hauptmann, Eugene},
  year    = {2024},
  url     = {https://github.com/eugenehp/fast-umap},
  version = {1.2.2}
}

Hauptmann, E. (2024). fast-umap: GPU-Accelerated UMAP in Rust (v1.2.2). https://github.com/eugenehp/fast-umap

UMAP algorithm

@article{mcinnes2018umap,
  title   = {{UMAP}: Uniform Manifold Approximation and Projection for Dimension Reduction},
  author  = {McInnes, Leland and Healy, John and Melville, James},
  journal = {arXiv preprint arXiv:1802.03426},
  year    = {2018},
  url     = {https://arxiv.org/abs/1802.03426}
}

McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426. https://arxiv.org/abs/1802.03426

Burn deep-learning framework

@software{burn2024,
  title   = {Burn: A Flexible and Comprehensive Deep Learning Framework in Rust},
  author  = {{Tracel AI}},
  year    = {2024},
  url     = {https://github.com/tracel-ai/burn}
}

CubeCL GPU compute framework

@software{cubecl2024,
  title  = {CubeCL: Multi-platform GPU Compute Language for Rust},
  author = {{Tracel AI}},
  year   = {2024},
  url    = {https://github.com/tracel-ai/cubecl}
}

Thank you

Inspired by the original UMAP paper.