RAG-UMAP: A Rust Implementation of UMAP
A pure Rust implementation of UMAP (Uniform Manifold Approximation and Projection) for dimension reduction, based on the paper by Leland McInnes, John Healy, and James Melville.
Features
- Complete UMAP Implementation: All core algorithms from the original paper
- Multiple Distance Metrics: Euclidean, Manhattan, Cosine, Hamming, Chebyshev, Minkowski, and Jaccard
- Approximate Nearest Neighbors: NN-Descent algorithm for efficient k-NN search
- Spectral Initialization: Power iteration method for robust spectral embedding
- Stochastic Gradient Descent: Optimized SGD with negative sampling
- Configurable Parameters: Full control over all UMAP hyperparameters
Algorithms Implemented
- Algorithm 1: Main UMAP algorithm
- Algorithm 2: Local fuzzy simplicial set construction
- Algorithm 3: Smooth k-NN distance computation
- Algorithm 4: Spectral embedding initialization
- Algorithm 5: SGD optimization with negative sampling
Quick Start
Add this to your Cargo.toml:
[]
= "0.1.0"
Basic Usage
use ;
// Your high-dimensional embeddings data
let embeddings = vec!;
// Convert to 2D using UMAP
let embedding_2d = convert_to_2d?;
println!;
// Convert to 3D using UMAP
let embedding_3d = convert_to_3d?;
println!;
Supported Input Types
The conversion functions accept any type that implements Into<f64> + Copy:
// Works with integers
let int_embeddings = vec!;
let result = convert_to_2d?;
// Works with floats
let float_embeddings = vec!;
let result = convert_to_2d?;
Algorithm Parameters
The library uses optimized default parameters internally:
| Parameter | Description | Default |
|---|---|---|
n_neighbors |
Number of nearest neighbors | 15 (or data size - 1 if smaller) |
n_components |
Target embedding dimension | 2 for convert_to_2d, 3 for convert_to_3d |
min_dist |
Minimum distance between points in embedding | 0.1 |
n_epochs |
Number of optimization epochs | 200 |
negative_sample_rate |
Negative samples per positive sample | 5 |
spread |
Spread parameter for low-dimensional representation | 1.0 |
local_connectivity |
Local connectivity parameter | 1.0 |
repulsion_strength |
Repulsion strength parameter | 1.0 |
How UMAP Works
- Local Structure: Preserves fine-grained neighborhood relationships
- Global Structure: Maintains overall data topology
- Automatic Scaling: Adapts neighbor count to dataset size
- Euclidean Distance: Uses L2 distance metric internally
Examples
See the examples/ directory for complete examples:
Performance
This implementation includes several optimizations:
- NN-Descent: O(N^1.14) approximate nearest neighbor search
- Power Iteration: Robust spectral embedding for large graphs
- Negative Sampling: Efficient repulsive force computation
- Sparse Representations: Memory-efficient fuzzy simplicial sets
Theoretical Background
UMAP is based on:
- Riemannian Geometry: Local manifold approximation
- Algebraic Topology: Fuzzy simplicial set representations
- Category Theory: Functors between metric spaces and topological structures
The algorithm constructs a high-dimensional fuzzy topological representation of the data, then optimizes a low-dimensional representation to match this structure using cross-entropy minimization.
Comparison with Other Methods
| Method | Local Structure | Global Structure | Scalability | Embedding Dimension |
|---|---|---|---|---|
| UMAP | ✓ | ✓ | High | Any |
| t-SNE | ✓ | Limited | Medium | Typically 2-3 |
| PCA | Limited | ✓ | High | Any |
| Isomap | ✓ | ✓ | Low | Any |
Limitations
- Assumes data lies on a manifold
- May find structure in noise (constellation effect)
- Local distance prioritized over global distance
- Requires parameter tuning for optimal results
License
This implementation is based on the UMAP paper:
McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426.
Contributing
Contributions are welcome!