pacmap
A Rust implementation of PaCMAP (Pairwise Controlled Manifold Approximation) for dimensionality reduction, based on the original Python implementation.
Overview
Dimensionality reduction transforms high-dimensional data into a lower-dimensional representation while preserving important relationships between points. This is useful for visualization, analysis, and as preprocessing for other algorithms.
PaCMAP is a relatively recent dimensionality reduction technique that preserves both local and global structure through three types of point relationships:
- Nearest neighbor pairs preserve local structure
- Mid-near pairs preserve intermediate structure
- Far pairs prevent collapse and maintain separation
For details on the algorithm, see the original paper.
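Each of the three pair types contributes its own term to the loss that PaCMAP minimizes. The sketch below follows the loss terms published in the PaCMAP paper; it is not taken from this crate's source, and the function names are hypothetical, chosen only for illustration.

```rust
/// Scaled distance used by the PaCMAP loss: squared Euclidean distance
/// between two embedding points, plus one.
fn d_tilde(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>() + 1.0
}

/// Attractive term for a nearest neighbor pair (saturates quickly,
/// so it mostly acts on points that are already close).
fn neighbor_loss(d: f32) -> f32 {
    d / (10.0 + d)
}

/// Attractive term for a mid-near pair (saturates slowly,
/// so it still pulls on points that are far apart).
fn mid_near_loss(d: f32) -> f32 {
    d / (10_000.0 + d)
}

/// Repulsive term for a far pair (shrinks as the points move apart).
fn far_loss(d: f32) -> f32 {
    1.0 / (1.0 + d)
}
```

The total loss is a weighted sum of these terms over all sampled pairs, with the weights changing across the optimization phases.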
Features
- Fast approximate nearest neighbors for large datasets using USearch
- SIMD-optimized distance calculations
- Parallel processing with Rayon
- Optional PCA initialization using various BLAS backends
- Reproducible results with optional seeding
Usage
Basic usage with default parameters:
use Result;
use Array2;
use RandomExt;
use Uniform;
use ;
Customized embedding:
use Result;
use ;
Capturing intermediate states:
use Result;
use Configuration;
For a standalone example, see the pacmap-rs-example repository.
Configuration
Core Parameters
- embedding_dimensions: Output dimensionality (default: 2)
- initialization: How to initialize coordinates:
  - Pca - Project data using PCA (default)
  - Value(array) - Use provided coordinates
  - Random(seed) - Random initialization with optional seed
- learning_rate: Learning rate for Adam optimizer (default: 1.0)
- num_iters: Iteration counts for three optimization phases (default: (100, 100, 250)):
  - Mid-near weight reduction phase
  - Balanced weight phase
  - Local structure focus phase
- snapshots: Optional vector of iterations at which to save embedding states
- approx_threshold: Number of samples above which to use approximate nearest neighbors (default: 8,000)
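To make the three num_iters phases concrete, the sketch below computes the pair weights at a given iteration using the schedule from the reference Python implementation (mid-near weight decaying linearly from 1000 to 3, then a balanced phase, then mid-near pairs switched off). Whether this crate uses exactly these constants is an assumption; PairWeights and weights_at are hypothetical names, not part of the crate's API.

```rust
/// Weights applied to the three pair types at one iteration.
#[derive(Debug, Clone, Copy, PartialEq)]
struct PairWeights {
    neighbor: f32,
    mid_near: f32,
    far: f32,
}

/// Returns the weights for iteration `iter`, given the lengths of the
/// three phases. Constants follow the reference Python implementation
/// (an assumption with respect to this crate).
fn weights_at(iter: usize, num_iters: (usize, usize, usize)) -> PairWeights {
    const MID_NEAR_START: f32 = 1000.0;
    const MID_NEAR_END: f32 = 3.0;
    let (phase1, phase2, _phase3) = num_iters;

    if iter < phase1 {
        // Phase 1: mid-near weight decays linearly toward MID_NEAR_END.
        let t = iter as f32 / phase1 as f32;
        PairWeights {
            neighbor: 2.0,
            mid_near: (1.0 - t) * MID_NEAR_START + t * MID_NEAR_END,
            far: 1.0,
        }
    } else if iter < phase1 + phase2 {
        // Phase 2: balanced weights.
        PairWeights { neighbor: 3.0, mid_near: 3.0, far: 1.0 }
    } else {
        // Phase 3: mid-near pairs are dropped to refine local structure.
        PairWeights { neighbor: 1.0, mid_near: 0.0, far: 1.0 }
    }
}
```

With the default (100, 100, 250), calling weights_at(50, (100, 100, 250)) falls halfway through the first phase, and any iteration from 200 onward uses the local structure weights.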
Pair Sampling Parameters
- mid_near_ratio: Ratio of mid-near to nearest neighbor pairs (default: 0.5)
- far_pair_ratio: Ratio of far to nearest neighbor pairs (default: 2.0)
- override_neighbors: Optional fixed neighbor count override (default: None, auto-scaled with dataset size)
- seed: Optional random seed for reproducible sampling and initialization
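The two ratios are easiest to read as per-point pair counts derived from the neighbor count. The sketch below assumes the auto-scaling rule of the reference Python implementation (10 neighbors up to 10,000 samples, growing logarithmically beyond that); this crate's exact rule may differ, and default_neighbors and pair_counts are hypothetical helper names.

```rust
/// Neighbor count auto-scaled with dataset size, following the rule in
/// the reference Python implementation (assumed, not confirmed, here).
fn default_neighbors(n_samples: usize) -> usize {
    if n_samples <= 10_000 {
        10
    } else {
        (10.0 + 15.0 * ((n_samples as f64).log10() - 4.0)).round() as usize
    }
}

/// Number of mid-near and far pairs sampled per point, derived from the
/// neighbor count and the two ratios.
fn pair_counts(n_neighbors: usize, mid_near_ratio: f64, far_pair_ratio: f64) -> (usize, usize) {
    let n_mid_near = (n_neighbors as f64 * mid_near_ratio).round() as usize;
    let n_far = (n_neighbors as f64 * far_pair_ratio).round() as usize;
    (n_mid_near, n_far)
}
```

Under these assumptions, a dataset of 5,000 samples with the default ratios gets 10 nearest neighbor pairs, 5 mid-near pairs, and 20 far pairs per point.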
Pair Configuration
- PairConfiguration::Generate - Generate all pairs from scratch (default)
- PairConfiguration::NeighborsProvided { pair_neighbors } - Use provided nearest neighbors, generate remaining pairs
- PairConfiguration::AllProvided { pair_neighbors, pair_mn, pair_fp } - Use all provided pairs
Feature Flags
BLAS/LAPACK Backends
Enable only one BLAS/LAPACK backend feature at a time. A backend is required for PCA operations, except on macOS, which uses Accelerate by default.
- intel-mkl-static - Static linking with Intel MKL
- intel-mkl-system - Dynamic linking with system Intel MKL
- openblas-static - Static linking with OpenBLAS
- openblas-system - Dynamic linking with system OpenBLAS
- netlib-static - Static linking with Netlib
- netlib-system - Dynamic linking with system Netlib
For example:
[dependencies]
pacmap = { version = "0.2", features = ["openblas-static"] }
See ndarray-linalg's documentation for detailed information about BLAS/LAPACK configuration and performance considerations.
Performance Features
- simsimd - Enable SIMD optimizations in USearch for faster approximate nearest neighbor search. Requires GCC 13+ for compilation and a recent glibc at runtime.
Limitations
This implementation currently:
- Only supports Euclidean distances
- Does not support incremental transform
References
Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMap, and PaCMAP for Data Visualization. Wang, Y., Huang, H., Rudin, C., & Shaposhnik, Y. (2021). Journal of Machine Learning Research, 22(201), 1-73.
License
Apache License, Version 2.0