Crate pacmap

§PaCMAP: Pairwise Controlled Manifold Approximation

This crate provides a Rust implementation of PaCMAP (Pairwise Controlled Manifold Approximation), a dimensionality reduction technique that preserves both local and global structure of high-dimensional data.

The resulting low-dimensional embedding is useful for visualization, exploratory analysis, and as a preprocessing step for other algorithms.

§Key Features

PaCMAP preserves both local and global structure through three types of point relationships:

  • Nearest neighbor pairs preserve local structure
  • Mid-near pairs preserve intermediate structure
  • Far pairs prevent collapse and maintain separation
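For intuition, the per-pair loss terms described in the PaCMAP paper (cited under References) can be sketched as below. This is a std-only illustration and not this crate's code: the function names are hypothetical, and whether the crate uses exactly these constants is an assumption.

```rust
/// Squared Euclidean distance between two embedding points, plus one.
/// The paper writes its loss terms in this quantity.
fn d_tilde(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "points must share a dimension");
    1.0 + a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>()
}

/// Attractive term for nearest neighbor pairs: grows with distance.
fn near_loss(d: f32) -> f32 {
    d / (10.0 + d)
}

/// Attractive term for mid-near pairs: grows with distance, but saturates later.
fn mid_near_loss(d: f32) -> f32 {
    d / (10_000.0 + d)
}

/// Repulsive term for far pairs: shrinks as the pair moves apart.
fn far_loss(d: f32) -> f32 {
    1.0 / (1.0 + d)
}
```

Minimizing the first two terms pulls a pair together, while minimizing the third pushes a pair apart, which is what keeps the embedding from collapsing.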

The implementation provides:

  • Configurable optimization with adaptive learning rates via the Adam optimizer
  • Phase-based weight schedules to balance local and global preservation
  • Multiple initialization options including PCA and random seeding
  • Optional snapshot capture of intermediate states
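As background on the Adam optimizer mentioned above, here is a minimal std-only sketch of a textbook Adam update. The struct, its fields, and the hyperparameter values (beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are conventional defaults chosen for illustration; they are assumptions, not values taken from this crate.

```rust
/// State for a textbook Adam optimizer over a flat parameter vector.
/// Hypothetical type for illustration; not part of the pacmap API.
struct Adam {
    lr: f32,
    beta1: f32,
    beta2: f32,
    eps: f32,
    m: Vec<f32>, // running mean of gradients
    v: Vec<f32>, // running mean of squared gradients
    t: i32,      // number of steps taken so far
}

impl Adam {
    fn new(lr: f32, n_params: usize) -> Self {
        Adam {
            lr,
            beta1: 0.9,   // assumed conventional default
            beta2: 0.999, // assumed conventional default
            eps: 1e-8,    // assumed conventional default
            m: vec![0.0; n_params],
            v: vec![0.0; n_params],
            t: 0,
        }
    }

    /// Applies one bias-corrected Adam update to `params` in place.
    fn step(&mut self, params: &mut [f32], grads: &[f32]) {
        assert_eq!(params.len(), grads.len());
        assert_eq!(params.len(), self.m.len());
        self.t += 1;
        // Bias-correction factors for the two running means.
        let c1 = 1.0 - self.beta1.powi(self.t);
        let c2 = 1.0 - self.beta2.powi(self.t);
        for i in 0..params.len() {
            self.m[i] = self.beta1 * self.m[i] + (1.0 - self.beta1) * grads[i];
            self.v[i] = self.beta2 * self.v[i] + (1.0 - self.beta2) * grads[i] * grads[i];
            let m_hat = self.m[i] / c1;
            let v_hat = self.v[i] / c2;
            params[i] -= self.lr * m_hat / (v_hat.sqrt() + self.eps);
        }
    }
}
```

Because each coordinate is scaled by its own gradient history, the effective step size adapts per parameter; on the very first step, every coordinate moves by roughly `lr` in the direction opposite its gradient.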

§Examples

Basic usage with default parameters:

use ndarray::Array2;
use pacmap::{Configuration, fit_transform};

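// Placeholder input (1000 samples by 50 features); replace with your own high-dimensional data.
let data: Array2<f32> =
    Array2::from_shape_fn((1000, 50), |(i, j)| ((i * 31 + j * 17) % 101) as f32 / 101.0);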
let config = Configuration::default();
let (embedding, _) = fit_transform(data.view(), config).unwrap();

Customized embedding:

use pacmap::{Configuration, Initialization};

let config = Configuration::builder()
    .embedding_dimensions(3)
    .initialization(Initialization::Random(Some(42)))
    .learning_rate(0.8)
    .num_iters((50, 50, 100))
    .mid_near_ratio(0.3)
    .far_pair_ratio(2.0)
    .build();

Capturing intermediate states:

use pacmap::Configuration;

let config = Configuration::builder()
    .snapshots(vec![100, 200, 300])
    .build();

§Configuration

Core parameters:

  • embedding_dimensions: Output dimensionality (default: 2)
  • initialization: How to initialize coordinates:
    • Pca - Project data using PCA (default)
    • Value(array) - Use provided coordinates
    • Random(seed) - Random initialization with optional seed
  • learning_rate: Learning rate for Adam optimizer (default: 1.0)
  • num_iters: Iteration counts for three optimization phases (default: (100, 100, 250))
  • snapshots: Optional vector of iterations at which to save embedding states
  • approx_threshold: Number of points above which approximate neighbor search is used
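To illustrate what the three num_iters phases control, the sketch below reproduces the weight schedule used by the original Python implementation (linked under References). The type and function names are hypothetical, and it is an assumption that this crate uses the same constants.

```rust
/// Weights applied to the three pair types at one iteration.
/// Hypothetical type for illustration; not part of the pacmap API.
#[derive(Debug, Clone, Copy, PartialEq)]
struct PairWeights {
    mid_near: f32,
    near: f32,
    far: f32,
}

/// Returns the pair weights for iteration `itr`, given the three phase lengths.
fn weights_at(itr: usize, num_iters: (usize, usize, usize)) -> PairWeights {
    // Initial mid-near weight in the reference implementation (assumed here).
    const W_MN_INIT: f32 = 1000.0;
    let (phase1, phase2, _phase3) = num_iters;
    if itr < phase1 {
        // Phase 1: mid-near weight decays linearly from 1000 toward 3,
        // so global structure dominates early on.
        let t = itr as f32 / phase1 as f32;
        PairWeights { mid_near: (1.0 - t) * W_MN_INIT + t * 3.0, near: 2.0, far: 1.0 }
    } else if itr < phase1 + phase2 {
        // Phase 2: balance mid-near and nearest neighbor attraction.
        PairWeights { mid_near: 3.0, near: 3.0, far: 1.0 }
    } else {
        // Phase 3: drop mid-near pairs and refine local structure.
        PairWeights { mid_near: 0.0, near: 1.0, far: 1.0 }
    }
}
```

With the default `(100, 100, 250)`, iterations 0 to 99 fall in the first phase, 100 to 199 in the second, and 200 to 449 in the third.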

Pair sampling parameters:

  • mid_near_ratio: Ratio of mid-near to nearest neighbor pairs (default: 0.5)
  • far_pair_ratio: Ratio of far to nearest neighbor pairs (default: 2.0)
  • override_neighbors: Optional fixed neighbor count override
  • seed: Optional random seed for reproducible sampling
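As a rough illustration of how the two ratios translate into per-point pair counts, the following sketch multiplies a neighbor count by each ratio and rounds, as the original Python implementation does. The function name is hypothetical, and the neighbor count of 10 used in the note below is the reference implementation's default, not a value confirmed for this crate.

```rust
/// Per-point pair counts derived from a neighbor count and the two ratios.
/// Returns `(nearest, mid_near, far)`.
fn pair_counts(
    n_neighbors: usize,
    mid_near_ratio: f32,
    far_pair_ratio: f32,
) -> (usize, usize, usize) {
    // Each ratio is relative to the number of nearest neighbor pairs.
    let n_mid_near = (n_neighbors as f32 * mid_near_ratio).round() as usize;
    let n_far = (n_neighbors as f32 * far_pair_ratio).round() as usize;
    (n_neighbors, n_mid_near, n_far)
}
```

Under these assumptions, 10 neighbors with the default ratios of 0.5 and 2.0 give 5 mid-near pairs and 20 far pairs per point.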

§Feature Flags

§BLAS/LAPACK Backends

Enable at most one BLAS/LAPACK backend feature at a time. A backend is required for PCA operations, except on macOS, where Accelerate is used by default.

  • intel-mkl-static - Static linking with Intel MKL
  • intel-mkl-system - Dynamic linking with system Intel MKL
  • openblas-static - Static linking with OpenBLAS
  • openblas-system - Dynamic linking with system OpenBLAS
  • netlib-static - Static linking with Netlib
  • netlib-system - Dynamic linking with system Netlib

For more details on BLAS/LAPACK configuration, see the ndarray-linalg documentation.
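A backend is selected through Cargo features. The fragment below is a sketch of a Cargo.toml entry using one of the feature names listed above; the version requirement is a placeholder, not a recommendation.

```toml
[dependencies]
# "*" is a placeholder; pin the version you actually depend on.
pacmap = { version = "*", features = ["openblas-static"] }
```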

§Performance Features

  • simsimd - Enable SIMD optimizations in USearch for faster approximate nearest neighbor search. Requires GCC 13+ for compilation and a recent glibc at runtime.

§Implementation Notes

  • Supports both exact and approximate nearest neighbor search
  • Uses Euclidean distances for pair relationships
  • Leverages ndarray for efficient matrix operations
  • Employs parallel iterators via rayon for performance
  • Provides detailed error handling with custom error types
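To make the exact search option concrete, here is a minimal brute-force k-nearest-neighbor sketch under Euclidean distance, using only the standard library. It is not the crate's knn module or its API; the function names are hypothetical, and the quadratic cost of this approach is the usual reason to switch to approximate search on large inputs.

```rust
/// Squared Euclidean distance between two points of equal dimension.
/// Sorting by squared distance gives the same order as sorting by distance.
fn sq_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// For each point, returns the indices of its `k` nearest other points,
/// closest first, by comparing against every other point.
fn exact_knn(points: &[Vec<f32>], k: usize) -> Vec<Vec<usize>> {
    points
        .iter()
        .enumerate()
        .map(|(i, p)| {
            // Distances from point `i` to every other point.
            let mut others: Vec<(usize, f32)> = points
                .iter()
                .enumerate()
                .filter(|&(j, _)| j != i)
                .map(|(j, q)| (j, sq_dist(p, q)))
                .collect();
            // `total_cmp` gives a total order on floats, so sorting cannot panic.
            others.sort_by(|a, b| a.1.total_cmp(&b.1));
            others.into_iter().take(k).map(|(j, _)| j).collect()
        })
        .collect()
}
```

If `k` exceeds the number of other points, each result simply contains all of them.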

§References

Wang, Y., Huang, H., Rudin, C., & Shaposhnik, Y. (2021). Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMap, and PaCMAP for Data Visualization. Journal of Machine Learning Research, 22(201), 1-73.

Original Python implementation: https://github.com/YingfanWang/PaCMAP

Modules§

knn
K-nearest neighbor computation for PaCMAP dimensionality reduction.

Structs§

Configuration
Configuration options for the PaCMAP embedding process.
ConfigurationBuilder
Use builder syntax to set the inputs and finish with build().

Enums§

Initialization
Methods for initializing the embedding coordinates.
PaCMapError
Errors that can occur during PaCMAP embedding.
PairConfiguration
Strategy for sampling pairs during optimization.

Functions§

fit_transform
Reduces dimensionality of input data using PaCMAP.