scirs2-cluster 0.3.1

Clustering algorithms module for SciRS2 (scirs2-cluster)
Documentation

scirs2-cluster

crates.io License Documentation

Comprehensive clustering algorithms for unsupervised learning in Rust, part of the SciRS2 scientific computing ecosystem.

Overview

scirs2-cluster provides production-ready implementations of classical and modern clustering algorithms with SciPy/scikit-learn compatible APIs. v0.3.1 significantly expands beyond the core algorithms with Gaussian Mixture Models, Self-Organizing Maps, topological clustering, streaming/online methods, fuzzy clustering, deep clustering, Bayesian nonparametric methods, and advanced validation tools.

Features

Partitional Clustering (Vector Quantization)

  • K-means with multiple initialization strategies
  • K-means++ smart initialization (faster convergence)
  • Mini-batch K-means for large-scale datasets
  • Parallel K-means using Rayon
  • kmeans2 with SciPy-compatible interface
  • Data whitening / normalization utilities

Hierarchical Clustering

  • Agglomerative clustering with full linkage method suite: single, complete, average, Ward, centroid, median, weighted
  • Optimized Ward's method: O(n2 log n) vs naive O(n3)
  • Dendrogram utilities and flat cluster extraction (fcluster)
  • Dendrogram export (Newick, JSON)

Density-Based Clustering

  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
  • OPTICS (Ordering Points To Identify the Clustering Structure)
  • HDBSCAN (Hierarchical DBSCAN)
  • Density peaks algorithm
  • Density ratio estimation clustering

Probabilistic and Mixture Models

  • Gaussian Mixture Models (GMM) with full EM algorithm
  • Bayesian GMM with variational inference
  • Dirichlet Process mixture models (nonparametric Bayesian)
  • Probabilistic soft assignments

Prototype-Based and Competitive Learning

  • Self-Organizing Maps (SOM) with hexagonal and rectangular topologies
  • Competitive learning networks
  • Prototype-enhanced clustering with medoid refinement
  • Leader algorithm (single-pass with hierarchical tree)

Spectral and Graph-Based

  • Spectral clustering with multiple Laplacian variants
  • Affinity propagation (exemplar-based)
  • BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
  • Mean-shift clustering

Subspace Clustering

  • Subspace clustering for high-dimensional data
  • Projected clustering and axis-aligned subspace search
  • Advanced subspace methods (subspace_advanced/)

Fuzzy and Soft Clustering

  • Fuzzy c-means (FCM) with membership degree outputs
  • Soft clustering with probabilistic assignments
  • Possibilistic c-means

Topological Clustering

  • Topological data analysis applied to clustering
  • Persistent homology-based cluster boundary detection
  • Mapper algorithm integration

Streaming and Online Clustering

  • Online k-means (incremental updates)
  • ADWIN-based streaming cluster detection
  • CluStream and DenStream for data streams
  • Reservoir sampling for large data streams

Time Series Clustering

  • DTW-based distance for time series k-means
  • Temporal pattern clustering
  • Phase-space clustering

Ensemble and Consensus

  • Consensus clustering via co-association matrices
  • Evidence Accumulation Clustering (EAC)
  • Bagging-based and weighted voting ensembles
  • Stability-based cluster selection

Deep Clustering

  • Deep embedding via autoencoder
  • DEC (Deep Embedded Clustering)
  • Deep adversarial clustering
  • Transformer-based cluster embeddings

Biclustering and Co-clustering

  • Biclustering for simultaneous row/column clustering
  • Co-clustering (information-theoretic)

Evaluation Metrics

  • Silhouette coefficient (individual and average)
  • Davies-Bouldin index
  • Calinski-Harabasz index
  • Gap statistic for optimal k selection
  • Adjusted Rand Index (ARI)
  • Normalized Mutual Information (NMI)
  • Homogeneity, Completeness, V-measure
  • Stability analysis across bootstrap samples

Quick Start

Add to your Cargo.toml:

[dependencies]
scirs2-cluster = "0.3.1"

With parallel processing:

[dependencies]
scirs2-cluster = { version = "0.3.1", features = ["parallel"] }

K-means Clustering

use scirs2_cluster::vq::kmeans;
use scirs2_core::ndarray::Array2;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = Array2::from_shape_vec((6, 2), vec![
        1.0, 2.0,  1.2, 1.8,  0.8, 1.9,
        3.7, 4.2,  3.9, 3.9,  4.2, 4.1,
    ])?;

    let (centroids, labels) = kmeans(data.view(), 2, None, None, None, None)?;

    println!("Centroids: {:?}", centroids);
    println!("Labels: {:?}", labels);
    Ok(())
}

Hierarchical Clustering

use scirs2_cluster::hierarchy::{linkage, fcluster, LinkageMethod};
use scirs2_core::ndarray::Array2;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = Array2::from_shape_vec((6, 2), vec![
        1.0, 2.0,  1.2, 1.8,  0.8, 1.9,
        3.7, 4.2,  3.9, 3.9,  4.2, 4.1,
    ])?;

    let z = linkage(data.view(), LinkageMethod::Ward, None)?;
    let labels = fcluster(&z, 2, None)?;

    println!("Cluster assignments: {:?}", labels);
    Ok(())
}

DBSCAN

use scirs2_cluster::density::dbscan;
use scirs2_core::ndarray::Array2;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = Array2::from_shape_vec((8, 2), vec![
        1.0, 2.0,  1.5, 1.8,  1.3, 1.9,
        5.0, 7.0,  5.1, 6.8,  5.2, 7.1,
        0.0, 10.0, 10.0, 0.0,
    ])?;

    // eps=0.8, min_samples=2
    let labels = dbscan(data.view(), 0.8, 2, None)?;
    println!("Labels (-1 = noise): {:?}", labels);
    Ok(())
}

Gaussian Mixture Model

use scirs2_cluster::probabilistic::GaussianMixtureModel;
use scirs2_core::ndarray::Array2;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = Array2::<f64>::zeros((100, 2)); // replace with real data

    let mut gmm = GaussianMixtureModel::new(3, 100, 1e-6, 42)?;
    gmm.fit(data.view())?;

    let labels = gmm.predict(data.view())?;
    let responsibilities = gmm.predict_proba(data.view())?;
    println!("Soft assignments shape: {:?}", responsibilities.shape());
    Ok(())
}

Cluster Validation

use scirs2_cluster::metrics::{
    silhouette_score, davies_bouldin_score, calinski_harabasz_score,
};
use scirs2_core::ndarray::{Array2, Array1};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = Array2::<f64>::zeros((100, 5));
    let labels = Array1::<usize>::zeros(100);

    let sil = silhouette_score(data.view(), labels.view())?;
    let db  = davies_bouldin_score(data.view(), labels.view())?;
    let ch  = calinski_harabasz_score(data.view(), labels.view())?;

    println!("Silhouette: {:.4}", sil);
    println!("Davies-Bouldin: {:.4}", db);
    println!("Calinski-Harabasz: {:.4}", ch);
    Ok(())
}

Feature Flags

Flag Description
parallel Enable Rayon-based multi-threaded distance computation and fitting
simd SIMD-accelerated distance computations

Related Crates

License

Licensed under the Apache License, Version 2.0. See LICENSE for details.