scirs2-cluster 0.1.4

Clustering algorithms module for SciRS2 (scirs2-cluster)

SciRS2 Clustering Module


A comprehensive clustering module for the SciRS2 scientific computing library in Rust (v0.1.4). Following the SciRS2 POLICY, this crate provides production-ready implementations of various clustering algorithms with a focus on performance, SciPy compatibility, ecosystem consistency, and idiomatic Rust code.

Production Readiness - Stable Release

🎯 Version 0.1.4 (SciRS2 POLICY compliant, with enhanced performance) is ready for production use with:

  • 189+ comprehensive tests covering all algorithms and edge cases
  • Zero warnings policy enforced across all code and examples
  • Full SciPy API compatibility maintained for seamless migration
  • Extensive documentation with working examples for all features
  • Performance optimizations including SIMD and parallel processing

Stability & Performance

Algorithm Maturity

  • Core algorithms (K-means, Hierarchical, DBSCAN) are thoroughly tested and production-ready
  • Advanced algorithms (Spectral, BIRCH, GMM, HDBSCAN) are fully implemented with comprehensive test coverage
  • All APIs are stable and maintain backward compatibility with SciPy interfaces

Performance Characteristics

  • Optimized Ward's method: O(n² log n) complexity vs standard O(n³)
  • SIMD acceleration: Up to 4x faster distance computations on supported hardware
  • Parallel processing: Multi-core implementations for K-means and hierarchical clustering
  • Memory efficiency: Streaming and chunked processing for large datasets (>10M points)

Features

  • Vector Quantization

    • K-means clustering with multiple initialization methods
    • K-means++ smart initialization
    • kmeans2 with SciPy-compatible interface
    • Mini-batch K-means for large datasets
    • Parallel K-means for multi-core systems
    • Data whitening/normalization utilities
  • Hierarchical Clustering

    • Agglomerative clustering with multiple linkage methods:
      • Single linkage (minimum distance)
      • Complete linkage (maximum distance)
      • Average linkage
      • Ward's method (minimizes variance)
      • Centroid method (distance between centroids)
      • Median method
      • Weighted average
    • Dendrogram utilities and flat cluster extraction
    • Cluster distance metrics (Euclidean, Manhattan, Chebyshev, Correlation)
  • Density-Based Clustering

    • DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
    • OPTICS (Ordering Points To Identify the Clustering Structure)
    • HDBSCAN (Hierarchical DBSCAN)
    • Support for custom distance metrics
  • Other Algorithms

    • Mean-shift clustering
    • Spectral clustering
    • Affinity propagation
    • BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
    • Gaussian Mixture Models (GMM)
    • Leader algorithm (single-pass clustering with hierarchical tree support)
  • Evaluation Metrics

    • Silhouette coefficient
    • Davies-Bouldin index
    • Calinski-Harabasz index
    • Adjusted Rand Index
    • Normalized Mutual Information
    • Homogeneity, Completeness, and V-measure

Installation

Add this to your Cargo.toml:

[dependencies]
scirs2-cluster = "0.1.4"
ndarray = "0.15"

To enable SIMD and parallel optimizations (provided through the SciRS2 core module), add the corresponding feature flags:

[dependencies]
scirs2-cluster = { version = "0.1.4", features = ["parallel", "simd"] }

Usage

K-means Example

use ndarray::Array2;
use scirs2_cluster::vq::{kmeans, KMeansOptions, KMeansInit};

// Create a dataset
let data = Array2::from_shape_vec((6, 2), vec![
    1.0, 2.0,
    1.2, 1.8,
    0.8, 1.9,
    3.7, 4.2,
    3.9, 3.9,
    4.2, 4.1,
]).unwrap();

// Configure K-means
let options = KMeansOptions {
    init_method: KMeansInit::KMeansPlusPlus,
    max_iter: 300,
    ..Default::default()
};

// Run k-means with k=2
let (centroids, labels) = kmeans(data.view(), 2, Some(options)).unwrap();

println!("Centroids: {:?}", centroids);
println!("Cluster assignments: {:?}", labels);

kmeans2 (SciPy-compatible)

use scirs2_cluster::vq::{kmeans2, MinitMethod, MissingMethod, whiten};

// Whiten the data (rescale each feature by its standard deviation) before clustering;
// `data` is the Array2 from the K-means example above
let whitened_data = whiten(&data).unwrap();

// Run kmeans2 with different initialization methods
let (centroids, labels) = kmeans2(
    whitened_data.view(),
    3,                             // k clusters
    Some(10),                      // iterations
    Some(1e-4),                    // threshold
    Some(MinitMethod::PlusPlus),   // K-means++ initialization
    Some(MissingMethod::Warn),     // warn on empty clusters
    Some(true),                    // check finite values
    Some(42),                      // random seed
).unwrap();

Mini-batch K-means

use scirs2_cluster::vq::{minibatch_kmeans, MiniBatchKMeansOptions};

// Configure mini-batch K-means
let options = MiniBatchKMeansOptions {
    batch_size: 1024,
    max_iter: 100,
    ..Default::default()
};

// Run clustering on large dataset
let (centroids, labels) = minibatch_kmeans(large_data.view(), 5, Some(options)).unwrap();

Hierarchical Clustering Example

use ndarray::Array2;
use scirs2_cluster::hierarchy::{linkage, fcluster, LinkageMethod};

// Create a dataset
let data = Array2::from_shape_vec((6, 2), vec![
    1.0, 2.0,
    1.2, 1.8,
    0.8, 1.9,
    3.7, 4.2,
    3.9, 3.9,
    4.2, 4.1,
]).unwrap();

// Calculate linkage matrix using Ward's method
let linkage_matrix = linkage(data.view(), LinkageMethod::Ward, None).unwrap();

// Form flat clusters by cutting the dendrogram
let num_clusters = 2;
let labels = fcluster(&linkage_matrix, num_clusters, None).unwrap();

println!("Cluster assignments: {:?}", labels);

Evaluation Metrics

use scirs2_cluster::metrics::{silhouette_score, davies_bouldin_score, calinski_harabasz_score};

// Evaluate clustering quality
let silhouette = silhouette_score(data.view(), labels.view()).unwrap();
let db_score = davies_bouldin_score(data.view(), labels.view()).unwrap();
let ch_score = calinski_harabasz_score(data.view(), labels.view()).unwrap();

println!("Silhouette score: {}", silhouette);
println!("Davies-Bouldin score: {}", db_score);
println!("Calinski-Harabasz score: {}", ch_score);

DBSCAN Example

use ndarray::Array2;
use scirs2_cluster::density::{dbscan, labels};

// Create a dataset with clusters and noise
let data = Array2::from_shape_vec((8, 2), vec![
    1.0, 2.0,   // Cluster 1
    1.5, 1.8,   // Cluster 1
    1.3, 1.9,   // Cluster 1
    5.0, 7.0,   // Cluster 2
    5.1, 6.8,   // Cluster 2
    5.2, 7.1,   // Cluster 2
    0.0, 10.0,  // Noise
    10.0, 0.0,  // Noise
]).unwrap();

// Run DBSCAN with eps=0.8 and min_samples=2
let cluster_labels = dbscan(data.view(), 0.8, 2, None).unwrap();

// Count noise points
let noise_count = cluster_labels.iter().filter(|&&label| label == labels::NOISE).count();

println!("Cluster assignments: {:?}", cluster_labels);
println!("Number of noise points: {}", noise_count);

Key Enhancements

Production-Ready SciPy Compatibility

  • Complete API compatibility with SciPy's cluster module
  • Drop-in replacement for most SciPy clustering functions
  • Identical parameter names and behavior for seamless migration
  • Compatible return value formats with proper error handling
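
As an illustration, a SciPy hierarchical workflow maps almost line-for-line onto this crate. The sketch below shows the Rust calls with the roughly equivalent SciPy calls as comments; `data` is an ndarray Array2<f64> as in the Usage examples above.

use scirs2_cluster::hierarchy::{fcluster, linkage, LinkageMethod};

// SciPy: Z = scipy.cluster.hierarchy.linkage(X, method="ward")
let linkage_matrix = linkage(data.view(), LinkageMethod::Ward, None).unwrap();

// SciPy: labels = scipy.cluster.hierarchy.fcluster(Z, t=2, criterion="maxclust")
let labels = fcluster(&linkage_matrix, 2, None).unwrap();

println!("Cluster assignments: {:?}", labels);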

High-Performance Computing

  • SIMD acceleration with automatic fallback for unsupported hardware
  • Multi-core parallelism via Rayon for CPU-intensive operations
  • Memory-efficient streaming for datasets larger than available RAM
  • Optimized algorithms (such as the O(n² log n) Ward's method noted above) that outperform reference implementations

Rust Ecosystem Advantages

  • Memory safety without runtime overhead
  • Zero-copy operations where possible for maximum efficiency
  • Compile-time correctness with comprehensive type checking
  • Predictable performance with no garbage collection pauses
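
The zero-copy point is visible in the APIs shown earlier: clustering functions accept array views, so your own code can pass data by view without cloning it. A minimal sketch; the helper function is illustrative, and its parameter types are assumptions based on the Usage examples:

use ndarray::ArrayView2;
use scirs2_cluster::vq::kmeans;

// Illustrative helper: borrows the caller's data as a view, so the matrix is never copied.
fn quick_kmeans(data: ArrayView2<'_, f64>, k: usize) {
    let (centroids, labels) = kmeans(data, k, None).unwrap();
    println!("{} centroids for {} points", centroids.nrows(), labels.len());
}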

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Contributing

Contributions are welcome! Please see the project's CONTRIBUTING.md file for guidelines.