SciRS2 Clustering Module

A comprehensive clustering module for the SciRS2 scientific computing library in Rust (v0.1.0). Following the SciRS2 POLICY, this crate provides production-ready implementations of various clustering algorithms with a focus on performance, SciPy compatibility, ecosystem consistency, and idiomatic Rust code.

Production Readiness - Stable Release

🎯 Version 0.1.0 (SciRS2 POLICY & Enhanced Performance) is ready for production use with:

- 189+ comprehensive tests covering all algorithms and edge cases
- Zero warnings policy enforced across all code and examples
- Full SciPy API compatibility maintained for seamless migration
- Extensive documentation with working examples for all features
- Performance optimizations including SIMD and parallel processing

Stability & Performance

Algorithm Maturity

- Core algorithms (K-means, Hierarchical, DBSCAN) are thoroughly tested and production-ready
- Advanced algorithms (Spectral, BIRCH, GMM, HDBSCAN) are fully implemented with comprehensive test coverage
- All APIs are stable and maintain backward compatibility with SciPy interfaces

Performance Characteristics

- Optimized Ward's method: O(n² log n) complexity vs standard O(n³)
- SIMD acceleration: Up to 4x faster distance computations on supported hardware
- Parallel processing: Multi-core implementations for K-means and hierarchical clustering
- Memory efficiency: Streaming and chunked processing for large datasets (>10M points)
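To make the chunked, multi-core idea concrete, here is a minimal standard-library sketch of a parallel nearest-centroid assignment. It is not the crate's implementation: the function names `sq_dist`, `nearest`, and `assign_chunked` are hypothetical, and spawning one scoped thread per chunk is a simplification chosen to keep the example dependency-free.

```rust
use std::thread;

/// Squared Euclidean distance between two points of equal dimension.
fn sq_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Index of the centroid closest to `point`.
fn nearest(point: &[f64], centroids: &[Vec<f64>]) -> usize {
    let mut best = 0;
    let mut best_d = f64::INFINITY;
    for (i, c) in centroids.iter().enumerate() {
        let d = sq_dist(point, c);
        if d < best_d {
            best_d = d;
            best = i;
        }
    }
    best
}

/// Assign every point to its nearest centroid, one scoped thread per chunk.
/// (Hypothetical helper; a real implementation would use a thread pool.)
fn assign_chunked(points: &[Vec<f64>], centroids: &[Vec<f64>], chunk_size: usize) -> Vec<usize> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    thread::scope(|s| {
        // Spawn every worker first so the chunks are processed concurrently.
        let handles: Vec<_> = points
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    chunk.iter().map(|p| nearest(p, centroids)).collect::<Vec<usize>>()
                })
            })
            .collect();
        // Join in spawn order so the labels line up with the input rows.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect::<Vec<usize>>()
    })
}
```

Because each chunk only reads the shared centroids and writes its own label vector, no locking is needed; the results are concatenated in input order.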

Features

Vector Quantization
- K-means clustering with multiple initialization methods
- K-means++ smart initialization
- kmeans2 with SciPy-compatible interface
- Mini-batch K-means for large datasets
- Parallel K-means for multi-core systems
- Data whitening/normalization utilities
Hierarchical Clustering
- Agglomerative clustering with multiple linkage methods:
  - Single linkage (minimum distance)
  - Complete linkage (maximum distance)
  - Average linkage
  - Ward's method (minimizes variance)
  - Centroid method (distance between centroids)
  - Median method
  - Weighted average
- Dendrogram utilities and flat cluster extraction
- Cluster distance metrics (Euclidean, Manhattan, Chebyshev, Correlation)
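The linkage rules listed above differ only in how they turn pairwise point distances into a distance between two clusters. The following sketch shows single, complete, and average linkage with the Euclidean metric. The `Linkage` enum and `cluster_distance` function are hypothetical names for illustration (the crate's own type is `LinkageMethod`), and a real implementation would update distances incrementally rather than recompute them.

```rust
/// Hypothetical enum for this sketch; not the crate's `LinkageMethod`.
#[derive(Clone, Copy, Debug)]
enum Linkage {
    Single,
    Complete,
    Average,
}

/// Euclidean distance between two points of equal dimension.
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Distance between two non-empty clusters under the chosen linkage rule.
fn cluster_distance(a: &[Vec<f64>], b: &[Vec<f64>], linkage: Linkage) -> f64 {
    assert!(!a.is_empty() && !b.is_empty(), "clusters must be non-empty");
    // All pairwise distances between a member of `a` and a member of `b`.
    let pairwise = a.iter().flat_map(|p| b.iter().map(move |q| euclidean(p, q)));
    match linkage {
        // Closest pair of points.
        Linkage::Single => pairwise.fold(f64::INFINITY, f64::min),
        // Farthest pair of points.
        Linkage::Complete => pairwise.fold(f64::NEG_INFINITY, f64::max),
        // Mean over all pairs.
        Linkage::Average => pairwise.sum::<f64>() / (a.len() * b.len()) as f64,
    }
}
```

For the clusters {(0,0), (1,0)} and {(3,0), (5,0)} the pairwise distances are 3, 5, 2, and 4, so the sketch yields 2 for single, 5 for complete, and 3.5 for average linkage.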
Density-Based Clustering
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- OPTICS (Ordering Points To Identify the Clustering Structure)
- HDBSCAN (Hierarchical DBSCAN)
- Support for custom distance metrics
Other Algorithms
- Mean-shift clustering
- Spectral clustering
- Affinity propagation
- BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
- Gaussian Mixture Models (GMM)
- Leader algorithm (single-pass clustering with hierarchical tree support)
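As an illustration of what "single-pass" means for the Leader algorithm, here is a minimal sketch of one common formulation: each point joins the nearest existing leader if it lies within a distance threshold, and otherwise becomes a new leader. The function name `leader_cluster`, the Euclidean metric, and the nearest-leader rule are assumptions for this example and may differ from the crate's implementation.

```rust
/// Euclidean distance between two points of equal dimension.
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Single-pass leader clustering (hypothetical helper).
/// Returns the leaders and, for each input point, the index of its leader.
fn leader_cluster(points: &[Vec<f64>], threshold: f64) -> (Vec<Vec<f64>>, Vec<usize>) {
    let mut leaders: Vec<Vec<f64>> = Vec::new();
    let mut labels = Vec::with_capacity(points.len());
    for p in points {
        // Find the closest existing leader, if any.
        let nearest = leaders
            .iter()
            .enumerate()
            .map(|(i, l)| (i, euclidean(p, l)))
            .min_by(|a, b| a.1.total_cmp(&b.1));
        match nearest {
            Some((i, d)) if d <= threshold => labels.push(i),
            _ => {
                // No leader is close enough: this point starts a new cluster.
                leaders.push(p.clone());
                labels.push(leaders.len() - 1);
            }
        }
    }
    (leaders, labels)
}
```

The result depends on the order in which points arrive, which is the usual trade-off for reading the data only once.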
Evaluation Metrics
- Silhouette coefficient
- Davies-Bouldin index
- Calinski-Harabasz index
- Adjusted Rand Index
- Normalized Mutual Information
- Homogeneity, Completeness, and V-measure
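Of the metrics above, the Adjusted Rand Index is a good example of an external measure: it compares two labelings by counting pairs of points that are grouped together in both, then corrects for chance. The sketch below follows the standard contingency-table formula; `adjusted_rand_index` and `pairs` are hypothetical names, and returning 1.0 for degenerate inputs is an assumption of this example rather than documented crate behavior.

```rust
use std::collections::HashMap;

/// Number of unordered pairs among `n` items (n choose 2).
fn pairs(n: usize) -> f64 {
    (n * n.saturating_sub(1)) as f64 / 2.0
}

/// Adjusted Rand Index between two labelings of the same points.
fn adjusted_rand_index(truth: &[usize], pred: &[usize]) -> f64 {
    assert_eq!(truth.len(), pred.len(), "label vectors must have equal length");
    if truth.len() < 2 {
        return 1.0; // assumption: trivially identical partitions
    }
    // Contingency table plus its row and column totals.
    let mut joint: HashMap<(usize, usize), usize> = HashMap::new();
    let mut rows: HashMap<usize, usize> = HashMap::new();
    let mut cols: HashMap<usize, usize> = HashMap::new();
    for (&t, &p) in truth.iter().zip(pred) {
        *joint.entry((t, p)).or_insert(0) += 1;
        *rows.entry(t).or_insert(0) += 1;
        *cols.entry(p).or_insert(0) += 1;
    }
    let sum_joint: f64 = joint.values().map(|&n| pairs(n)).sum();
    let sum_rows: f64 = rows.values().map(|&n| pairs(n)).sum();
    let sum_cols: f64 = cols.values().map(|&n| pairs(n)).sum();
    // Expected index under random labelings with the same cluster sizes.
    let expected = sum_rows * sum_cols / pairs(truth.len());
    let max_index = 0.5 * (sum_rows + sum_cols);
    if max_index == expected {
        return 1.0; // degenerate case, e.g. a single cluster in both labelings
    }
    (sum_joint - expected) / (max_index - expected)
}
```

Two labelings that agree up to a renaming of cluster ids score 1.0, while `[0, 0, 1, 1]` against `[0, 1, 0, 1]` scores -0.5 because the second labeling splits every true pair.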

Installation

Add this to your Cargo.toml:

[dependencies]
scirs2-cluster = "0.1.0"
ndarray = "0.15"

To enable optimizations through the core module, add feature flags:

[dependencies]
scirs2-cluster = { version = "0.1.0", features = ["parallel", "simd"] }

Usage

K-means Example

use ndarray::Array2;
use scirs2_cluster::vq::{kmeans, KMeansOptions, KMeansInit};

// Create a dataset
let data = Array2::from_shape_vec((6, 2), vec![
    1.0, 2.0,
    1.2, 1.8,
    0.8, 1.9,
    3.7, 4.2,
    3.9, 3.9,
    4.2, 4.1,
]).unwrap();

// Configure K-means
let options = KMeansOptions {
    init_method: KMeansInit::KMeansPlusPlus,
    max_iter: 300,
    ..Default::default()
};

// Run k-means with k=2
let (centroids, labels) = kmeans(data.view(), 2, Some(options)).unwrap();

println!("Centroids: {:?}", centroids);
println!("Cluster assignments: {:?}", labels);
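The example above selects `KMeansInit::KMeansPlusPlus`. As a rough picture of what K-means++ seeding does, the following standard-library sketch picks the first centroid uniformly at random and each later one with probability proportional to its squared distance from the nearest centroid chosen so far. Everything here is illustrative: `Lcg` is a toy random number generator standing in for a real RNG crate, and `kmeans_pp_init` is a hypothetical name, not the crate's API.

```rust
/// Tiny linear congruential generator; a stand-in for a real RNG crate.
struct Lcg(u64);

impl Lcg {
    /// Next pseudo-random value in [0, 1).
    fn next_f64(&mut self) -> f64 {
        // Multiplier and increment from Knuth's MMIX generator.
        self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Squared Euclidean distance between two points of equal dimension.
fn sq_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Choose `k` initial centroids from `points` with K-means++ seeding.
fn kmeans_pp_init(points: &[Vec<f64>], k: usize, seed: u64) -> Vec<Vec<f64>> {
    assert!(k >= 1 && k <= points.len(), "need 1 <= k <= number of points");
    let mut rng = Lcg(seed);
    let first = (rng.next_f64() * points.len() as f64) as usize;
    let mut centroids = vec![points[first].clone()];
    // d2[i] = squared distance from point i to its nearest chosen centroid.
    let mut d2: Vec<f64> = points.iter().map(|p| sq_dist(p, &centroids[0])).collect();
    while centroids.len() < k {
        let total: f64 = d2.iter().sum();
        let mut target = rng.next_f64() * total;
        // Fallback for rounding edge cases: the farthest point.
        let mut chosen = (0..d2.len()).fold(0, |best, i| if d2[i] > d2[best] { i } else { best });
        // Walk the weights until the sampled mass is used up.
        for (i, &w) in d2.iter().enumerate() {
            if target < w {
                chosen = i;
                break;
            }
            target -= w;
        }
        centroids.push(points[chosen].clone());
        for (i, p) in points.iter().enumerate() {
            d2[i] = d2[i].min(sq_dist(p, &points[chosen]));
        }
    }
    centroids
}
```

Points that already serve as centroids have weight zero, so they are not picked again; distant points are favored, which tends to spread the initial centroids across the data.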

kmeans2 (SciPy-compatible)

use scirs2_cluster::vq::{kmeans2, MinitMethod, MissingMethod, whiten};

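// Whiten the data (rescale each feature by its standard deviation) for better clustering.
// `data` is the Array2<f64> built in the K-means example above.
let whitened_data = whiten(&data).unwrap();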

// Run kmeans2 with different initialization methods
let (centroids, labels) = kmeans2(
    whitened_data.view(),
    3,                             // k clusters
    Some(10),                      // iterations
    Some(1e-4),                    // threshold
    Some(MinitMethod::PlusPlus),   // K-means++ initialization
    Some(MissingMethod::Warn),     // warn on empty clusters
    Some(true),                    // check finite values
    Some(42),                      // random seed
).unwrap();
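For readers unfamiliar with whitening in the SciPy sense: `scipy.cluster.vq.whiten` divides each feature (column) by its standard deviation across all observations, so that every feature has unit variance. The sketch below reproduces that idea with plain vectors. `whiten_sketch` is a hypothetical name, and it is assumed, not verified, that the crate's `whiten` behaves the same way.

```rust
/// Divide every column by its population standard deviation.
/// Columns with zero spread are left unchanged to avoid dividing by zero.
fn whiten_sketch(rows: &[Vec<f64>]) -> Vec<Vec<f64>> {
    if rows.is_empty() {
        return Vec::new();
    }
    let n = rows.len() as f64;
    let dims = rows[0].len();
    // Per-column scale factor; 1.0 means "leave as is".
    let mut scale = vec![1.0; dims];
    for j in 0..dims {
        let mean = rows.iter().map(|r| r[j]).sum::<f64>() / n;
        let var = rows.iter().map(|r| (r[j] - mean).powi(2)).sum::<f64>() / n;
        if var > 0.0 {
            scale[j] = var.sqrt();
        }
    }
    rows.iter()
        .map(|r| r.iter().zip(&scale).map(|(x, s)| x / s).collect())
        .collect()
}
```

Whitening matters for K-means because Euclidean distance otherwise lets features with large numeric ranges dominate the clustering.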

Mini-batch K-means

use scirs2_cluster::vq::{minibatch_kmeans, MiniBatchKMeansOptions};

// Configure mini-batch K-means
let options = MiniBatchKMeansOptions {
    batch_size: 1024,
    max_iter: 100,
    ..Default::default()
};

// Run clustering on a large dataset.
// `large_data` is assumed to be an Array2<f64> with one observation per row.
let (centroids, labels) = minibatch_kmeans(large_data.view(), 5, Some(options)).unwrap();
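The reason mini-batch K-means scales to large datasets is that each step touches only `batch_size` points and nudges centroids with a per-centroid learning rate of 1/count, the update described by Sculley (2010). The following sketch shows one such step on plain vectors. The names `minibatch_step`, `nearest`, and `sq_dist` are hypothetical, and the crate's internals may differ.

```rust
/// Squared Euclidean distance between two points of equal dimension.
fn sq_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Index of the centroid closest to `point`.
fn nearest(point: &[f64], centroids: &[Vec<f64>]) -> usize {
    let mut best = 0;
    let mut best_d = f64::INFINITY;
    for (i, c) in centroids.iter().enumerate() {
        let d = sq_dist(point, c);
        if d < best_d {
            best_d = d;
            best = i;
        }
    }
    best
}

/// One mini-batch update. `counts[c]` tracks how many points centroid `c`
/// has absorbed so far across all batches.
fn minibatch_step(centroids: &mut [Vec<f64>], counts: &mut [usize], batch: &[Vec<f64>]) {
    // Assign the whole batch using the centroids as they are at the start.
    let assigned: Vec<usize> = {
        let current: &[Vec<f64>] = centroids;
        batch.iter().map(|p| nearest(p, current)).collect()
    };
    for (p, &c) in batch.iter().zip(&assigned) {
        counts[c] += 1;
        // Per-centroid learning rate shrinks as the centroid sees more points.
        let eta = 1.0 / counts[c] as f64;
        for (cj, pj) in centroids[c].iter_mut().zip(p) {
            *cj = (1.0 - eta) * *cj + eta * pj;
        }
    }
}
```

With this learning rate each centroid is the running mean of the points assigned to it, so early batches move centroids quickly and later ones refine them.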

Hierarchical Clustering Example

use ndarray::Array2;
use scirs2_cluster::hierarchy::{linkage, fcluster, LinkageMethod};

// Create a dataset
let data = Array2::from_shape_vec((6, 2), vec![
    1.0, 2.0,
    1.2, 1.8,
    0.8, 1.9,
    3.7, 4.2,
    3.9, 3.9,
    4.2, 4.1,
]).unwrap();

// Calculate linkage matrix using Ward's method
let linkage_matrix = linkage(data.view(), LinkageMethod::Ward, None).unwrap();

// Form flat clusters by cutting the dendrogram
let num_clusters = 2;
let labels = fcluster(&linkage_matrix, num_clusters, None).unwrap();

println!("Cluster assignments: {:?}", labels);
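Ward's method, used in the example above, merges at each step the pair of clusters whose union increases the total within-cluster sum of squares the least. That increase has a closed form: |A||B| / (|A| + |B|) times the squared distance between the two centroids. The sketch below computes it directly; `centroid` and `ward_merge_cost` are hypothetical helper names, and the crate's linkage routine is not assumed to be implemented this way.

```rust
/// Mean of a non-empty set of points.
fn centroid(cluster: &[Vec<f64>]) -> Vec<f64> {
    let n = cluster.len() as f64;
    let mut c = vec![0.0; cluster[0].len()];
    for p in cluster {
        for (cj, pj) in c.iter_mut().zip(p) {
            *cj += pj / n;
        }
    }
    c
}

/// Increase in total within-cluster sum of squares caused by merging
/// the non-empty clusters `a` and `b`.
fn ward_merge_cost(a: &[Vec<f64>], b: &[Vec<f64>]) -> f64 {
    assert!(!a.is_empty() && !b.is_empty(), "clusters must be non-empty");
    let (na, nb) = (a.len() as f64, b.len() as f64);
    let (ca, cb) = (centroid(a), centroid(b));
    // Squared Euclidean distance between the two centroids.
    let gap: f64 = ca.iter().zip(&cb).map(|(x, y)| (x - y).powi(2)).sum();
    na * nb / (na + nb) * gap
}
```

For A = {(0,0), (2,0)} and B = {(5,0)} the centroids are 4 apart, so the cost is (2·1/3)·16 ≈ 10.67, which matches the difference between the merged cluster's sum of squares (≈ 12.67) and that of A alone (2).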

Evaluation Metrics

use scirs2_cluster::metrics::{silhouette_score, davies_bouldin_score, calinski_harabasz_score};

// Evaluate clustering quality.
// `data` and `labels` come from one of the clustering examples above.
let silhouette = silhouette_score(data.view(), labels.view()).unwrap();
let db_score = davies_bouldin_score(data.view(), labels.view()).unwrap();
let ch_score = calinski_harabasz_score(data.view(), labels.view()).unwrap();

println!("Silhouette score: {}", silhouette);
println!("Davies-Bouldin score: {}", db_score);
println!("Calinski-Harabasz score: {}", ch_score);
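To show what the silhouette score measures, here is a small standard-library sketch of the textbook definition. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the point's score is (b - a) / max(a, b); the overall score is the mean over all points. `silhouette_sketch` is a hypothetical name, labels are assumed to be 0-based cluster indices, and scoring singleton clusters as 0 follows the common convention rather than anything confirmed about the crate.

```rust
/// Euclidean distance between two points of equal dimension.
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Mean silhouette coefficient over all points (O(n^2) distance evaluations).
fn silhouette_sketch(points: &[Vec<f64>], labels: &[usize]) -> f64 {
    assert!(!points.is_empty(), "need at least one point");
    assert_eq!(points.len(), labels.len(), "one label per point");
    let k = labels.iter().max().map_or(0, |m| m + 1);
    let mut total = 0.0;
    for i in 0..points.len() {
        // Distance sums and member counts per cluster, excluding point i itself.
        let mut sums = vec![0.0; k];
        let mut sizes = vec![0usize; k];
        for j in 0..points.len() {
            if i != j {
                sums[labels[j]] += euclidean(&points[i], &points[j]);
                sizes[labels[j]] += 1;
            }
        }
        let own = labels[i];
        if sizes[own] == 0 {
            continue; // singleton cluster contributes 0
        }
        let a = sums[own] / sizes[own] as f64;
        let b = (0..k)
            .filter(|&c| c != own && sizes[c] > 0)
            .map(|c| sums[c] / sizes[c] as f64)
            .fold(f64::INFINITY, f64::min);
        let denom = a.max(b);
        if b.is_finite() && denom > 0.0 {
            total += (b - a) / denom;
        }
    }
    total / points.len() as f64
}
```

Scores near 1 indicate compact, well-separated clusters; scores near 0 or below suggest overlapping clusters or misassigned points.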

DBSCAN Example

use ndarray::Array2;
use scirs2_cluster::density::{dbscan, labels};

// Create a dataset with clusters and noise
let data = Array2::from_shape_vec((8, 2), vec![
    1.0, 2.0,   // Cluster 1
    1.5, 1.8,   // Cluster 1
    1.3, 1.9,   // Cluster 1
    5.0, 7.0,   // Cluster 2
    5.1, 6.8,   // Cluster 2
    5.2, 7.1,   // Cluster 2
    0.0, 10.0,  // Noise
    10.0, 0.0,  // Noise
]).unwrap();

// Run DBSCAN with eps=0.8 and min_samples=2
let cluster_labels = dbscan(data.view(), 0.8, 2, None).unwrap();

// Count noise points
let noise_count = cluster_labels.iter().filter(|&&label| label == labels::NOISE).count();

println!("Cluster assignments: {:?}", cluster_labels);
println!("Number of noise points: {}", noise_count);
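To make the roles of `eps` and `min_samples` concrete, here is a compact standard-library sketch of the classic DBSCAN procedure. It is not the crate's implementation. The names `dbscan_sketch` and `region_query`, the brute-force neighbor search, and the convention that a point counts as its own neighbor (as in scikit-learn) are all assumptions made for illustration.

```rust
const NOISE: i32 = -1;
const UNVISITED: i32 = -2;

/// Euclidean distance between two points of equal dimension.
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Indices of all points within `eps` of point `i` (including `i` itself).
fn region_query(points: &[Vec<f64>], i: usize, eps: f64) -> Vec<usize> {
    (0..points.len())
        .filter(|&j| euclidean(&points[i], &points[j]) <= eps)
        .collect()
}

/// Label each point with a cluster id (0, 1, ...) or NOISE (-1).
fn dbscan_sketch(points: &[Vec<f64>], eps: f64, min_samples: usize) -> Vec<i32> {
    let mut labels = vec![UNVISITED; points.len()];
    let mut cluster = 0;
    for i in 0..points.len() {
        if labels[i] != UNVISITED {
            continue;
        }
        let neighbors = region_query(points, i, eps);
        if neighbors.len() < min_samples {
            labels[i] = NOISE; // may later be claimed as a border point
            continue;
        }
        // Point i is a core point: grow a new cluster from it.
        labels[i] = cluster;
        let mut queue = neighbors;
        while let Some(j) = queue.pop() {
            if labels[j] == NOISE {
                labels[j] = cluster; // border point reached from a core point
            }
            if labels[j] != UNVISITED {
                continue;
            }
            labels[j] = cluster;
            let reachable = region_query(points, j, eps);
            if reachable.len() >= min_samples {
                queue.extend(reachable); // j is also a core point: keep expanding
            }
        }
        cluster += 1;
    }
    labels
}
```

On the eight points from the example above, with eps = 0.8 and min_samples = 2, this sketch labels the first three points 0, the next three 1, and the last two -1 (noise).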

Documentation

Algorithm Comparison Guide - Comprehensive guide to choosing the right clustering algorithm for your use case

Key Enhancements

Production-Ready SciPy Compatibility

- Complete API compatibility with SciPy's cluster module
- Drop-in replacement for most SciPy clustering functions
- Identical parameter names and behavior for seamless migration
- Compatible return value formats with proper error handling

High-Performance Computing

- SIMD acceleration with automatic fallback for unsupported hardware
- Multi-core parallelism via Rayon for CPU-intensive operations
- Memory-efficient streaming for datasets larger than available RAM
- Optimized algorithms that outperform reference implementations

Rust Ecosystem Advantages

- Memory safety without runtime overhead
- Zero-copy operations where possible for maximum efficiency
- Compile-time correctness with comprehensive type checking
- Predictable performance with no garbage collection pauses

License

This project is dual-licensed; you can choose to use either license. See the LICENSE file for details.

Contributing

Contributions are welcome! Please see the project's CONTRIBUTING.md file for guidelines.

scirs2-cluster 0.1.1