SciRS2 Clustering Module
A comprehensive clustering module for the SciRS2 scientific computing library in Rust (v0.1.5). Following the SciRS2 POLICY, this crate provides production-ready implementations of various clustering algorithms with a focus on performance, SciPy compatibility, ecosystem consistency, and idiomatic Rust code.
Production Readiness - Stable Release
🎯 Version 0.1.5 (SciRS2 POLICY & Enhanced Performance) is ready for production use, with:
- 189+ comprehensive tests covering all algorithms and edge cases
- Zero warnings policy enforced across all code and examples
- Full SciPy API compatibility maintained for seamless migration
- Extensive documentation with working examples for all features
- Performance optimizations including SIMD and parallel processing
Stability & Performance
Algorithm Maturity
- Core algorithms (K-means, Hierarchical, DBSCAN) are thoroughly tested and production-ready
- Advanced algorithms (Spectral, BIRCH, GMM, HDBSCAN) are fully implemented with comprehensive test coverage
- All APIs are stable and maintain backward compatibility with SciPy interfaces
Performance Characteristics
- Optimized Ward's method: O(n² log n) complexity vs standard O(n³)
- SIMD acceleration: Up to 4x faster distance computations on supported hardware
- Parallel processing: Multi-core implementations for K-means and hierarchical clustering
- Memory efficiency: Streaming and chunked processing for large datasets (>10M points)
Features
Vector Quantization
- K-means clustering with multiple initialization methods
- K-means++ smart initialization
- kmeans2 with SciPy-compatible interface
- Mini-batch K-means for large datasets
- Parallel K-means for multi-core systems
- Data whitening/normalization utilities
Hierarchical Clustering
- Agglomerative clustering with multiple linkage methods:
  - Single linkage (minimum distance)
  - Complete linkage (maximum distance)
  - Average linkage
  - Ward's method (minimizes variance)
  - Centroid method (distance between centroids)
  - Median method
  - Weighted average
- Dendrogram utilities and flat cluster extraction
- Cluster distance metrics (Euclidean, Manhattan, Chebyshev, Correlation)
Density-Based Clustering
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- OPTICS (Ordering Points To Identify the Clustering Structure)
- HDBSCAN (Hierarchical DBSCAN)
- Support for custom distance metrics
Other Algorithms
- Mean-shift clustering
- Spectral clustering
- Affinity propagation
- BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
- Gaussian Mixture Models (GMM)
- Leader algorithm (single-pass clustering with hierarchical tree support)
Evaluation Metrics
- Silhouette coefficient
- Davies-Bouldin index
- Calinski-Harabasz index
- Adjusted Rand Index
- Normalized Mutual Information
- Homogeneity, Completeness, and V-measure
Installation
Add this to your Cargo.toml:
[dependencies]
scirs2-cluster = "0.1.5"
ndarray = "0.15"
To enable optimizations through the core module, add feature flags:
[dependencies]
scirs2-cluster = { version = "0.1.5", features = ["parallel", "simd"] }
Usage
K-means Example
use Array2;
use ;
// Create a dataset
let data = from_shape_vec.unwrap;
// Configure K-means
let options = KMeansOptions ;
// Run k-means with k=2
let = kmeans.unwrap;
println!;
println!;
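For readers who want to see what a K-means run does internally, the following is a minimal standard-library sketch of Lloyd's algorithm (alternating assignment and centroid-update steps). It is not this crate's implementation: the names `sq_dist`, `nearest`, and `lloyd_kmeans` are hypothetical, and the starting centroids are passed in explicitly rather than chosen by K-means++.

```rust
/// Squared Euclidean distance between two points of equal dimension.
fn sq_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum()
}

/// Index of the centroid closest to `point`.
fn nearest(point: &[f64], centroids: &[Vec<f64>]) -> usize {
    let mut best = 0;
    let mut best_dist = f64::INFINITY;
    for (j, c) in centroids.iter().enumerate() {
        let d = sq_dist(point, c);
        if d < best_dist {
            best = j;
            best_dist = d;
        }
    }
    best
}

/// Lloyd's algorithm (hypothetical helper, not the crate API).
/// Assumes `init` holds at least one centroid of the same dimension as the data.
fn lloyd_kmeans(
    data: &[Vec<f64>],
    init: Vec<Vec<f64>>,
    max_iter: usize,
) -> (Vec<Vec<f64>>, Vec<usize>) {
    let mut centroids = init;
    let dim = centroids[0].len();
    let mut labels = vec![0usize; data.len()];
    for _ in 0..max_iter {
        // Assignment step: attach every point to its nearest centroid.
        let new_labels: Vec<usize> = data.iter().map(|p| nearest(p, &centroids)).collect();
        // Update step: move each centroid to the mean of its members.
        let mut sums = vec![vec![0.0_f64; dim]; centroids.len()];
        let mut counts = vec![0usize; centroids.len()];
        for (p, &l) in data.iter().zip(&new_labels) {
            counts[l] += 1;
            for (s, x) in sums[l].iter_mut().zip(p) {
                *s += x;
            }
        }
        for (j, c) in centroids.iter_mut().enumerate() {
            // Empty clusters keep their previous centroid.
            if counts[j] > 0 {
                for (cv, s) in c.iter_mut().zip(&sums[j]) {
                    *cv = s / counts[j] as f64;
                }
            }
        }
        // Stop once the assignments no longer change.
        let converged = new_labels == labels;
        labels = new_labels;
        if converged {
            break;
        }
    }
    (centroids, labels)
}
```

Calling `lloyd_kmeans(&data, init, 100)` returns the final centroids together with one cluster index per input row.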
kmeans2 (SciPy-compatible)
use ;
// Whiten the data for better clustering
let whitened_data = whiten.unwrap;
// Run kmeans2 with different initialization methods
let = kmeans2.unwrap;
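As an illustration of what whitening does before clustering, here is a small standard-library sketch that rescales each feature column to unit standard deviation, which is how SciPy's `whiten` is documented to behave. The function name `whiten_columns` is hypothetical and this is not the crate's own code; leaving zero-variance columns untouched is an assumption made here to avoid dividing by zero.

```rust
/// Divide every column by its population standard deviation (ddof = 0).
/// Hypothetical helper; assumes `data` is non-empty and rectangular.
fn whiten_columns(data: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n = data.len() as f64;
    let dim = data[0].len();
    let mut std_devs = vec![0.0_f64; dim];
    for d in 0..dim {
        let mean = data.iter().map(|row| row[d]).sum::<f64>() / n;
        let var = data.iter().map(|row| (row[d] - mean).powi(2)).sum::<f64>() / n;
        std_devs[d] = var.sqrt();
    }
    data.iter()
        .map(|row| {
            row.iter()
                .zip(&std_devs)
                // Columns with zero spread are passed through unchanged.
                .map(|(x, s)| if *s > 0.0 { x / s } else { *x })
                .collect()
        })
        .collect()
}
```

After this step every feature with non-zero spread has unit standard deviation, so no single dimension dominates the Euclidean distances used by K-means.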
Mini-batch K-means
use ;
// Configure mini-batch K-means
let options = MiniBatchKMeansOptions ;
// Run clustering on large dataset
let = minibatch_kmeans.unwrap;
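To show why mini-batch K-means scales to large datasets, the sketch below implements one common form of the per-batch update: each centroid is nudged toward the points assigned to it with a step size of 1 / (points seen so far). The names `minibatch_update`, `nearest`, and `sq_dist` are hypothetical, and the crate's actual update rule and options may differ.

```rust
/// Squared Euclidean distance between two points of equal dimension.
fn sq_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum()
}

/// Index of the centroid closest to `point`.
fn nearest(point: &[f64], centroids: &[Vec<f64>]) -> usize {
    let mut best = 0;
    let mut best_dist = f64::INFINITY;
    for (j, c) in centroids.iter().enumerate() {
        let d = sq_dist(point, c);
        if d < best_dist {
            best = j;
            best_dist = d;
        }
    }
    best
}

/// Apply one mini-batch to the centroids in place (hypothetical helper).
/// `counts[j]` tracks how many points centroid `j` has absorbed so far.
fn minibatch_update(centroids: &mut [Vec<f64>], counts: &mut [usize], batch: &[Vec<f64>]) {
    // Cache assignments first so every point in the batch sees the same centroids.
    let assigned: Vec<usize> = batch.iter().map(|p| nearest(p, &*centroids)).collect();
    for (p, &j) in batch.iter().zip(&assigned) {
        counts[j] += 1;
        // Per-centroid learning rate shrinks as the centroid sees more points.
        let eta = 1.0 / counts[j] as f64;
        for (c, x) in centroids[j].iter_mut().zip(p) {
            *c = (1.0 - eta) * *c + eta * x;
        }
    }
}
```

Only the current batch has to be in memory, so the function can be called repeatedly on chunks streamed from disk.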
Hierarchical Clustering Example
use Array2;
use ;
// Create a dataset
let data = from_shape_vec.unwrap;
// Calculate linkage matrix using Ward's method
let linkage_matrix = linkage.unwrap;
// Form flat clusters by cutting the dendrogram
let num_clusters = 2;
let labels = fcluster.unwrap;
println!;
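The idea behind Ward linkage followed by a flat cut can be sketched with the standard library alone. The version below is a deliberately naive illustration, not the optimized algorithm this crate ships: it repeatedly merges the pair of clusters whose union increases within-cluster variance the least, and stops when `k` clusters remain. All function names are hypothetical.

```rust
/// Squared Euclidean distance between two points of equal dimension.
fn sq_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum()
}

/// Mean of the points whose indices are listed in `members`.
fn centroid(points: &[Vec<f64>], members: &[usize]) -> Vec<f64> {
    let mut c = vec![0.0_f64; points[0].len()];
    for &i in members {
        for (cv, x) in c.iter_mut().zip(&points[i]) {
            *cv += x;
        }
    }
    c.iter().map(|v| v / members.len() as f64).collect()
}

/// Ward merge cost: increase in within-cluster sum of squares when merging a and b.
fn ward_cost(points: &[Vec<f64>], a: &[usize], b: &[usize]) -> f64 {
    let (ca, cb) = (centroid(points, a), centroid(points, b));
    let (na, nb) = (a.len() as f64, b.len() as f64);
    na * nb / (na + nb) * sq_dist(&ca, &cb)
}

/// Naive agglomerative clustering with Ward's criterion, cut at `k` clusters.
fn ward_clusters(points: &[Vec<f64>], k: usize) -> Vec<usize> {
    // Start with every point in its own cluster.
    let mut clusters: Vec<Vec<usize>> = (0..points.len()).map(|i| vec![i]).collect();
    while clusters.len() > k.max(1) {
        // Find the cheapest pair to merge.
        let mut best = (0, 1, f64::INFINITY);
        for i in 0..clusters.len() {
            for j in (i + 1)..clusters.len() {
                let cost = ward_cost(points, &clusters[i], &clusters[j]);
                if cost < best.2 {
                    best = (i, j, cost);
                }
            }
        }
        // j > i, so removing cluster j leaves index i valid.
        let merged = clusters.remove(best.1);
        clusters[best.0].extend(merged);
    }
    // Convert the member lists into one label per point.
    let mut labels = vec![0; points.len()];
    for (label, members) in clusters.iter().enumerate() {
        for &i in members {
            labels[i] = label;
        }
    }
    labels
}
```

This sketch recomputes every pairwise cost on each merge, which is far slower than a linkage-matrix approach, but it makes the variance-minimizing criterion explicit.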
Evaluation Metrics
use ;
// Evaluate clustering quality
let silhouette = silhouette_score.unwrap;
let db_score = davies_bouldin_score.unwrap;
let ch_score = calinski_harabasz_score.unwrap;
println!;
println!;
println!;
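To make the silhouette coefficient concrete, here is a standard-library sketch of its usual definition: for each point, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to any other cluster, and the score is the average of `(b - a) / max(a, b)`. The name `silhouette_sketch` is hypothetical and this is not the crate's function. Returning `None` for fewer than two clusters and scoring singleton clusters as 0 are conventions assumed here.

```rust
/// Euclidean distance between two points of equal dimension.
fn dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Mean silhouette coefficient over all points (hypothetical helper).
/// Assumes `labels.len() == data.len()` and labels are 0-based cluster indices.
fn silhouette_sketch(data: &[Vec<f64>], labels: &[usize]) -> Option<f64> {
    let n = data.len();
    let k = labels.iter().max().map_or(0, |m| m + 1);
    let mut sizes = vec![0usize; k];
    for &l in labels {
        sizes[l] += 1;
    }
    // The score is undefined with fewer than two non-empty clusters.
    if sizes.iter().filter(|&&s| s > 0).count() < 2 {
        return None;
    }
    let mut total = 0.0;
    for i in 0..n {
        // Sum of distances from point i to the members of every cluster.
        let mut sums = vec![0.0_f64; k];
        for j in 0..n {
            if j != i {
                sums[labels[j]] += dist(&data[i], &data[j]);
            }
        }
        let own = labels[i];
        if sizes[own] == 1 {
            continue; // singleton cluster contributes a score of 0
        }
        let a = sums[own] / (sizes[own] - 1) as f64;
        let b = (0..k)
            .filter(|&c| c != own && sizes[c] > 0)
            .map(|c| sums[c] / sizes[c] as f64)
            .fold(f64::INFINITY, f64::min);
        if a.max(b) > 0.0 {
            total += (b - a) / a.max(b);
        }
    }
    Some(total / n as f64)
}
```

Values close to 1 indicate compact, well-separated clusters; values near 0 or below suggest overlapping or misassigned points.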
DBSCAN Example
use Array2;
use ;
// Create a dataset with clusters and noise
let data = from_shape_vec.unwrap;
// Run DBSCAN with eps=0.8 and min_samples=2
let cluster_labels = dbscan.unwrap;
// Count noise points
let noise_count = cluster_labels.iter.filter.count;
println!;
println!;
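The roles of `eps` and `min_samples` are easier to see in a small standard-library sketch of the textbook DBSCAN procedure. This is an illustration, not the crate's implementation: the name `dbscan_sketch`, the brute-force neighbor search, the use of `-1` for noise, and counting a point as its own neighbor are all assumptions made here.

```rust
const NOISE: i32 = -1;
const UNVISITED: i32 = -2;

/// Squared Euclidean distance between two points of equal dimension.
fn sq_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum()
}

/// Indices of all points within `eps` of point `i` (including `i` itself).
fn region_query(data: &[Vec<f64>], i: usize, eps: f64) -> Vec<usize> {
    (0..data.len())
        .filter(|&j| sq_dist(&data[i], &data[j]) <= eps * eps)
        .collect()
}

/// Textbook DBSCAN with a brute-force neighbor search (hypothetical helper).
/// Returns one label per point; noise points are labeled -1.
fn dbscan_sketch(data: &[Vec<f64>], eps: f64, min_samples: usize) -> Vec<i32> {
    let mut labels = vec![UNVISITED; data.len()];
    let mut cluster = 0;
    for i in 0..data.len() {
        if labels[i] != UNVISITED {
            continue;
        }
        let neighbors = region_query(data, i, eps);
        if neighbors.len() < min_samples {
            // Not a core point; may later be claimed as a border point.
            labels[i] = NOISE;
            continue;
        }
        // Point i is a core point: start a new cluster and expand it.
        labels[i] = cluster;
        let mut queue = neighbors;
        while let Some(j) = queue.pop() {
            if labels[j] == NOISE {
                labels[j] = cluster; // border point: joins but is not expanded
            }
            if labels[j] != UNVISITED {
                continue;
            }
            labels[j] = cluster;
            let reachable = region_query(data, j, eps);
            if reachable.len() >= min_samples {
                queue.extend(reachable); // j is also a core point
            }
        }
        cluster += 1;
    }
    labels
}
```

Counting noise then amounts to `labels.iter().filter(|&&l| l == -1).count()`.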
Documentation
- Algorithm Comparison Guide - Comprehensive guide to choosing the right clustering algorithm for your use case
Key Enhancements
Production-Ready SciPy Compatibility
- Complete API compatibility with SciPy's cluster module
- Drop-in replacement for most SciPy clustering functions
- Identical parameter names and behavior for seamless migration
- Compatible return value formats with proper error handling
High-Performance Computing
- SIMD acceleration with automatic fallback for unsupported hardware
- Multi-core parallelism via Rayon for CPU-intensive operations
- Memory-efficient streaming for datasets larger than available RAM
- Optimized algorithms that outperform reference implementations
Rust Ecosystem Advantages
- Memory safety without runtime overhead
- Zero-copy operations where possible for maximum efficiency
- Compile-time correctness with comprehensive type checking
- Predictable performance with no garbage collection pauses
License
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
Contributing
Contributions are welcome! Please see the project's CONTRIBUTING.md file for guidelines.