SciRS2 Clustering Module
A comprehensive clustering module for the SciRS2 scientific computing library in Rust (v0.1.0). Following the SciRS2 POLICY, this crate provides production-ready implementations of various clustering algorithms with a focus on performance, SciPy compatibility, ecosystem consistency, and idiomatic Rust code.
Production Readiness - Stable Release
🎯 Version 0.1.0 (SciRS2 POLICY & Enhanced Performance) is ready for production use with:
- 189+ comprehensive tests covering all algorithms and edge cases
- Zero warnings policy enforced across all code and examples
- Full SciPy API compatibility maintained for seamless migration
- Extensive documentation with working examples for all features
- Performance optimizations including SIMD and parallel processing
Stability & Performance
Algorithm Maturity
- Core algorithms (K-means, Hierarchical, DBSCAN) are thoroughly tested and production-ready
- Advanced algorithms (Spectral, BIRCH, GMM, HDBSCAN) are fully implemented with comprehensive test coverage
- All APIs are stable and maintain backward compatibility with SciPy interfaces
Performance Characteristics
- Optimized Ward's method: O(n² log n) complexity, versus O(n³) for the naive algorithm
- SIMD acceleration: Up to 4x faster distance computations on supported hardware
- Parallel processing: Multi-core implementations for K-means and hierarchical clustering
- Memory efficiency: Streaming and chunked processing for large datasets (>10M points)
Features
Vector Quantization
- K-means clustering with multiple initialization methods
- K-means++ smart initialization
- kmeans2 with SciPy-compatible interface
- Mini-batch K-means for large datasets
- Parallel K-means for multi-core systems
- Data whitening/normalization utilities
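To make the vector quantization entries above more concrete, here is a minimal, standard-library-only sketch of Lloyd's iteration, the assign-then-update loop at the heart of K-means. It is not this crate's implementation or API: the names `squared_distance` and `lloyd_kmeans` are hypothetical, and the initial centroids are supplied by the caller instead of being chosen by K-means++.

```rust
/// Squared Euclidean distance between two points of equal dimension.
fn squared_distance(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum()
}

/// Hypothetical helper (not the crate's API): runs Lloyd's iteration from
/// caller-supplied initial centroids and returns (centroids, labels).
fn lloyd_kmeans(
    data: &[Vec<f64>],
    initial: &[Vec<f64>],
    max_iter: usize,
) -> (Vec<Vec<f64>>, Vec<usize>) {
    assert!(!initial.is_empty(), "at least one centroid is required");
    let mut centroids = initial.to_vec();
    // usize::MAX marks "not yet assigned", so the first pass always counts as a change.
    let mut labels = vec![usize::MAX; data.len()];

    for _ in 0..max_iter {
        // Assignment step: attach every point to its nearest centroid.
        let mut changed = false;
        for (i, point) in data.iter().enumerate() {
            let mut best = 0;
            let mut best_dist = f64::INFINITY;
            for (c, centroid) in centroids.iter().enumerate() {
                let d = squared_distance(point, centroid);
                if d < best_dist {
                    best_dist = d;
                    best = c;
                }
            }
            if labels[i] != best {
                labels[i] = best;
                changed = true;
            }
        }
        if !changed {
            break; // assignments are stable, so the centroids are too
        }

        // Update step: move each centroid to the mean of its assigned points.
        let dims = centroids[0].len();
        let mut sums = vec![vec![0.0; dims]; centroids.len()];
        let mut counts = vec![0usize; centroids.len()];
        for (point, &label) in data.iter().zip(&labels) {
            counts[label] += 1;
            for (s, x) in sums[label].iter_mut().zip(point) {
                *s += x;
            }
        }
        for (c, centroid) in centroids.iter_mut().enumerate() {
            if counts[c] > 0 {
                // Empty clusters keep their previous centroid.
                for (v, s) in centroid.iter_mut().zip(&sums[c]) {
                    *v = s / counts[c] as f64;
                }
            }
        }
    }
    (centroids, labels)
}
```

Mini-batch and parallel variants keep the same two steps; they differ in how many points each update sees and in how the assignment step is distributed across cores.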
Hierarchical Clustering
- Agglomerative clustering with multiple linkage methods:
  - Single linkage (minimum distance)
  - Complete linkage (maximum distance)
  - Average linkage
  - Ward's method (minimizes variance)
  - Centroid method (distance between centroids)
  - Median method
  - Weighted average
- Dendrogram utilities and flat cluster extraction
- Cluster distance metrics (Euclidean, Manhattan, Chebyshev, Correlation)
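The linkage methods and distance metrics listed above differ only in how they reduce the pairwise distances between two clusters to a single number. The sketch below illustrates three of the metrics and the single, complete, and average linkage rules using only the standard library. The `Linkage` enum and the `cluster_distance` function are hypothetical names for illustration, not the crate's types.

```rust
/// Euclidean (L2) distance.
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Manhattan (L1) distance.
fn manhattan(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
}

/// Chebyshev (L-infinity) distance: the largest coordinate difference.
fn chebyshev(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f64::max)
}

/// Hypothetical enum for this sketch; not the crate's linkage type.
enum Linkage {
    Single,
    Complete,
    Average,
}

/// Distance between two non-empty clusters under the given linkage rule.
fn cluster_distance(
    a: &[Vec<f64>],
    b: &[Vec<f64>],
    linkage: Linkage,
    metric: fn(&[f64], &[f64]) -> f64,
) -> f64 {
    // Collect every pairwise distance between a point in `a` and a point in `b`.
    let mut pairwise = Vec::with_capacity(a.len() * b.len());
    for p in a {
        for q in b {
            pairwise.push(metric(p, q));
        }
    }
    match linkage {
        // Single linkage: the closest pair decides.
        Linkage::Single => pairwise.iter().copied().fold(f64::INFINITY, f64::min),
        // Complete linkage: the farthest pair decides.
        Linkage::Complete => pairwise.iter().copied().fold(f64::NEG_INFINITY, f64::max),
        // Average linkage: the mean over all pairs.
        Linkage::Average => pairwise.iter().sum::<f64>() / pairwise.len() as f64,
    }
}
```

Ward, centroid, and median linkage are usually computed from cluster summaries (sizes and centroids) instead of raw pairwise distances, so they are left out of this sketch.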
Density-Based Clustering
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- OPTICS (Ordering Points To Identify the Clustering Structure)
- HDBSCAN (Hierarchical DBSCAN)
- Support for custom distance metrics
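As a rough illustration of how density-based clustering assigns labels, the following standard-library-only sketch implements the textbook DBSCAN procedure with a brute-force neighbor search. It is not the crate's implementation: `dbscan_sketch` and `region_query` are hypothetical names, and the sketch assumes that `min_samples` counts the point itself and that noise is labeled `-1`.

```rust
const NOISE: i32 = -1;
const UNVISITED: i32 = -2;

fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Indices of all points within `eps` of point `idx` (including `idx` itself).
/// Brute force, O(n) per query; real implementations typically use a spatial index.
fn region_query(data: &[Vec<f64>], idx: usize, eps: f64) -> Vec<usize> {
    (0..data.len())
        .filter(|&j| euclidean(&data[idx], &data[j]) <= eps)
        .collect()
}

/// Hypothetical helper (not the crate's API): returns one label per point,
/// with clusters numbered from 0 and noise marked as -1.
fn dbscan_sketch(data: &[Vec<f64>], eps: f64, min_samples: usize) -> Vec<i32> {
    let mut labels = vec![UNVISITED; data.len()];
    let mut cluster = 0;

    for i in 0..data.len() {
        if labels[i] != UNVISITED {
            continue;
        }
        let neighbors = region_query(data, i, eps);
        if neighbors.len() < min_samples {
            // Not a core point; may still be claimed later as a border point.
            labels[i] = NOISE;
            continue;
        }

        // `i` is a core point: start a new cluster and expand it.
        labels[i] = cluster;
        let mut queue = neighbors;
        while let Some(j) = queue.pop() {
            if labels[j] == NOISE {
                labels[j] = cluster; // border point reached from a core point
            }
            if labels[j] != UNVISITED {
                continue;
            }
            labels[j] = cluster;
            let j_neighbors = region_query(data, j, eps);
            if j_neighbors.len() >= min_samples {
                // `j` is also a core point, so its neighborhood joins the cluster.
                queue.extend(j_neighbors);
            }
        }
        cluster += 1;
    }
    labels
}
```

Supporting a custom distance metric in a sketch like this amounts to replacing the `euclidean` call in `region_query` with a caller-supplied function.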
Other Algorithms
- Mean-shift clustering
- Spectral clustering
- Affinity propagation
- BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
- Gaussian Mixture Models (GMM)
- Leader algorithm (single-pass clustering with hierarchical tree support)
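The Leader algorithm is simple enough to sketch in a few lines. The version below is one common variant, assuming Euclidean distance and a first-match rule: each point joins the first existing leader within `threshold`, and otherwise becomes a new leader. The function name `leader_clustering` is hypothetical, and the sketch does not cover the hierarchical tree support mentioned above.

```rust
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Hypothetical helper (not the crate's API): single-pass Leader clustering.
/// Returns the leaders in creation order and one leader index per input point.
fn leader_clustering(data: &[Vec<f64>], threshold: f64) -> (Vec<Vec<f64>>, Vec<usize>) {
    let mut leaders: Vec<Vec<f64>> = Vec::new();
    let mut labels = Vec::with_capacity(data.len());

    for point in data {
        // Look for the first leader that is close enough to this point.
        let found = leaders
            .iter()
            .position(|leader| euclidean(leader, point) <= threshold);
        match found {
            Some(idx) => labels.push(idx),
            None => {
                // No leader is close enough: this point starts a new cluster.
                leaders.push(point.clone());
                labels.push(leaders.len() - 1);
            }
        }
    }
    (leaders, labels)
}
```

Because the data is scanned only once, the result depends on the order of the input points.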
Evaluation Metrics
- Silhouette coefficient
- Davies-Bouldin index
- Calinski-Harabasz index
- Adjusted Rand Index
- Normalized Mutual Information
- Homogeneity, Completeness, and V-measure
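To show what one of these metrics measures, here is a standard-library-only sketch of the mean silhouette coefficient. For each point, `a` is the mean distance to the other members of its own cluster and `b` is the smallest mean distance to any other cluster; the point's score is `(b - a) / max(a, b)`. The function name `silhouette_sketch` is hypothetical, the sketch assumes Euclidean distance and labels in `0..k`, and it is a brute-force O(n²) illustration, not the crate's implementation.

```rust
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Hypothetical helper (not the crate's API): mean silhouette coefficient.
/// Points in singleton clusters contribute a score of 0.
fn silhouette_sketch(data: &[Vec<f64>], labels: &[usize]) -> f64 {
    assert_eq!(data.len(), labels.len());
    let n = data.len();
    if n == 0 {
        return 0.0;
    }
    let k = labels.iter().copied().max().map_or(0, |m| m + 1);
    let mut total = 0.0;

    for i in 0..n {
        // Sum of distances from point i to the members of each cluster.
        let mut sums = vec![0.0; k];
        let mut counts = vec![0usize; k];
        for j in 0..n {
            if i != j {
                sums[labels[j]] += euclidean(&data[i], &data[j]);
                counts[labels[j]] += 1;
            }
        }

        let own = labels[i];
        if counts[own] == 0 {
            continue; // singleton cluster: score is 0 by convention
        }
        // a: mean intra-cluster distance; b: mean distance to the nearest other cluster.
        let a = sums[own] / counts[own] as f64;
        let b = (0..k)
            .filter(|&c| c != own && counts[c] > 0)
            .map(|c| sums[c] / counts[c] as f64)
            .fold(f64::INFINITY, f64::min);

        let denom = a.max(b);
        if b.is_finite() && denom > 0.0 {
            total += (b - a) / denom;
        }
    }
    total / n as f64
}
```

Scores close to 1 indicate compact, well-separated clusters, while scores near 0 or below suggest overlapping or misassigned points.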
Installation
Add this to your Cargo.toml:
[]
= "0.1.0"
= "0.15"
To enable optimizations through the core module, add feature flags:
[]
= { = "0.1.0", = ["parallel", "simd"] }
Usage
K-means Example
use Array2;
use ;
// Create a dataset
let data = from_shape_vec.unwrap;
// Configure K-means
let options = KMeansOptions ;
// Run k-means with k=2
let = kmeans.unwrap;
println!;
println!;
kmeans2 (SciPy-compatible)
use ;
// Whiten the data for better clustering
let whitened_data = whiten.unwrap;
// Run kmeans2 with different initialization methods
let = kmeans2.unwrap;
Mini-batch K-means
use ;
// Configure mini-batch K-means
let options = MiniBatchKMeansOptions ;
// Run clustering on large dataset
let = minibatch_kmeans.unwrap;
Hierarchical Clustering Example
use Array2;
use ;
// Create a dataset
let data = from_shape_vec.unwrap;
// Calculate linkage matrix using Ward's method
let linkage_matrix = linkage.unwrap;
// Form flat clusters by cutting the dendrogram
let num_clusters = 2;
let labels = fcluster.unwrap;
println!;
Evaluation Metrics
use ;
// Evaluate clustering quality
let silhouette = silhouette_score.unwrap;
let db_score = davies_bouldin_score.unwrap;
let ch_score = calinski_harabasz_score.unwrap;
println!;
println!;
println!;
DBSCAN Example
use Array2;
use ;
// Create a dataset with clusters and noise
let data = from_shape_vec.unwrap;
// Run DBSCAN with eps=0.8 and min_samples=2
let cluster_labels = dbscan.unwrap;
// Count noise points
let noise_count = cluster_labels.iter.filter.count;
println!;
println!;
Documentation
- Algorithm Comparison Guide - Comprehensive guide to choosing the right clustering algorithm for your use case
Key Enhancements
Production-Ready SciPy Compatibility
- Complete API compatibility with SciPy's cluster module
- Drop-in replacement for most SciPy clustering functions
- Identical parameter names and behavior for seamless migration
- Compatible return value formats with proper error handling
High-Performance Computing
- SIMD acceleration with automatic fallback for unsupported hardware
- Multi-core parallelism via Rayon for CPU-intensive operations
- Memory-efficient streaming for datasets larger than available RAM
- Optimized algorithms that outperform reference implementations
Rust Ecosystem Advantages
- Memory safety without runtime overhead
- Zero-copy operations where possible for maximum efficiency
- Compile-time correctness with comprehensive type checking
- Predictable performance with no garbage collection pauses
License
This project is dual-licensed. You can choose to use either license; see the LICENSE file for details.
Contributing
Contributions are welcome! Please see the project's CONTRIBUTING.md file for guidelines.