# torsh-cluster

Unsupervised learning and clustering algorithms for ToRSh, powered by SciRS2.

## Overview

This crate provides comprehensive clustering and unsupervised learning algorithms with a PyTorch-compatible API. It leverages scirs2-cluster for high-performance implementations while maintaining full integration with ToRSh's tensor operations and autograd system.

## Features
- Partitioning Methods: K-Means, K-Medoids, Fuzzy C-Means
- Hierarchical Clustering: Agglomerative, Divisive, BIRCH
- Density-Based Methods: DBSCAN, OPTICS, HDBSCAN
- Distribution-Based: Gaussian Mixture Models (GMM), Expectation-Maximization
- Spectral Methods: Spectral clustering, Normalized cuts
- Deep Clustering: Deep Embedded Clustering (DEC), IDEC
- Evaluation Metrics: Silhouette score, Davies-Bouldin index, Calinski-Harabasz
- Initialization Strategies: K-means++, Random, Furthest-first
- GPU Acceleration: CUDA-accelerated clustering for large datasets
## Usage

### K-Means Clustering

```rust
use torsh::prelude::*;
use torsh_cluster::prelude::*;

// Create sample data
let data = tensor!([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]]);

// Initialize K-Means with 2 clusters
let kmeans = KMeans::new(2)
    .max_iter(300)
    .tolerance(1e-4)
    .init_method(InitMethod::KMeansPlusPlus)
    .random_state(42);

// Fit the model
let result = kmeans.fit(&data)?;

// Get cluster assignments
let labels = result.labels();
println!("Labels: {:?}", labels);

// Get cluster centers
let centers = result.centers();
println!("Centers: {:?}", centers);

// Predict cluster for new data
let new_point = tensor!([[0.0, 0.0]]);
let predicted_cluster = kmeans.predict(&new_point)?;
```
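Under the hood, K-Means alternates between assigning every point to its nearest center and moving each center to the mean of its assigned points (Lloyd's algorithm). A dependency-free sketch of one such iteration, using plain `Vec`s instead of ToRSh tensors (the function names here are illustrative, not part of this crate's API):

```rust
fn squared_dist(a: [f64; 2], b: [f64; 2]) -> f64 {
    (a[0] - b[0]).powi(2) + (a[1] - b[1]).powi(2)
}

/// One Lloyd iteration: assign each point to its nearest center,
/// then move each center to the mean of its assigned points.
fn lloyd_step(points: &[[f64; 2]], centers: &mut Vec<[f64; 2]>) -> Vec<usize> {
    // Assignment step: index of the nearest center for every point
    let labels: Vec<usize> = points
        .iter()
        .map(|p| {
            (0..centers.len())
                .min_by(|&i, &j| {
                    squared_dist(*p, centers[i])
                        .partial_cmp(&squared_dist(*p, centers[j]))
                        .unwrap()
                })
                .unwrap()
        })
        .collect();

    // Update step: mean of assigned points (keep the old center if a cluster is empty)
    for k in 0..centers.len() {
        let members: Vec<[f64; 2]> = points
            .iter()
            .zip(&labels)
            .filter(|(_, &l)| l == k)
            .map(|(p, _)| *p)
            .collect();
        if !members.is_empty() {
            let n = members.len() as f64;
            centers[k] = [
                members.iter().map(|m| m[0]).sum::<f64>() / n,
                members.iter().map(|m| m[1]).sum::<f64>() / n,
            ];
        }
    }
    labels
}

fn main() {
    let points = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]];
    let mut centers = vec![[1.0, 2.0], [5.0, 8.0]];
    for _ in 0..10 {
        lloyd_step(&points, &mut centers);
    }
    let labels = lloyd_step(&points, &mut centers);
    println!("labels = {:?}, centers = {:?}", labels, centers);
}
```

Iterating `lloyd_step` to convergence (centers stop moving) is exactly what a `fit` call performs, with the tolerance controlling when center movement is considered negligible.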
### DBSCAN - Density-Based Clustering

```rust
use torsh_cluster::prelude::*;

// Load a dataset (path is illustrative)
let data = load_dataset("data/points.csv")?;

// Initialize DBSCAN (eps=0.5, min_samples=5)
let dbscan = DBSCAN::new(0.5, 5)
    .metric(Metric::Euclidean);

// Fit and predict
let labels = dbscan.fit_predict(&data)?;

// Points labeled -1 are considered noise
let noise_points = labels.iter().filter(|&&l| l == -1).count();
println!("Noise points: {}", noise_points);

// Get core samples
let core_samples = dbscan.core_sample_indices()?;
```
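The core-sample notion above is simple to state: a point is a core sample when at least `min_samples` points (itself included) lie within distance `eps` of it. A dependency-free sketch of that test (helper names are illustrative):

```rust
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Indices of all points within `eps` of `points[i]` (including i itself).
fn region_query(points: &[Vec<f64>], i: usize, eps: f64) -> Vec<usize> {
    (0..points.len())
        .filter(|&j| euclidean(&points[i], &points[j]) <= eps)
        .collect()
}

/// A point is a core sample when its eps-neighborhood holds >= min_samples points.
fn core_sample_indices(points: &[Vec<f64>], eps: f64, min_samples: usize) -> Vec<usize> {
    (0..points.len())
        .filter(|&i| region_query(points, i, eps).len() >= min_samples)
        .collect()
}

fn main() {
    let points = vec![
        vec![0.0, 0.0], vec![0.3, 0.0], vec![0.0, 0.3], // dense corner: core samples
        vec![10.0, 10.0],                               // isolated: noise candidate
    ];
    println!("{:?}", core_sample_indices(&points, 0.5, 3));
}
```

DBSCAN then grows clusters outward from core samples; points reachable only from non-core samples become border points, and everything else is labeled noise (-1).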
### Gaussian Mixture Models (GMM)

```rust
use torsh_cluster::prelude::*;

let data = generate_blobs(1000, 3, 5)?; // 1000 samples, 3 features, 5 clusters

// Initialize GMM with 5 components
let mut gmm = GaussianMixture::new(5)
    .covariance_type(CovarianceType::Full)
    .max_iter(100)
    .n_init(10)
    .init_params(InitParams::KMeans);

// Fit the model
gmm.fit(&data)?;

// Predict cluster probabilities
let probabilities = gmm.predict_proba(&data)?;
println!("Probabilities: {:?}", probabilities);

// Get model parameters
let means = gmm.means();
let covariances = gmm.covariances();
let weights = gmm.weights();

// Compute BIC and AIC for model selection
let bic = gmm.bic(&data)?;
let aic = gmm.aic(&data)?;
```
### Hierarchical Clustering

```rust
use torsh_cluster::prelude::*;

let data = tensor!([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0]]);

// Agglomerative clustering into 2 clusters
let hierarchical = AgglomerativeClustering::new(2)
    .linkage(Linkage::Ward)
    .affinity(Affinity::Euclidean);

let labels = hierarchical.fit_predict(&data)?;

// Get dendrogram
let dendrogram = hierarchical.dendrogram();

// Compute cophenetic correlation
let cophenetic_corr = hierarchical.cophenetic_correlation(&data)?;
```
### Spectral Clustering

```rust
use torsh_cluster::prelude::*;

let data = make_moons(200, 0.05)?; // Non-convex clusters

// Spectral clustering works well with non-convex shapes
let spectral = SpectralClustering::new(2)
    .affinity(Affinity::RBF)
    .assign_labels(AssignLabels::KMeans)
    .random_state(42);

let labels = spectral.fit_predict(&data)?;

// Use a precomputed affinity matrix
let affinity_matrix = compute_rbf_kernel(&data, 1.0)?;
let spectral_custom = SpectralClustering::new(2)
    .affinity(Affinity::Precomputed);
let labels = spectral_custom.fit_predict(&affinity_matrix)?;
```
### Deep Embedded Clustering (DEC)

```rust
use torsh_cluster::prelude::*;
use torsh_nn::prelude::*;

// Stand-in for real training data (e.g. flattened 28x28 images)
let data = randn(&[10_000, 784])?;

// Define autoencoder for feature learning
let autoencoder = Sequential::new()
    .add(Linear::new(784, 500))
    .add(ReLU::new())
    .add(Linear::new(500, 500))
    .add(ReLU::new())
    .add(Linear::new(500, 2000))
    .add(ReLU::new())
    .add(Linear::new(2000, 10)); // 10-dimensional embedding

// Initialize DEC with 10 clusters
let mut dec = DEC::new(autoencoder, 10)
    .update_interval(140)
    .tolerance(1e-3)
    .batch_size(256);

// Pretrain the autoencoder before clustering
dec.pretrain(&data, 100)?; // 100 pretraining epochs

// Cluster
let labels = dec.fit_predict(&data)?;

// Get cluster centers in embedding space
let centers = dec.cluster_centers();
```
### Fuzzy C-Means

```rust
use torsh_cluster::prelude::*;

let data = tensor!([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0]]);

// Fuzzy clustering allows soft assignments
let fcm = FuzzyCMeans::new(2)
    .fuzziness(2.0) // Fuzziness parameter (m)
    .max_iter(150)
    .tolerance(1e-5);

let result = fcm.fit(&data)?;

// Get fuzzy membership matrix (each point belongs to every cluster with a membership degree)
let memberships = result.memberships();
println!("Memberships: {:?}", memberships);

// Get hard cluster assignments (cluster with the highest membership)
let labels = result.labels();
```
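The membership degrees come from a closed-form update: for a point with distance d_k to center k, its membership to cluster k is u_k = 1 / Σ_j (d_k/d_j)^(2/(m-1)), so memberships always sum to 1 and grow sharper as m approaches 1. A dependency-free sketch of that formula for fixed centers (function name illustrative; the zero-distance edge case is handled crudely):

```rust
/// Fuzzy membership of one point to each center:
/// u_k = 1 / sum_j (d_k / d_j)^(2 / (m - 1)),
/// where d_k is the distance from the point to center k and m > 1 is the fuzziness.
fn fuzzy_memberships(point: [f64; 2], centers: &[[f64; 2]], m: f64) -> Vec<f64> {
    let dist = |c: [f64; 2]| ((point[0] - c[0]).powi(2) + (point[1] - c[1]).powi(2)).sqrt();
    let exponent = 2.0 / (m - 1.0);
    centers
        .iter()
        .map(|&ck| {
            let dk = dist(ck);
            if dk == 0.0 {
                return 1.0; // point coincides with this center
            }
            let denom: f64 = centers.iter().map(|&cj| (dk / dist(cj)).powf(exponent)).sum();
            1.0 / denom
        })
        .collect()
}

fn main() {
    let centers = [[0.0, 0.0], [10.0, 0.0]];
    // A midway point gets ~equal membership; a point near a center gets high membership.
    println!("{:?}", fuzzy_memberships([5.0, 0.0], &centers, 2.0)); // [0.5, 0.5]
    println!("{:?}", fuzzy_memberships([1.0, 0.0], &centers, 2.0)); // ~[0.988, 0.012]
}
```

The hard labels reported by `labels()` are simply the argmax of each row of this membership matrix.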
### OPTICS - Ordering Points To Identify Clustering Structure

```rust
use torsh_cluster::prelude::*;

// Load a dataset (path is illustrative)
let data = load_dataset("data/points.csv")?;

// OPTICS can find clusters of varying densities
let optics = OPTICS::new()
    .min_samples(5)
    .max_eps(f64::INFINITY)
    .metric(Metric::Euclidean)
    .cluster_method(ClusterMethod::Xi);

let labels = optics.fit_predict(&data)?;

// Get reachability plot
let reachability = optics.reachability();
let ordering = optics.ordering();

// Extract DBSCAN-style clusters at a specific eps without refitting
let labels_dbscan = optics.extract_dbscan(0.5)?;
```
## Evaluation Metrics

### Clustering Quality

```rust
use torsh_cluster::metrics::*;

let data = generate_blobs(500, 2, 3)?;
let labels = kmeans.fit_predict(&data)?;

// Silhouette score (-1 to 1, higher is better)
let silhouette = silhouette_score(&data, &labels)?;
println!("Silhouette: {:.3}", silhouette);

// Davies-Bouldin index (lower is better)
let db_index = davies_bouldin_index(&data, &labels)?;
println!("Davies-Bouldin: {:.3}", db_index);

// Calinski-Harabasz index (higher is better)
let ch_index = calinski_harabasz_score(&data, &labels)?;
println!("Calinski-Harabasz: {:.3}", ch_index);

// Dunn index (higher is better)
let dunn = dunn_index(&data, &labels)?;
println!("Dunn: {:.3}", dunn);
```
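The silhouette score averages, over all samples, s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from sample i to the other members of its own cluster and b(i) is the smallest mean distance to any other cluster. A dependency-free sketch of that definition (function names illustrative, not this crate's API):

```rust
fn dist(a: [f64; 2], b: [f64; 2]) -> f64 {
    ((a[0] - b[0]).powi(2) + (a[1] - b[1]).powi(2)).sqrt()
}

/// Mean silhouette: s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the
/// mean distance to other members of i's cluster and b(i) is the smallest
/// mean distance to any other cluster. Singleton clusters contribute 0.
fn silhouette_score(points: &[[f64; 2]], labels: &[usize], n_clusters: usize) -> f64 {
    let mut total = 0.0;
    for (i, &p) in points.iter().enumerate() {
        // Sum of distances from point i to each cluster, and each cluster's size
        let mut sums = vec![0.0; n_clusters];
        let mut counts = vec![0usize; n_clusters];
        for (j, &q) in points.iter().enumerate() {
            if i != j {
                sums[labels[j]] += dist(p, q);
                counts[labels[j]] += 1;
            }
        }
        let own = labels[i];
        if counts[own] == 0 {
            continue; // singleton cluster: s(i) = 0
        }
        let a = sums[own] / counts[own] as f64;
        let b = (0..n_clusters)
            .filter(|&k| k != own && counts[k] > 0)
            .map(|k| sums[k] / counts[k] as f64)
            .fold(f64::INFINITY, f64::min);
        total += (b - a) / a.max(b);
    }
    total / points.len() as f64
}

fn main() {
    let points = [[0.0, 0.0], [0.5, 0.0], [10.0, 0.0], [10.5, 0.0]];
    let labels = [0, 0, 1, 1];
    // Two tight, well-separated pairs: score close to 1
    println!("{:.3}", silhouette_score(&points, &labels, 2));
}
```

Scores near 1 mean tight, well-separated clusters; scores near 0 mean overlapping clusters; negative scores suggest points assigned to the wrong cluster.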
### External Validation (when ground truth is available)

```rust
use torsh_cluster::metrics::*;

let true_labels = tensor!([0, 0, 1, 1, 2, 2]);
let pred_labels = tensor!([0, 0, 1, 2, 2, 2]);

// Adjusted Rand Index (-1 to 1, 1 is perfect)
let ari = adjusted_rand_score(&true_labels, &pred_labels)?;

// Normalized Mutual Information (0 to 1, 1 is perfect)
let nmi = normalized_mutual_info_score(&true_labels, &pred_labels)?;

// Fowlkes-Mallows score (0 to 1, 1 is perfect)
let fmi = fowlkes_mallows_score(&true_labels, &pred_labels)?;

// V-measure (0 to 1, 1 is perfect)
let v_measure = v_measure_score(&true_labels, &pred_labels)?;

// Homogeneity, completeness, and V-measure together
let (homogeneity, completeness, v) = homogeneity_completeness_v_measure(&true_labels, &pred_labels)?;
```
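The Adjusted Rand Index is worth unpacking, since it is invariant to label permutation: it compares pair co-assignments through the contingency table of the two labelings and subtracts the chance-level agreement. A dependency-free sketch (function names illustrative):

```rust
use std::collections::HashMap;

fn comb2(n: u64) -> f64 {
    (n * n.saturating_sub(1)) as f64 / 2.0
}

/// Adjusted Rand Index from the contingency table n_ij of two labelings:
/// ARI = (sum_ij C(n_ij,2) - E) / (max_index - E), where
/// E = sum_i C(a_i,2) * sum_j C(b_j,2) / C(n,2) is the chance-level agreement
/// and max_index = (sum_i C(a_i,2) + sum_j C(b_j,2)) / 2.
fn adjusted_rand_score(truth: &[usize], pred: &[usize]) -> f64 {
    let n = truth.len() as u64;
    let mut table: HashMap<(usize, usize), u64> = HashMap::new();
    let mut rows: HashMap<usize, u64> = HashMap::new();
    let mut cols: HashMap<usize, u64> = HashMap::new();
    for (&t, &p) in truth.iter().zip(pred) {
        *table.entry((t, p)).or_insert(0) += 1;
        *rows.entry(t).or_insert(0) += 1;
        *cols.entry(p).or_insert(0) += 1;
    }
    let index: f64 = table.values().map(|&c| comb2(c)).sum();
    let sum_a: f64 = rows.values().map(|&c| comb2(c)).sum();
    let sum_b: f64 = cols.values().map(|&c| comb2(c)).sum();
    let expected = sum_a * sum_b / comb2(n);
    let max_index = (sum_a + sum_b) / 2.0;
    if (max_index - expected).abs() < f64::EPSILON {
        return 1.0; // degenerate case: both partitions are trivial
    }
    (index - expected) / (max_index - expected)
}

fn main() {
    let truth = [0, 0, 1, 1, 2, 2];
    println!("{}", adjusted_rand_score(&truth, &truth));              // identical: 1.0
    println!("{}", adjusted_rand_score(&truth, &[1, 1, 2, 2, 0, 0])); // relabeled: still 1.0
}
```

The second call illustrates the permutation invariance: renaming the clusters leaves the contingency-table counts, and hence the score, unchanged.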
## Initialization Methods

```rust
use torsh_cluster::init::*;

let data = randn(&[1000, 10])?;
let n_clusters = 5;

// K-means++ initialization (spreads initial centers apart)
let centers = kmeans_plusplus(&data, n_clusters)?;

// Random initialization
let centers = random_init(&data, n_clusters)?;

// Furthest-first initialization
let centers = furthest_first(&data, n_clusters)?;

// Use custom initial centers
let kmeans = KMeans::new(n_clusters)
    .init_centers(centers);
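K-means++ is D² sampling: the first center is drawn uniformly, and each subsequent center is drawn with probability proportional to the squared distance to the nearest center already chosen, which spreads the initial centers across the data. A dependency-free sketch with an injected random source (both the function and the toy "RNG" are illustrative):

```rust
/// D^2 sampling for K-means++: first center chosen by `rand01()`, each later
/// one with probability proportional to the squared distance to the nearest
/// chosen center. `rand01` stands in for a seeded RNG returning values in [0, 1).
fn kmeans_plusplus(
    points: &[[f64; 2]],
    k: usize,
    mut rand01: impl FnMut() -> f64,
) -> Vec<[f64; 2]> {
    let sq = |a: [f64; 2], b: [f64; 2]| (a[0] - b[0]).powi(2) + (a[1] - b[1]).powi(2);
    // First center: uniform pick
    let mut centers = vec![points[(rand01() * points.len() as f64) as usize % points.len()]];
    while centers.len() < k {
        // Squared distance from each point to its nearest chosen center
        let d2: Vec<f64> = points
            .iter()
            .map(|&p| centers.iter().map(|&c| sq(p, c)).fold(f64::INFINITY, f64::min))
            .collect();
        let total: f64 = d2.iter().sum();
        // Sample an index with probability proportional to d2
        let mut target = rand01() * total;
        let mut chosen = points.len() - 1;
        for (i, &w) in d2.iter().enumerate() {
            if target < w {
                chosen = i;
                break;
            }
            target -= w;
        }
        centers.push(points[chosen]);
    }
    centers
}

fn main() {
    let points = [[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.1, 0.0]];
    // Toy deterministic "RNG" so the sketch runs without external crates
    let mut state = 0.3_f64;
    let centers = kmeans_plusplus(&points, 2, move || {
        state = (state * 9.73 + 0.17) % 1.0;
        state
    });
    println!("{:?}", centers); // one center from each of the two blobs
}
```

Because far-away points get proportionally more weight, a second center in the same blob as the first is unlikely, which is why K-means++ typically converges in fewer Lloyd iterations than random initialization.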
## Advanced Features

### Mini-Batch K-Means for Large Datasets

```rust
use torsh_cluster::prelude::*;

let large_data = randn(&[1_000_000, 50])?; // 1M samples

// Mini-batch K-Means for scalability (100 clusters)
let mb_kmeans = MiniBatchKMeans::new(100)
    .batch_size(1024)
    .max_iter(100)
    .reassignment_ratio(0.01);

let labels = mb_kmeans.fit_predict(&large_data)?;
```
### Consensus Clustering

```rust
use torsh_cluster::ensemble::*;

let data = generate_blobs(1000, 2, 3)?;

// Ensemble of clustering algorithms
let consensus = ConsensusClustering::new(3)
    .add_clusterer(KMeans::new(3))
    .add_clusterer(AgglomerativeClustering::new(3))
    .add_clusterer(SpectralClustering::new(3))
    .n_runs(10)
    .aggregation_method(AggregationMethod::MajorityVote);

let labels = consensus.fit_predict(&data)?;
```
### GPU-Accelerated Clustering

```rust
use torsh_cluster::prelude::*;

let data = randn(&[1_000_000, 128])?.to_device(Device::cuda(0))?;

// K-Means on GPU
let kmeans = KMeans::new(100)
    .device(Device::cuda(0))
    .max_iter(100);

let labels = kmeans.fit_predict(&data)?;
```
## Utilities

### Elbow Method for Optimal K

```rust
use torsh_cluster::utils::*;

let data = generate_blobs(500, 2, 4)?;

// Compute inertia for a range of cluster counts
let inertias = elbow_method(&data, 1..=10)?;

// Find the elbow point
let optimal_k = find_elbow(&inertias)?;
println!("Optimal k: {}", optimal_k);
```
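One common way to locate the elbow automatically is geometric: draw the straight line between the first and last points of the inertia curve and pick the k whose point lies farthest from that line. A dependency-free sketch of that heuristic (the function name mirrors the snippet above but this implementation is illustrative):

```rust
/// Pick the "elbow" of an inertia curve as the k whose point lies farthest
/// from the straight line between the first and last points of the curve.
/// `inertias[i]` is the inertia obtained with k = i + 1 clusters.
fn find_elbow(inertias: &[f64]) -> usize {
    let n = inertias.len();
    let (x1, y1) = (1.0, inertias[0]);
    let (x2, y2) = (n as f64, inertias[n - 1]);
    let norm = ((x2 - x1).powi(2) + (y2 - y1).powi(2)).sqrt();
    let mut best = (1, 0.0);
    for (i, &y) in inertias.iter().enumerate() {
        let x = (i + 1) as f64;
        // Perpendicular distance from (x, y) to the line through the endpoints
        let d = ((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1).abs() / norm;
        if d > best.1 {
            best = (i + 1, d);
        }
    }
    best.0
}

fn main() {
    // Inertia drops sharply until k = 3, then flattens
    let inertias = [1000.0, 400.0, 100.0, 90.0, 85.0, 82.0];
    println!("optimal k = {}", find_elbow(&inertias)); // prints "optimal k = 3"
}
```

The heuristic assumes a roughly convex, monotonically decreasing curve; for noisy curves, cross-check the result against silhouette analysis.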
### Silhouette Analysis

```rust
use torsh_cluster::utils::*;

// Compute silhouette scores for a range of cluster counts
let silhouette_scores = silhouette_analysis(&data, 2..=10)?;

// Plot silhouette diagram for a specific clustering
let labels = kmeans.fit_predict(&data)?;
let silhouette_values = silhouette_samples(&data, &labels)?;
plot_silhouette_diagram(&silhouette_values, &labels)?;
```
## Integration with SciRS2

This crate leverages the SciRS2 ecosystem for:

- High-performance clustering algorithms through `scirs2-cluster`
- Optimized tensor operations via `scirs2-core`
- Statistical functions from `scirs2-stats`
- Evaluation metrics through `scirs2-metrics`
- Linear algebra operations via `scirs2-linalg`

All implementations follow the SciRS2 POLICY for consistent APIs and optimal performance.
## Examples

See the `examples/` directory for more detailed examples:

- `kmeans_clustering.rs` - Basic K-Means clustering
- `dbscan_anomaly_detection.rs` - Anomaly detection with DBSCAN
- `gmm_soft_clustering.rs` - Probabilistic clustering with GMM
- `hierarchical_dendrogram.rs` - Hierarchical clustering and visualization
- `spectral_nonconvex.rs` - Spectral clustering on non-convex shapes
- `deep_clustering.rs` - Deep embedded clustering
- `large_scale_clustering.rs` - Mini-batch K-Means for big data
## Performance Tips

- Use Mini-Batch K-Means for datasets with >100k samples
- Enable GPU acceleration for large-scale clustering (>1M samples)
- Use K-means++ initialization for better convergence
- Apply PCA for dimensionality reduction before clustering high-dimensional data
- Enable parallel execution with `features = ["parallel"]` in Cargo.toml
## License

Licensed under the Apache License, Version 2.0. See LICENSE for details.