//! Clustering algorithms for grouping similar items.
//!
//! This module provides clustering algorithms used to group items before
//! summarization in hierarchical trees.
//!
//! ## Hard vs Soft Clustering
//!
//! **Hard clustering** assigns each item to exactly one cluster. Simple, but
//! loses information when items genuinely span multiple groups.
//!
//! **Soft clustering** gives each item a probability distribution over clusters.
//! A text chunk might be 60% about "machine learning", 30% about "statistics",
//! 10% about "software". This reflects reality better than forcing a choice.
//!
//! For RAPTOR-style hierarchical summarization, **soft clustering (GMM) is
//! recommended** because text chunks often span multiple topics.
//!
//! ## Algorithms
//!
//! ### Gaussian Mixture Model (GMM)
//!
//! Models data as a mixture of k Gaussian distributions:
//!
//! ```text
//! P(x) = Σ_k π_k × N(x | μ_k, Σ_k)
//! ```
//!
//! Where π_k is the mixture weight (probability of cluster k), and N is the
//! Gaussian density with mean μ_k and covariance Σ_k.
//!
//! **EM Algorithm**:
//! 1. **E-step**: Compute P(cluster k | point x) for each point
//! 2. **M-step**: Update μ, Σ, π to maximize likelihood
//! 3. Repeat until convergence
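//!
//! For example, with two 1-D components N(0, 1) and N(4, 1) and equal weights
//! π_1 = π_2 = 0.5, the E-step responsibility for the point x = 1 is (the
//! shared normalizing constant cancels):
//!
//! ```text
//! N(1 | 0, 1) ∝ exp(-0.5 · 1²) = exp(-0.5) ≈ 0.607
//! N(1 | 4, 1) ∝ exp(-0.5 · 3²) = exp(-4.5) ≈ 0.011
//! P(k = 1 | x = 1) ≈ 0.607 / (0.607 + 0.011) ≈ 0.98
//! ```
//!
//! The M-step then recomputes each μ_k as the responsibility-weighted mean
//! of all points.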
//!
//! **When to use**: When you want soft assignments, or when clusters have
//! different shapes/sizes (unlike K-means which assumes spherical clusters).
//!
//! ### K-means
//!
//! The classic algorithm: assign each point to the nearest centroid, then
//! update centroids to the mean of their points. Repeat.
//!
//! **Objective**: Minimize within-cluster sum of squares:
//!
//! ```text
//! J = Σ_k Σ_{x ∈ C_k} ||x - μ_k||²
//! ```
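//!
//! One iteration on the 1-D points {0, 1, 8, 9} with centroids μ = (0, 9):
//!
//! ```text
//! assign:  0, 1 → μ_1   (|1 - 0| < |1 - 9|)
//!          8, 9 → μ_2
//! update:  μ_1 = (0 + 1) / 2 = 0.5
//!          μ_2 = (8 + 9) / 2 = 8.5
//! ```
//!
//! A second pass changes no assignments, so the algorithm has converged.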
//!
//! **Assumptions**:
//! - Clusters are roughly spherical
//! - Clusters have similar sizes
//! - You know k in advance
//!
//! **When to use**: Fast initial exploration, or when you need hard assignments
//! and can accept the spherical assumption.
//!
//! ### Hierarchical (Agglomerative) Clustering
//!
//! Bottom-up: start with each point as its own cluster, repeatedly merge
//! the two closest clusters until one remains. The merge history forms a
//! **dendrogram**—a binary tree you can cut at any height to get k clusters.
//!
//! **Linkage methods** determine "distance between clusters":
//!
//! | Linkage | Distance | Effect |
//! |---------|----------|--------|
//! | Single | min(pairwise) | Chaining; elongated clusters |
//! | Complete | max(pairwise) | Compact, spherical clusters |
//! | Average | mean(pairwise) | Balanced compromise |
//! | Ward | Variance increase | Minimizes within-cluster variance |
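//!
//! For clusters A = {0, 1} and B = {5, 7}, the pairwise distances are
//! {5, 7, 4, 6}, so the first three linkages give:
//!
//! ```text
//! single(A, B)   = min{5, 7, 4, 6}  = 4
//! complete(A, B) = max{5, 7, 4, 6}  = 7
//! average(A, B)  = mean{5, 7, 4, 6} = 5.5
//! ```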
//!
//! **When to use**: When you want to explore cluster structure at multiple
//! granularities, or don't know k in advance.
//!
//! ## Usage
//!
//! ```rust
//! use sheaf::cluster::{Kmeans, Gmm, Clustering, SoftClustering};
//!
//! let data = vec![
//!     vec![0.0, 0.0],
//!     vec![0.1, 0.1],
//!     vec![10.0, 10.0],
//!     vec![10.1, 10.1],
//! ];
//!
//! // Hard clustering with K-means
//! let labels = Kmeans::new(2).fit_predict(&data).unwrap();
//! assert_eq!(labels[0], labels[1]); // First two together
//! assert_ne!(labels[0], labels[2]); // Separate from last two
//!
//! // Soft clustering with GMM
//! let probs = Gmm::new()
//!     .with_n_components(2)
//!     .fit_predict_proba(&data)
//!     .unwrap();
//! // probs[i][k] = P(point i belongs to cluster k)
//! ```
pub use Clustering;
pub use Gmm;
pub use ;
pub use ItDendrogram;
pub use Kmeans;
pub use SoftClustering;
pub use ;
// Re-export distance metric types from clump so users don't need a direct
// clump dependency to select a metric.
pub use ;
// Constrained clustering (COP-Kmeans, Wagstaff et al. 2001).
pub use ;
// Streaming/online clustering (Mini-Batch K-means, Sculley 2010; DenStream, Cao et al. 2006).
pub use DenStream;
pub use MiniBatchKmeans;
// Correlation clustering (Bansal et al. 2004) -- no k required.
pub use ;
// Data input abstractions (zero-copy flat buffer, ndarray views).
pub use ;
pub use ;