# clump

Dense clustering primitives (k-means, DBSCAN, HDBSCAN, EVoC).

Dual-licensed under MIT or Apache-2.0.
## Quickstart

```toml
[dependencies]
clump = "0.3.0"
```

```rust
use clump::{Dbscan, Kmeans};

let data = vec![
    vec![0.0_f32, 0.0],
    vec![0.1, 0.1],
    vec![10.0, 10.0],
    vec![10.1, 10.1],
];

// Hard clustering with k-means (default: squared Euclidean distance)
let labels = Kmeans::new(2).with_seed(42).fit_predict(&data).unwrap();
assert_eq!(labels.len(), data.len());
assert_eq!(labels[0], labels[1]); // near each other
assert_ne!(labels[0], labels[2]); // far apart

// Density clustering with DBSCAN (default: Euclidean distance)
let labels = Dbscan::new(1.0, 2).fit_predict(&data).unwrap();
assert_eq!(labels.len(), data.len());
```
## Distance metrics

All clustering algorithms are generic over the `DistanceMetric` trait. The
default metric differs per algorithm:

| Algorithm | Default metric | Constructor |
|---|---|---|
| `Kmeans` | `SquaredEuclidean` | `Kmeans::new(k)` |
| `Dbscan` | `Euclidean` | `Dbscan::new(eps, min_pts)` |
| `Hdbscan` | `Euclidean` | `Hdbscan::new()` |
| `EVoC` | `SquaredEuclidean` | `EVoC::new(params)` |
### Built-in metrics

| Metric | Formula |
|---|---|
| `SquaredEuclidean` | `sum((a_i - b_i)^2)` |
| `Euclidean` | `sqrt(sum((a_i - b_i)^2))` |
| `CosineDistance` | `1 - cos_sim(a, b)`, range `[0, 2]` |
| `InnerProductDistance` | `-dot(a, b)` |
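For reference, the four formulas can be written out in plain Rust. The free functions below are standalone illustrations only, not clump's types (clump's metrics are types implementing `DistanceMetric`):

```rust
// Standalone versions of the four formulas, for illustration only.

fn squared_euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    squared_euclidean(a, b).sqrt()
}

fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm(a) * norm(b))
}

fn inner_product_distance(a: &[f32], b: &[f32]) -> f32 {
    -a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>()
}

fn main() {
    let (a, b) = ([3.0_f32, 0.0], [0.0, 4.0]);
    println!("{}", squared_euclidean(&a, &b)); // 25
    println!("{}", euclidean(&a, &b)); // 5
    println!("{}", cosine_distance(&a, &b)); // 1 (orthogonal vectors)
    println!("{}", inner_product_distance(&[1.0, 2.0], &[3.0, 4.0])); // -11
}
```

Cosine distance is 1 for orthogonal vectors and 2 for exactly opposite ones, which is where the `[0, 2]` range in the table comes from.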
### Using a different built-in metric

Each algorithm provides a `with_metric` constructor:

```rust
use clump::{CosineDistance, Kmeans};

let data = vec![
    vec![1.0_f32, 0.0],
    vec![0.9, 0.1],
    vec![-1.0, 0.0],
];

let labels = Kmeans::with_metric(2, CosineDistance)
    .with_seed(42)
    .fit_predict(&data)
    .unwrap();
assert_eq!(labels[0], labels[1]);
assert_ne!(labels[0], labels[2]);
```
The same pattern works for the other algorithms:

```rust
use clump::{CosineDistance, Dbscan, Hdbscan, SquaredEuclidean};
# let data = vec![vec![1.0_f32, 0.0], vec![0.9, 0.1], vec![-1.0, 0.0]];

// DBSCAN with cosine distance (epsilon is in cosine distance units)
let _labels = Dbscan::with_metric(0.2, 2, CosineDistance)
    .fit_predict(&data)
    .unwrap();

// HDBSCAN with squared Euclidean distance
let _labels = Hdbscan::with_metric(SquaredEuclidean)
    .with_min_samples(2)
    .with_min_cluster_size(2)
    .fit_predict(&data)
    .unwrap();
```
### Implementing a custom metric

Implement the `DistanceMetric` trait:

```rust
use clump::{DistanceMetric, Kmeans};

/// Manhattan (L1) distance.
struct Manhattan;

impl DistanceMetric for Manhattan {
    fn distance(&self, a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
    }
}

let data = vec![vec![0.0_f32, 0.0], vec![0.5, 0.5], vec![10.0, 10.0]];
let labels = Kmeans::with_metric(2, Manhattan)
    .with_seed(42)
    .fit_predict(&data)
    .unwrap();
assert_eq!(labels[0], labels[1]);
assert_ne!(labels[0], labels[2]);
```
## EVoC

EVoC (Embedding Vector Oriented Clustering) produces hierarchical clusters with multiple granularity layers and near-duplicate detection:

```rust
use clump::EVoC;

let data = vec![
    vec![0.0_f32, 0.0],
    vec![0.1, 0.1],
    vec![10.0, 10.0],
    vec![10.1, 10.1],
];

let mut evoc = EVoC::new(Default::default());
let labels = evoc.fit_predict(&data).unwrap();
assert_eq!(labels.len(), data.len());
assert!(!labels.is_empty());
```
## Examples

- `clustering.rs` -- K-means, DBSCAN, and HDBSCAN on the same dataset. Generates three well-separated 2D clusters and runs all three algorithms to compare their behavior: k-means needs the cluster count up front, DBSCAN discovers it from density, and HDBSCAN adapts without an epsilon parameter. A good starting point for choosing between algorithms.
## Notes

- `Dbscan::fit_predict` returns a label for every point; noise points are assigned to a special cluster (`clump::NOISE`). If you want `Option` labels, use `Dbscan::fit_predict_with_noise` (or import the `DbscanExt` trait).
- `Kmeans::fit` returns centroids + labels (`KmeansFit`), which you can reuse to `predict` labels for new points.
- docs.rs/clump currently documents the latest crates.io release. If you depend on git main, prefer local rustdoc (`cargo doc --open`) for up-to-date docs.
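Reusing a fitted k-means model amounts to nearest-centroid assignment. The helper below is a hypothetical pure-Rust sketch of that idea, not clump's actual `predict` implementation:

```rust
/// Illustrative nearest-centroid assignment: conceptually what labeling
/// new points with a fitted k-means model does. Not clump's API.
fn predict(centroids: &[Vec<f32>], point: &[f32]) -> usize {
    let sq = |a: &[f32], b: &[f32]| -> f32 {
        a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
    };
    (0..centroids.len())
        .min_by(|&i, &j| {
            sq(&centroids[i], point)
                .partial_cmp(&sq(&centroids[j], point))
                .unwrap()
        })
        .unwrap()
}

fn main() {
    let centroids = vec![vec![0.0, 0.0], vec![10.0, 10.0]];
    assert_eq!(predict(&centroids, &[0.5, -0.5]), 0);
    assert_eq!(predict(&centroids, &[9.0, 11.0]), 1);
}
```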
## License

MIT OR Apache-2.0