# clump


Dense clustering primitives (k-means, DBSCAN, HDBSCAN, EVoC).

Dual-licensed under MIT or Apache-2.0.

## Quickstart

```toml
[dependencies]
clump = "0.3.0"
```

```rust
use clump::{Clustering, Dbscan, Kmeans};

let data = vec![
    vec![0.0, 0.0],
    vec![0.1, 0.1],
    vec![1.0, 1.0],
    vec![10.0, 10.0],
    vec![11.0, 11.0],
];

// Hard clustering with k-means (default: squared Euclidean distance)
let labels = Kmeans::new(2).with_seed(42).fit_predict(&data).unwrap();

assert_eq!(labels.len(), data.len());
assert_eq!(labels[0], labels[1]); // near each other
assert_ne!(labels[0], labels[2]); // far apart

// Density clustering with DBSCAN (default: Euclidean distance)
let labels = Dbscan::new(0.5, 2).fit_predict(&data).unwrap();
assert_eq!(labels.len(), data.len());
```

## Distance metrics

All clustering algorithms are generic over the `DistanceMetric` trait. The default metric differs per algorithm:

| Algorithm | Default metric | Constructor |
|-----------|----------------|-------------|
| `Kmeans`  | `SquaredEuclidean` | `Kmeans::new(k)` |
| `Dbscan`  | `Euclidean` | `Dbscan::new(eps, min_pts)` |
| `Hdbscan` | `Euclidean` | `Hdbscan::new()` |
| `EVoC`    | `SquaredEuclidean` | `EVoC::new(params)` |

### Built-in metrics

| Metric | Formula |
|--------|---------|
| `SquaredEuclidean` | `sum((a_i - b_i)^2)` |
| `Euclidean` | `sqrt(sum((a_i - b_i)^2))` |
| `CosineDistance` | `1 - cos_sim(a, b)`, range `[0, 2]` |
| `InnerProductDistance` | `-dot(a, b)` |
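The formulas above can be sanity-checked with plain Rust, independent of the crate. These free functions mirror the table (they are illustrative stand-ins, not the crate's actual metric types):

```rust
// Illustrative re-implementations of the table's formulas (not clump's types).
fn squared_euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    squared_euclidean(a, b).sqrt()
}

fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (na * nb)
}

fn inner_product_distance(a: &[f32], b: &[f32]) -> f32 {
    -a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>()
}

fn main() {
    let a = [1.0, 0.0];
    let b = [0.0, 1.0];

    assert_eq!(squared_euclidean(&a, &b), 2.0);
    // Euclidean is the square root of squared Euclidean.
    assert!((euclidean(&a, &b) - 2.0_f32.sqrt()).abs() < 1e-6);
    // Orthogonal unit vectors: cosine similarity 0, cosine distance 1.
    assert!((cosine_distance(&a, &b) - 1.0).abs() < 1e-6);
    // Opposite vectors hit the upper end of the [0, 2] range.
    assert!((cosine_distance(&a, &[-1.0, 0.0]) - 2.0).abs() < 1e-6);
    assert_eq!(inner_product_distance(&a, &b), 0.0);
}
```

Note the practical consequence for k-means: squared Euclidean skips the `sqrt`, which is why it is the default there, while density-based algorithms default to true Euclidean so that `eps` is expressed in ordinary distance units.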

### Using a different built-in metric

Each algorithm provides a `with_metric` constructor:

```rust
use clump::{Kmeans, CosineDistance, Clustering};

let data = vec![
    vec![1.0, 0.0],
    vec![0.9, 0.1],
    vec![0.0, 1.0],
    vec![0.1, 0.9],
];

let labels = Kmeans::with_metric(2, CosineDistance)
    .with_seed(42)
    .fit_predict(&data)
    .unwrap();

assert_eq!(labels[0], labels[1]);
assert_ne!(labels[0], labels[2]);
```

The same pattern works for the other algorithms:

```rust
use clump::{Dbscan, Hdbscan, CosineDistance, SquaredEuclidean, Clustering};

# let data = vec![vec![1.0, 0.0], vec![0.9, 0.1], vec![0.0, 1.0], vec![0.1, 0.9]];
// DBSCAN with cosine distance (epsilon is in cosine distance units)
let _labels = Dbscan::with_metric(0.1, 2, CosineDistance)
    .fit_predict(&data)
    .unwrap();

// HDBSCAN with squared Euclidean distance
let _labels = Hdbscan::with_metric(SquaredEuclidean)
    .with_min_samples(2)
    .with_min_cluster_size(2)
    .fit_predict(&data)
    .unwrap();
```

### Implementing a custom metric

Implement the `DistanceMetric` trait:

```rust
use clump::{DistanceMetric, Kmeans, Clustering};

/// Manhattan (L1) distance.
#[derive(Clone)]
struct Manhattan;

impl DistanceMetric for Manhattan {
    fn distance(&self, a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
    }
}

let data = vec![
    vec![0.0, 0.0],
    vec![0.1, 0.1],
    vec![10.0, 10.0],
    vec![10.1, 10.1],
];

let labels = Kmeans::with_metric(2, Manhattan)
    .with_seed(42)
    .fit_predict(&data)
    .unwrap();

assert_eq!(labels[0], labels[1]);
assert_ne!(labels[0], labels[2]);
```

## EVoC

EVoC (Embedding Vector Oriented Clustering) produces hierarchical clusters with multiple granularity layers and near-duplicate detection:

```rust
use clump::{EVoC, EVoCParams, Clustering};

let data = vec![
    vec![0.0, 0.0],
    vec![0.1, 0.1],
    vec![10.0, 10.0],
    vec![11.0, 11.0],
];

let mut evoc = EVoC::new(EVoCParams {
    intermediate_dim: 1,
    min_cluster_size: 2,
    seed: Some(42),
    ..Default::default()
});
let labels = evoc.fit_predict(&data).unwrap();
assert_eq!(labels.len(), data.len());
assert!(!evoc.cluster_layers().is_empty());
```

## Examples

`clustering.rs` -- K-means, DBSCAN, and HDBSCAN on the same dataset. It generates three well-separated 2D clusters and runs all three algorithms so you can compare their behavior: k-means needs the cluster count up front, DBSCAN discovers it from density, and HDBSCAN adapts without an epsilon parameter. A good starting point for choosing between algorithms.

```sh
cargo run --example clustering
```

## Notes

- `Dbscan::fit_predict` returns a label for every point; noise points are assigned to a special cluster (`clump::NOISE`). If you want `Option` labels, use `Dbscan::fit_predict_with_noise` (or import the `DbscanExt` trait).
- `Kmeans::fit` returns centroids plus labels (`KmeansFit`), which you can reuse to predict labels for new points.
- docs.rs/clump currently documents the latest crates.io release. If you depend on git `main`, prefer local rustdoc (`cargo doc --open`) for up-to-date docs.

## License

MIT OR Apache-2.0