Crate linfa_clustering


linfa-clustering aims to provide pure Rust implementations of popular clustering algorithms.

The big picture

linfa-clustering is a crate in the linfa ecosystem, a wider effort to bootstrap a toolkit for classical Machine Learning implemented in pure Rust, kin in spirit to Python’s scikit-learn.

You can find a roadmap (and a selection of good first issues) here - contributors are more than welcome!

Current state

Right now linfa-clustering provides the following clustering algorithms: K-Means, DBSCAN, Approximated DBSCAN, Gaussian Mixture Model (GMM) and OPTICS.

Implementation choices, algorithmic details and tutorials can be found in the page dedicated to the specific algorithms.

DBSCAN (Density-based Spatial Clustering of Applications with Noise) clusters together neighbouring points, while points in sparse regions are labelled as noise. Since a point may be part of a cluster or noise, the transform method returns Array1<Option<usize>>. Note that some “border” points may technically belong to more than one cluster; because the transform function returns at most one label per point, one of those clusters is chosen arbitrarily for such points.

AppxDbscanParams

Helper struct for building a set of Approximated DBSCAN hyperparameters.

AppxDbscanValidParams

The set of hyperparameters that can be specified for the execution of the Approximated DBSCAN algorithm.

Dbscan

DBSCAN (Density-based Spatial Clustering of Applications with Noise) clusters together points that are close to each other and have enough neighbors, and labels sparsely neighbored points as noise. As a point may be part of a cluster or noise, the predict method returns Array1<Option<usize>>.
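To make the `Option<usize>` labelling concrete, here is a minimal sketch of a naive DBSCAN in plain Rust using only the standard library. It is not the crate's implementation: the function names (`dbscan_labels`, `region_query`), the `Vec<Vec<f64>>` input and the O(n²) neighbour search are assumptions made for illustration, and the result is a `Vec<Option<usize>>` rather than an `Array1<Option<usize>>`.

```rust
/// Squared Euclidean distance between two points of equal dimension.
fn sq_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Indices of all points within `eps` of point `i` (including `i` itself).
fn region_query(points: &[Vec<f64>], i: usize, eps: f64) -> Vec<usize> {
    (0..points.len())
        .filter(|&j| sq_dist(&points[i], &points[j]) <= eps * eps)
        .collect()
}

/// Naive DBSCAN sketch (hypothetical helper, not part of linfa-clustering).
/// Returns one label per point: `Some(cluster_id)` or `None` for noise.
fn dbscan_labels(points: &[Vec<f64>], eps: f64, min_points: usize) -> Vec<Option<usize>> {
    let n = points.len();
    let mut labels: Vec<Option<usize>> = vec![None; n];
    let mut visited = vec![false; n];
    let mut next_cluster = 0;

    for i in 0..n {
        if visited[i] {
            continue;
        }
        visited[i] = true;
        let neighbors = region_query(points, i, eps);
        if neighbors.len() < min_points {
            // Not a core point: stays `None` unless a cluster later claims it as a border point.
            continue;
        }
        labels[i] = Some(next_cluster);
        let mut queue = neighbors;
        while let Some(j) = queue.pop() {
            // A border point keeps the first cluster that reaches it.
            if labels[j].is_none() {
                labels[j] = Some(next_cluster);
            }
            if visited[j] {
                continue;
            }
            visited[j] = true;
            let reachable = region_query(points, j, eps);
            if reachable.len() >= min_points {
                // `j` is itself a core point, so the cluster keeps growing through it.
                queue.extend(reachable);
            }
        }
        next_cluster += 1;
    }
    labels
}
```

In this sketch a border point reachable from two clusters keeps whichever cluster labels it first, which is one way the "only one label per point" behaviour can arise.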

DbscanParams

Helper struct for building a set of DBSCAN hyperparameters.

DbscanValidParams

The set of hyperparameters that can be specified for the execution of the DBSCAN algorithm.

GaussianMixtureModel

Gaussian Mixture Model (GMM) aims at clustering a dataset by finding normally distributed sub-datasets (hence the Gaussian Mixture name).
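As an illustration of what "normally distributed sub-datasets" means for a single observation, the sketch below computes, for a one-dimensional mixture whose parameters are already known, the posterior probability that a value came from each component. The function names and the restriction to one dimension are assumptions for this example. It does not reproduce the crate's API or its fitting procedure.

```rust
/// Density of a univariate normal distribution at `x`.
fn normal_pdf(x: f64, mean: f64, std_dev: f64) -> f64 {
    let z = (x - mean) / std_dev;
    (-0.5 * z * z).exp() / (std_dev * (2.0 * std::f64::consts::PI).sqrt())
}

/// Posterior probability of each mixture component given `x`
/// (hypothetical helper; the three slices must have the same length).
fn responsibilities(x: f64, weights: &[f64], means: &[f64], std_devs: &[f64]) -> Vec<f64> {
    // Weighted density of `x` under each component.
    let joint: Vec<f64> = weights
        .iter()
        .zip(means)
        .zip(std_devs)
        .map(|((&w, &m), &s)| w * normal_pdf(x, m, s))
        .collect();
    // Normalise so the probabilities sum to one.
    let total: f64 = joint.iter().sum();
    joint.iter().map(|j| j / total).collect()
}

/// Index of the component with the highest responsibility (a hard cluster label).
fn most_likely_component(resp: &[f64]) -> usize {
    let mut best = 0;
    for (k, &r) in resp.iter().enumerate() {
        if r > resp[best] {
            best = k;
        }
    }
    best
}
```

Taking the most likely component turns the soft, probabilistic assignment into a hard cluster label.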

GmmParams

The set of hyperparameters that can be specified for the execution of the GMM algorithm.

GmmValidParams

The set of hyperparameters that can be specified for the execution of the GMM algorithm.

KMeans

K-means clustering aims to partition a set of unlabeled observations into clusters, where each observation belongs to the cluster with the nearest mean.
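The "nearest mean" rule can be sketched as one iteration of the classic Lloyd scheme: assign each observation to its closest centroid, then move each centroid to the mean of its assigned observations. This is a standard-library-only illustration. The names `nearest_centroid` and `lloyd_step`, the `Vec<Vec<f64>>` representation, and the choice to leave an empty cluster's centroid unchanged are assumptions, not the crate's implementation.

```rust
/// Squared Euclidean distance between two points of equal dimension.
fn sq_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Index of the centroid closest to `point`.
fn nearest_centroid(point: &[f64], centroids: &[Vec<f64>]) -> usize {
    let mut best = 0;
    let mut best_dist = f64::INFINITY;
    for (k, c) in centroids.iter().enumerate() {
        let d = sq_dist(point, c);
        if d < best_dist {
            best = k;
            best_dist = d;
        }
    }
    best
}

/// One assignment + update iteration (hypothetical helper).
/// Assumes `centroids` is non-empty; returns the cluster index of every
/// point and the updated centroids.
fn lloyd_step(points: &[Vec<f64>], centroids: &[Vec<f64>]) -> (Vec<usize>, Vec<Vec<f64>>) {
    let dim = centroids[0].len();
    let assignments: Vec<usize> = points
        .iter()
        .map(|p| nearest_centroid(p, centroids))
        .collect();

    // Accumulate per-cluster coordinate sums and member counts.
    let mut sums = vec![vec![0.0_f64; dim]; centroids.len()];
    let mut counts = vec![0_usize; centroids.len()];
    for (p, &k) in points.iter().zip(&assignments) {
        counts[k] += 1;
        for (s, x) in sums[k].iter_mut().zip(p) {
            *s += x;
        }
    }

    // New centroid = mean of members; an empty cluster keeps its old centroid.
    let new_centroids: Vec<Vec<f64>> = sums
        .into_iter()
        .zip(&counts)
        .zip(centroids)
        .map(|((sum, &n), old)| {
            if n == 0 {
                old.clone()
            } else {
                sum.into_iter().map(|s| s / n as f64).collect()
            }
        })
        .collect();

    (assignments, new_centroids)
}
```

Repeating this step until the assignments stop changing gives the usual K-means fixed point.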

KMeansParams

A helper struct used to construct a set of valid hyperparameters for the K-means algorithm (using the builder pattern).

KMeansValidParams

The set of hyperparameters that can be specified for the execution of the K-means algorithm.

Optics

OPTICS (Ordering Points To Identify Clustering Structure) is a clustering algorithm that doesn’t explicitly cluster the data but instead creates an “augmented ordering” of the dataset representing its density-based clustering structure. This ordering contains information equivalent to the density-based clusterings and can then be used for automatic and interactive cluster analysis.

OpticsAnalysis

The analysis produced by running OPTICS on a dataset. It allows you to iterate over the data points and access their core and reachability distances. The points are not ordered as in the dataset; they are ordered according to the clustering structure worked out during the analysis.
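For readers unfamiliar with the two distances mentioned above, the following sketch computes them with the textbook OPTICS definitions, using only the standard library. The function names, the `Vec<Vec<f64>>` input, and the convention that a point counts as its own neighbour are assumptions for illustration. The crate's actual types and conventions may differ.

```rust
/// Euclidean distance between two points of equal dimension.
fn dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f64>()
        .sqrt()
}

/// Core distance of point `i`: the distance to its `min_points`-th nearest
/// neighbour within `eps` (counting the point itself), or `None` if the
/// neighbourhood is too sparse for `i` to be a core point.
fn core_distance(points: &[Vec<f64>], i: usize, eps: f64, min_points: usize) -> Option<f64> {
    let mut within: Vec<f64> = points
        .iter()
        .map(|p| dist(&points[i], p))
        .filter(|&d| d <= eps)
        .collect();
    if min_points == 0 || within.len() < min_points {
        return None;
    }
    within.sort_by(|a, b| a.total_cmp(b));
    Some(within[min_points - 1])
}

/// Reachability distance of point `to` with respect to point `from`:
/// the larger of `from`'s core distance and the distance between the two
/// points, or `None` if `from` is not a core point.
fn reachability_distance(
    points: &[Vec<f64>],
    from: usize,
    to: usize,
    eps: f64,
    min_points: usize,
) -> Option<f64> {
    core_distance(points, from, eps, min_points)
        .map(|core| core.max(dist(&points[from], &points[to])))
}
```

Both functions return `Option<f64>` because the distances are undefined for points whose neighbourhood is too sparse.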

OpticsParams

OpticsValidParams

The set of hyperparameters that can be specified for the execution of the OPTICS algorithm.

Sample

This struct represents a data point in the dataset with its associated distances obtained from the OPTICS analysis.