Struct linfa_clustering::AppxDbscan
source · [−]pub struct AppxDbscan;
Expand description
DBSCAN (Density-based Spatial Clustering of Applications with Noise)
clusters together neighbouring points, while points in sparse regions are labelled
as noise. Since points may be part of a cluster or noise the transform method returns
Array1<Option<usize>>
. It should be noted that some “border” points may technically
belong to more than one cluster but, since the transform function returns only one
label per point (if any), then only one cluster is chosen arbitrarily for those points.
As DBSCAN groups together points in dense regions, the number of clusters is determined by the dataset and distance tolerance not the user.
This is an implementation of the approximated version of DBSCAN with
complexity O(N). Additional information can be found in
this paper.
Beware of the hidden constant O((1/slack)^dimensionality)
of the complexity
for very small values of slack
and high dimensionalities.
The approximated version of the DBSCAN converges to the exact DBSCAN result
for a slack
that goes to zero. Since only the tranform
method is provided and
border points are not assigned deterministically, it may happen that the two
results still differ (in terms of border points) for very small values
of slack
.
Unlike regular DBSCAN, this algorithm only works with Euclidean (L2) distances, not other distance functions.
The algorithm
Let d
be the the number of features of each point:
- The d-dimensional space is divided in a grid of cells, each containing the points of the dataset that fall inside of them.
- All points that have at least
min_points
points in their neighbourhood of sizetolerance
are labeled as “core points”. Every cell containing at least one “core point” is labeled as a “core-cell”. - All “core cells” are set as vertexes of a graph and an arc is built between any two “core cells” according to the following rules:
- If there are two points in the two cells that have distance between them at most
tolerance
then the arc is added to the graph; - If there are two points in the two cells that have distance between them that is between
tolerance
andtolerance * (1 + slack)
then the arc is added arbitrarily to the graph; - Otherwise the arc is not added;
- If there are two points in the two cells that have distance between them at most
- Every connected component of the graph is a cluster and every core point in the CC is given the label of that cluster.
- For all the non-core points in the cells we search for any neighbouring core point (with the same arbirtary rule used to create the arcs)
belonging to a cluster and, if at least one is found, then the non-core point is given the label of the cluster of the core-point. If no such core
point is found then the point is a “noise” point and it is given a label of
None
.
Tutorial
Let’s do a walkthrough of an example running the approximated DBSCAN on some data.
use linfa_clustering::AppxDbscan;
use linfa::traits::Transformer;
use linfa_datasets::generate;
use ndarray::{Axis, array, s};
use ndarray_rand::rand::SeedableRng;
use rand_xoshiro::Xoshiro256Plus;
use approx::assert_abs_diff_eq;
// Our random number generator, seeded for reproducibility
let seed = 42;
let mut rng = Xoshiro256Plus::seed_from_u64(seed);
// `expected_centroids` has shape `(n_centroids, n_features)`
// i.e. three points in the 2-dimensional plane
let expected_centroids = array![[0., 1.], [-10., 20.], [-1., 10.]];
// Let's generate a synthetic dataset: three blobs of observations
// (100 points each) centered around our `expected_centroids`
let observations = generate::blobs(100, &expected_centroids, &mut rng);
// Let's configure and run our AppxDbscan algorithm
// We use the builder pattern to specify the hyperparameters
// `min_points` is the only mandatory parameter.
// If you don't specify the others (e.g. `tolerance`, `slack`)
// default values will be used.
let min_points = 3;
// Let's run the algorithm!
let params = AppxDbscan::params(min_points).tolerance(1e-2).slack(1e-3).transform(&observations).unwrap();
// Points are `None` if noise `Some(id)` if belonging to a cluster.
Implementations
sourceimpl AppxDbscan
impl AppxDbscan
sourcepub fn params<F: Float>(
min_points: usize
) -> AppxDbscanParams<F, CommonNearestNeighbour>
pub fn params<F: Float>(
min_points: usize
) -> AppxDbscanParams<F, CommonNearestNeighbour>
Configures the hyperparameters with the minimum number of points required to form a cluster
Defaults are provided if the optional parameters are not specified:
tolerance = 1e-4
slack = 1e-2
nn_algo = KdTree
pub fn params_with<F: Float, N>(
min_points: usize,
nn_algo: N
) -> AppxDbscanParams<F, N>
Trait Implementations
sourceimpl Clone for AppxDbscan
impl Clone for AppxDbscan
sourcefn clone(&self) -> AppxDbscan
fn clone(&self) -> AppxDbscan
Returns a copy of the value. Read more
1.0.0 · sourcefn clone_from(&mut self, source: &Self)
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source
. Read more
sourceimpl Debug for AppxDbscan
impl Debug for AppxDbscan
sourceimpl PartialEq<AppxDbscan> for AppxDbscan
impl PartialEq<AppxDbscan> for AppxDbscan
impl Eq for AppxDbscan
impl StructuralEq for AppxDbscan
impl StructuralPartialEq for AppxDbscan
Auto Trait Implementations
impl RefUnwindSafe for AppxDbscan
impl Send for AppxDbscan
impl Sync for AppxDbscan
impl Unpin for AppxDbscan
impl UnwindSafe for AppxDbscan
Blanket Implementations
sourceimpl<T> BorrowMut<T> for T where
T: ?Sized,
impl<T> BorrowMut<T> for T where
T: ?Sized,
const: unstable · sourcefn borrow_mut(&mut self) -> &mut T
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value. Read more
sourceimpl<Q, K> Equivalent<K> for Q where
Q: Eq + ?Sized,
K: Borrow<Q> + ?Sized,
impl<Q, K> Equivalent<K> for Q where
Q: Eq + ?Sized,
K: Borrow<Q> + ?Sized,
sourcefn equivalent(&self, key: &K) -> bool
fn equivalent(&self, key: &K) -> bool
Compare self to key
and return true
if they are equal.