Expand description
§kmeans - API documentation
Kmeans is a small rust library for the calculation of k-means-clustering.
§Design target
It’s main target is high performance / throughput, you will therefore find most of its API-surface rather plain. An example of this is, that samples are given using a raw vector, instead of any high-level arithmetics / matrix crate such as nalgebra or ndarray. For this same reason, the crate is internally using hand-written SIMD operations and (unfortunately) quite a bit of unsafe code, which proved to lead to insane speedups.
§Supported variants
K-Means clustering is not one algorithm, but more like a concept describing the outcome.
There are differing implementations and variants using differing levels of approximations.
For a list of supported variants, have a look at the documentation of KMeans
.
§Supported centroid initializations
The outcome of each K-Means run depends on the initialization of its clusters. There exist
multiple algorithms for this initialization, most of which are based on at least some
degree of randomness. For a list of implemented initialization methods, see KMeans
.
§Supported primitive types
§Example
Both: Supported variants and supported centroid initializations can be combined at will. Here is an example showing the default full k-Means implementation, using K-Mean++ initialization:
use kmeans::*;
let (sample_cnt, sample_dims, k, max_iter) = (20000, 200, 4, 100);
// Generate some random data
let mut samples = vec![0.0f64;sample_cnt * sample_dims];
samples.iter_mut().for_each(|v| *v = rand::random());
// Calculate kmeans, using kmean++ as initialization-method
// KMeans<_, 8> specifies to use f64 SIMD vectors with 8 lanes (e.g. AVX512)
let kmean: KMeans<_, 8, _> = KMeans::new(samples, sample_cnt, sample_dims, EuclideanDistance);
let result = kmean.kmeans_lloyd(k, max_iter, KMeans::init_kmeanplusplus, &KMeansConfig::default());
println!("Centroids: {:?}", result.centroids);
println!("Cluster-Assignments: {:?}", result.assignments);
println!("Error: {}", result.distsum);
§Example (using the status event callbacks)
use kmeans::*;
let (sample_cnt, sample_dims, k, max_iter) = (20000, 200, 4, 2500);
// Generate some random data
let mut samples = vec![0.0f64;sample_cnt * sample_dims];
samples.iter_mut().for_each(|v| *v = rand::random());
let conf = KMeansConfig::build()
.init_done(&|_| println!("Initialization completed."))
.iteration_done(&|s, nr, new_distsum|
println!("Iteration {} - Error: {:.2} -> {:.2} | Improvement: {:.2}",
nr, s.distsum, new_distsum, s.distsum - new_distsum))
.build();
// Calculate kmeans, using kmean++ as initialization-method
// KMeans<_, 8> specifies to use f64 SIMD vectors with 8 lanes (e.g. AVX512)
let kmean: KMeans<_, 8, _> = KMeans::new(samples, sample_cnt, sample_dims, EuclideanDistance);
let result = kmean.kmeans_minibatch(4, k, max_iter, KMeans::init_random_sample, &conf);
println!("Centroids: {:?}", result.centroids);
println!("Cluster-Assignments: {:?}", result.assignments);
println!("Error: {}", result.distsum);
§Short API-Overview / Description
Entry-point of the library is the KMeans
struct. This struct is generic over the underlying primitive
type, that should be used for the calculations. To use KMeans, an instance of this struct is created, taking
over the sample data into its ownership (and doing some memory-related optimizations).
Note: The input-data has to use the same primitive as the required output-data (distances).
The KMeans
struct’s instance-methods represent the supported k-Means variants & implementations.
Calling such a method (e.g. KMeans::kmeans_lloyd
) on the struct does not mutate it, so multiple runs can be
done in parallel (the algorithm itself is already parallellized though). Internally, a new instance of
KMeansState
is used to store the state (and finally the result) of a K-Means calculation.
All of the instance-methods take multiple arguments. One of which is the chosen centroid initialization method. These
initialization-method implementations are static methods within the KMeans
struct, which are simply passed in as reference.
Structs§
- Euclidean
Distance - Histogram
Distance - KMeans
- Entrypoint of this crate’s API-Surface.
- KMeans
Config - This is a structure holding various configuration options for the a k-means calculations, such as the random number generator to use, or a couple of callbacks, that can be set to get status information from a running k-means calculation.
- KMeans
Config Builder - KMeans
State - This is the internally used data-structure, storing the current state during calculation, as
well as the final result, as returned by the API.
All mutations are done in this structure, making
KMeans
immutable, and therefore allowing it to be used in parallel, without having to duplicate the input-data.
Enums§
- Abort
Strategy - Enum with possible abort strategies. These strategies specify when a running iteration (with the k-means calculation) is aborted.
Traits§
- Distance
Function - A trait representing a customizable distance function for k-means clustering.
- Primitive