## Installation

```toml
[dependencies]
fastkmeans-rs = "0.1"
```
## Features

| Feature | Platform | Description |
|---|---|---|
| `cuda` | NVIDIA GPU | Flash-accelerated CUDA with cuBLAS GEMM and warp-cooperative kernels |
| `metal_gpu` | macOS (Apple Silicon) | Metal Performance Shaders GPU acceleration |
| `accelerate` | macOS | Apple Accelerate BLAS for CPU |
| `mkl` | Linux (Intel/AMD) | Intel MKL for CPU (recommended for Linux, fastest) |
| `openblas` | Linux / Windows | OpenBLAS for CPU (requires libopenblas-dev) |
```toml
# NVIDIA GPU (recommended for Linux)
fastkmeans-rs = { version = "0.1", features = ["cuda"] }

# Apple Silicon GPU
fastkmeans-rs = { version = "0.1", features = ["metal_gpu", "accelerate"] }

# CPU-only with BLAS
fastkmeans-rs = { version = "0.1", features = ["mkl"] }        # Linux (fastest)
fastkmeans-rs = { version = "0.1", features = ["accelerate"] } # macOS
fastkmeans-rs = { version = "0.1", features = ["openblas"] }   # Linux (fallback)
```
When `cuda` or `metal_gpu` is enabled, FastKMeans automatically uses the GPU. No code changes are needed.
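Because the backend is chosen by Cargo features, the selection happens at compile time rather than at runtime. The following is an illustrative sketch of how `cfg!`-gated backend dispatch typically looks in Rust; the feature names match the table above, but the dispatch code itself is assumed, not taken from fastkmeans-rs:

```rust
// Illustrative only: the dispatch logic is assumed, not the crate's actual source.
// `cfg!(feature = "...")` evaluates at compile time based on enabled Cargo features.
fn backend() -> &'static str {
    if cfg!(feature = "cuda") {
        "cuda" // NVIDIA GPU path
    } else if cfg!(feature = "metal_gpu") {
        "metal" // Apple Silicon GPU path
    } else {
        "cpu" // BLAS fallback
    }
}

fn main() {
    // With no GPU feature enabled at build time, the CPU path is chosen.
    println!("selected backend: {}", backend());
}
```

The dead branches are eliminated by the compiler, so a CPU-only build carries no GPU code.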
## Usage

```rust
use fastkmeans_rs::{FastKMeans, FastKMeansConfig}; // config type name assumed
use ndarray::Array2;
use ndarray_rand::RandomExt;
use ndarray_rand::rand_distr::Uniform;

// Generate data: 100K points, 128 dimensions
let data = Array2::random((100_000, 128), Uniform::new(0.0f32, 1.0));

// Create model with 256 clusters
let config = FastKMeansConfig::new(256)
    .with_max_iters(25)
    .with_max_points_per_centroid(256); // value assumed
let mut kmeans = FastKMeans::with_config(config);

// Train
kmeans.train(&data).unwrap();

// Predict
let labels = kmeans.predict(&data).unwrap();

// Or fit + predict in one call
let labels = kmeans.fit_predict(&data).unwrap();

// Access centroids
let centroids = kmeans.centroids().unwrap(); // shape: (256, 128)
```
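Under the hood, training amounts to standard Lloyd iterations: assign every point to its nearest centroid, then recompute each centroid as the mean of its assigned points. A minimal std-only sketch of that loop on 2-D data, independent of the crate (real implementations use smarter seeding such as random sampling or k-means++):

```rust
/// Squared Euclidean distance between two 2-D points.
fn dist2(a: &[f64; 2], b: &[f64; 2]) -> f64 {
    (a[0] - b[0]).powi(2) + (a[1] - b[1]).powi(2)
}

/// Lloyd's algorithm on 2-D points (fixed dimension kept for brevity).
fn kmeans(data: &[[f64; 2]], k: usize, iters: usize) -> (Vec<[f64; 2]>, Vec<usize>) {
    // Naive seeding: take the first k points as initial centroids.
    let mut centroids: Vec<[f64; 2]> = data[..k].to_vec();
    let mut labels = vec![0usize; data.len()];
    for _ in 0..iters {
        // Assignment step: nearest centroid by squared distance.
        for (i, p) in data.iter().enumerate() {
            labels[i] = (0..k)
                .min_by(|&a, &b| {
                    dist2(p, &centroids[a])
                        .partial_cmp(&dist2(p, &centroids[b]))
                        .unwrap()
                })
                .unwrap();
        }
        // Update step: move each centroid to the mean of its points.
        let mut sums = vec![[0.0f64; 2]; k];
        let mut counts = vec![0usize; k];
        for (p, &l) in data.iter().zip(&labels) {
            sums[l][0] += p[0];
            sums[l][1] += p[1];
            counts[l] += 1;
        }
        for c in 0..k {
            if counts[c] > 0 {
                centroids[c] = [sums[c][0] / counts[c] as f64, sums[c][1] / counts[c] as f64];
            }
        }
    }
    (centroids, labels)
}

fn main() {
    // Two obvious clusters near (0, 0) and (10, 10).
    let data = [
        [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
        [10.0, 10.1], [10.2, 10.0], [9.9, 10.2],
    ];
    let (centroids, labels) = kmeans(&data, 2, 10);
    println!("labels: {:?}", labels);
    println!("centroids: {:?}", centroids);
}
```

Options like `max_points_per_centroid` bound the work per iteration by subsampling the data, which matters at the 100K-point scale shown above.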
## Benchmarks

All benchmarks run with 25 iterations.

### fastkmeans-rs vs fast-kmeans vs flash-kmeans

Train 100K vectors, 128 dimensions, 25 iterations.

Compared against fast-kmeans and flash-kmeans (optimized Triton kernels). CUDA benchmarks were run on an H100 and Metal GPU benchmarks on Apple Silicon. fastkmeans-rs is pure Rust with no Python dependency.
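The cuBLAS GEMM mentioned in the feature table refers to the usual trick of expanding ||x − c||² = ||x||² + ||c||² − 2·x·c, which turns the all-pairs point-to-centroid distance computation into one large matrix multiply plus cheap norm corrections. A plain-Rust CPU sketch of that decomposition (not the crate's kernels):

```rust
/// Pairwise squared Euclidean distances between rows of `x` (n x d) and rows of `c` (k x d),
/// via ||x - c||^2 = ||x||^2 + ||c||^2 - 2 * x.c, so the dominant cost is the dot-product
/// matrix — exactly the part a GPU replaces with a single GEMM.
fn pairwise_sq_dists(x: &[Vec<f64>], c: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let x_norms: Vec<f64> = x.iter().map(|r| r.iter().map(|v| v * v).sum()).collect();
    let c_norms: Vec<f64> = c.iter().map(|r| r.iter().map(|v| v * v).sum()).collect();
    x.iter()
        .enumerate()
        .map(|(i, xi)| {
            c.iter()
                .enumerate()
                .map(|(j, cj)| {
                    // This inner loop is one entry of the x * c^T matrix product.
                    let dot: f64 = xi.iter().zip(cj).map(|(a, b)| a * b).sum();
                    x_norms[i] + c_norms[j] - 2.0 * dot
                })
                .collect()
        })
        .collect()
}

fn main() {
    let x = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let c = vec![vec![0.0, 0.0], vec![1.0, 1.0]];
    let d = pairwise_sq_dists(&x, &c);
    // e.g. ||(1,2) - (1,1)||^2 = 0 + 1 = 1
    println!("{:?}", d);
}
```

Since the norm terms are rank-1 corrections computed once per iteration, nearly all of the FLOPs land in the GEMM, which is why a tuned BLAS (or cuBLAS) dominates end-to-end speed.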
## Acknowledgements
Based on fast-kmeans and flash-kmeans. Credit for the algorithm design goes to the original authors.
## License
Apache-2.0 — see LICENSE.