# Kategorize
A fast, memory-efficient Rust implementation of k-modes and k-prototypes clustering algorithms for categorical and mixed data.
## Features
- K-modes clustering: Designed specifically for categorical data
- K-prototypes clustering: Handles mixed categorical and numerical data
- Multiple initialization methods: Huang, Cao, and random initialization
- Multiple distance metrics: Matching, Hamming, and Jaccard distance support
- Parallel processing: Multi-threaded execution for better performance
- Memory efficient: Optimized data structures and algorithms
- Incremental mode updates: 30-50% performance improvement through cached frequency tracking
- Comprehensive: Full-featured with proper error handling and validation
- Well tested: Extensive unit tests, integration tests, and benchmarks
## Quick Start

Add this to your `Cargo.toml`:

```toml
[dependencies]
kategorize = "0.3"
```
### Basic K-modes Clustering

```rust
use kategorize::{KModes, InitMethod, DistanceMetric};
use ndarray::Array2;

// Create categorical data (6 points, 2 features; values are illustrative)
let data = Array2::from_shape_vec((6, 2), vec![
    0, 1,
    0, 1,
    1, 0,
    2, 2,
    2, 1,
    2, 2,
]).unwrap();

// Configure and run k-modes clustering (parameter values are examples)
let kmodes = KModes::new(2)                    // number of clusters
    .init_method(InitMethod::Huang)
    .distance_metric(DistanceMetric::Matching) // choose distance metric
    .use_incremental_updates(true)             // enable performance optimization
    .max_iter(100)
    .n_init(10)
    .random_state(42);

let result = kmodes.fit(&data).unwrap();
println!("Labels: {:?}", result.labels);       // result fields shown are illustrative
println!("Centroids: {:?}", result.centroids);
println!("Cost: {}", result.cost);
```
### K-prototypes for Mixed Data

```rust
use kategorize::KPrototypes;
use ndarray::Array2;

// Create mixed categorical and numerical data (values are illustrative)
let data = Array2::from_shape_vec((4, 3), vec![
    0.0, 1.0, 2.5,
    0.0, 2.0, 3.1,
    1.0, 1.0, 8.4,
    1.0, 2.0, 9.0,
]).unwrap();

let kprototypes = KPrototypes::new(2, vec![0, 1]) // categorical: [0, 1], numerical: [2]
    .gamma(0.5)                                   // weight for numerical vs categorical features
    .random_state(42);

let result = kprototypes.fit(&data).unwrap();
println!("Labels: {:?}", result.labels);          // result fields shown are illustrative
```
## Algorithms

### K-modes
K-modes extends k-means clustering to categorical data by:
- Using simple matching dissimilarity instead of Euclidean distance
- Replacing means with modes (most frequent values) for centroid updates
- Supporting various initialization strategies for better convergence
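The first two ideas can be sketched in plain Rust. The helper functions below are illustrative stand-ins, not Kategorize's internal API: one computes the simple matching dissimilarity between two categorical rows, the other the column-wise mode used for centroid updates.

```rust
use std::collections::HashMap;

/// Simple matching dissimilarity: number of positions where two rows disagree.
fn matching_dissimilarity(a: &[u32], b: &[u32]) -> usize {
    a.iter().zip(b).filter(|(x, y)| x != y).count()
}

/// Column-wise mode: the most frequent category in each column
/// (this replaces the mean used by k-means centroid updates).
fn mode(rows: &[Vec<u32>]) -> Vec<u32> {
    let n_cols = rows[0].len();
    (0..n_cols)
        .map(|j| {
            let mut counts: HashMap<u32, usize> = HashMap::new();
            for row in rows {
                *counts.entry(row[j]).or_insert(0) += 1;
            }
            counts.into_iter().max_by_key(|&(_, c)| c).unwrap().0
        })
        .collect()
}

fn main() {
    // Rows differ only in position 1, so the dissimilarity is 1.
    assert_eq!(matching_dissimilarity(&[0, 1, 2], &[0, 2, 2]), 1);

    // Column 0 mode is 0 (appears twice), column 1 mode is 1 (appears twice).
    let cluster = vec![vec![0, 1], vec![0, 2], vec![1, 1]];
    assert_eq!(mode(&cluster), vec![0, 1]);
}
```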
### K-prototypes
K-prototypes combines k-modes and k-means to handle mixed data by:
- Computing separate distances for categorical and numerical features
- Using a γ (gamma) parameter to balance the two distance components
- Updating centroids using modes for categorical and means for numerical features
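The combined distance can be sketched as follows. This is an illustrative function, not Kategorize's internals: it uses squared Euclidean distance on the numerical part plus `gamma` times matching dissimilarity on the categorical part, following Huang (1998).

```rust
/// Sketch of the k-prototypes distance between a point and a prototype.
/// `gamma` weights the categorical component against the numerical one.
fn kprototypes_distance(
    num_a: &[f64], num_b: &[f64], // numerical features
    cat_a: &[u32], cat_b: &[u32], // categorical features
    gamma: f64,
) -> f64 {
    // Squared Euclidean distance over the numerical features.
    let numeric: f64 = num_a.iter().zip(num_b).map(|(x, y)| (x - y).powi(2)).sum();
    // Matching dissimilarity over the categorical features.
    let categorical = cat_a.iter().zip(cat_b).filter(|(x, y)| x != y).count() as f64;
    numeric + gamma * categorical
}

fn main() {
    // One numerical gap of 2.0 (squared: 4.0) plus one categorical mismatch
    // weighted by gamma = 0.5 gives 4.0 + 0.5 = 4.5.
    let d = kprototypes_distance(&[1.0], &[3.0], &[0, 1], &[0, 2], 0.5);
    assert!((d - 4.5).abs() < 1e-12);
}
```

Larger `gamma` values make categorical agreement dominate the clustering; smaller values favor the numerical features.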
## Initialization Methods
- Huang: Density-based initialization selecting most frequent data points
- Cao: Improved initialization maximizing dissimilarity between initial centroids
- Random: Randomly selected data points as initial centroids
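For intuition, the simplest of the three, random initialization, can be sketched as picking k distinct data rows as initial centroids. The code below is illustrative, not Kategorize's implementation; a tiny seeded generator stands in for a real RNG so the sketch needs no external crates.

```rust
/// Pick `k` distinct rows of `data` as initial centroids (illustrative sketch).
fn random_init(data: &[Vec<u32>], k: usize, seed: u64) -> Vec<Vec<u32>> {
    let mut state = seed;
    let mut indices: Vec<usize> = Vec::new();
    while indices.len() < k {
        // Minimal linear congruential generator, for illustration only.
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let i = (state >> 33) as usize % data.len();
        if !indices.contains(&i) {
            indices.push(i); // keep centroids distinct
        }
    }
    indices.into_iter().map(|i| data[i].clone()).collect()
}

fn main() {
    let data = vec![vec![0, 1], vec![1, 1], vec![2, 0], vec![0, 0]];
    let centroids = random_init(&data, 2, 42);
    assert_eq!(centroids.len(), 2);
    assert_ne!(centroids[0], centroids[1]); // all rows here are distinct
}
```

Huang and Cao initialization replace the uniform pick with frequency- and dissimilarity-driven selection, which typically converges in fewer iterations.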
## Distance Metrics
- Matching: Simple matching dissimilarity (0 for match, 1 for mismatch)
- Hamming: Normalized matching distance (proportion of mismatches)
- Jaccard: Set-based similarity, ideal for data with repeated values or tag-like categories
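As a rough sketch (stand-alone functions with assumed signatures, not Kategorize's API), the three metrics for a pair of categorical rows might look like this, with Jaccard treating each row as a set of values:

```rust
use std::collections::HashSet;

/// Matching: count of mismatching positions.
fn matching(a: &[u32], b: &[u32]) -> f64 {
    a.iter().zip(b).filter(|(x, y)| x != y).count() as f64
}

/// Hamming: matching distance normalized by the number of features.
fn hamming(a: &[u32], b: &[u32]) -> f64 {
    matching(a, b) / a.len() as f64
}

/// Jaccard distance: 1 - |A ∩ B| / |A ∪ B| over the sets of values in each row.
fn jaccard(a: &[u32], b: &[u32]) -> f64 {
    let sa: HashSet<_> = a.iter().collect();
    let sb: HashSet<_> = b.iter().collect();
    let inter = sa.intersection(&sb).count() as f64;
    let union = sa.union(&sb).count() as f64;
    1.0 - inter / union
}

fn main() {
    let (a, b) = ([0, 1, 2, 2], [0, 1, 3, 3]);
    assert_eq!(matching(&a, &b), 2.0); // positions 2 and 3 differ
    assert_eq!(hamming(&a, &b), 0.5);  // 2 mismatches out of 4 features
    // Sets {0,1,2} vs {0,1,3}: intersection 2, union 4 -> distance 0.5.
    assert!((jaccard(&a, &b) - 0.5).abs() < 1e-12);
}
```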
## Performance
Kategorize is designed for performance with:
- Incremental mode updates: Cached frequency tracking eliminates expensive mode recomputation (30-50% speedup)
- Parallel processing: Using Rayon for multiple initialization runs
- Efficient data structures: Minimizing memory allocation in hot paths
- Optimized distance calculations: Fast categorical data distance computation
- Early convergence detection: Smart stopping criteria reduce unnecessary iterations
### Incremental Mode Updates (New in v0.3.0)
By default, Kategorize uses incremental mode updates for significant performance improvements. This feature caches frequency counts for each cluster and updates them incrementally as points change assignments, avoiding expensive mode recomputation.
```rust
// Parameter values below are illustrative.
let kmodes = KModes::new(3)
    .use_incremental_updates(true)  // enabled by default
    .random_state(42);

// For comparison, disable to use the classic algorithm:
let classic_kmodes = KModes::new(3)
    .use_incremental_updates(false)
    .random_state(42);
```
Performance benefits vary by dataset size:
- Small datasets (100-1000 points): 25-40% speedup
- Medium datasets (1000-10000 points): 40-60% speedup
- Large datasets (10000+ points): 60%+ speedup
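The cached-frequency idea behind these numbers can be sketched as a small standalone structure (illustrative, not Kategorize's internal implementation): per-cluster, per-column counts are adjusted when a point joins or leaves, so the mode is read from the cache instead of rescanning the whole cluster.

```rust
use std::collections::HashMap;

/// Per-cluster frequency cache: counts[column][category] = member count.
struct FrequencyCache {
    counts: Vec<HashMap<u32, usize>>,
}

impl FrequencyCache {
    fn new(n_cols: usize) -> Self {
        Self { counts: vec![HashMap::new(); n_cols] }
    }

    /// O(n_features) update when a point is assigned to this cluster.
    fn add(&mut self, point: &[u32]) {
        for (j, &v) in point.iter().enumerate() {
            *self.counts[j].entry(v).or_insert(0) += 1;
        }
    }

    /// O(n_features) update when a point leaves this cluster.
    fn remove(&mut self, point: &[u32]) {
        for (j, &v) in point.iter().enumerate() {
            if let Some(c) = self.counts[j].get_mut(&v) {
                *c -= 1;
            }
        }
    }

    /// Current mode of each column, read straight from the cache.
    fn mode(&self) -> Vec<u32> {
        self.counts
            .iter()
            .map(|col| *col.iter().max_by_key(|&(_, c)| c).unwrap().0)
            .collect()
    }
}

fn main() {
    let mut cache = FrequencyCache::new(2);
    for p in [[0, 1], [0, 1], [1, 2], [1, 2], [1, 2]] {
        cache.add(&p);
    }
    assert_eq!(cache.mode(), vec![1, 2]);
    // Reassigning points updates the mode without a full rescan.
    cache.remove(&[1, 2]);
    cache.remove(&[1, 2]);
    assert_eq!(cache.mode(), vec![0, 1]);
}
```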
## Examples
Check out the examples directory for comprehensive usage patterns:
- `basic_kmodes.rs` - Basic k-modes clustering
- `kprototypes_mixed_data.rs` - Mixed data clustering
- `advanced_usage.rs` - Parameter tuning and optimization
- `jaccard_distance.rs` - Jaccard distance metric usage
Run examples with:

```sh
cargo run --example basic_kmodes
```
## Benchmarks

Run benchmarks to see performance characteristics:

```sh
cargo bench
```
This will run comprehensive benchmarks testing:
- Different data sizes and cluster counts
- Initialization method performance
- Scaling characteristics
- Parameter sensitivity
## API Documentation

For detailed API documentation, visit [docs.rs/kategorize](https://docs.rs/kategorize) or run:

```sh
cargo doc --open
```
## Comparison with Python kmodes
Kategorize provides similar functionality to the popular Python kmodes library but with:
- Better performance: Rust's zero-cost abstractions and memory safety
- Type safety: Compile-time guarantees about data types and dimensions
- Parallel processing: Built-in support for multi-threading
- Memory efficiency: Lower memory footprint and no GIL limitations
## Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Areas where contributions would be particularly valuable:
- Additional initialization methods
- More distance metrics for categorical data
- Incremental/online clustering algorithms
- Integration with other ML libraries
## License
This project is licensed under the MIT License.
## Citation
If you use Kategorize in your research, please cite:
## References
- Huang, Z. (1997). A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining. SIGMOD Record.
- Huang, Z. (1998). Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery.
- Cao, F., Liang, J., Bai, L. (2009). A new initialization method for categorical data clustering. Expert Systems with Applications.