# Kategorize
A fast, memory-efficient Rust implementation of k-modes and k-prototypes clustering algorithms for categorical and mixed data.
## Features
- K-modes clustering: Designed specifically for categorical data
- K-prototypes clustering: Handles mixed categorical and numerical data
- Multiple initialization methods: Huang, Cao, and random initialization
- Multiple distance metrics: Matching, Hamming, and Jaccard distance support
- Parallel processing: Multi-threaded execution for better performance
- Memory efficient: Optimized data structures and algorithms
- Incremental mode updates: 30-50% performance improvement through cached frequency tracking
- Comprehensive: Full-featured with proper error handling and validation
- Well tested: Extensive unit tests, integration tests, and benchmarks
## Quick Start

Add this to your `Cargo.toml`:

```toml
[dependencies]
kategorize = "0.3"
```
### Basic K-modes Clustering

```rust
use kategorize::{KModes, InitMethod, DistanceMetric};
use ndarray::Array2;

// Create categorical data (6 points, 2 features; values are illustrative)
let data = Array2::from_shape_vec((6, 2), vec![
    0, 1,
    0, 1,
    1, 0,
    2, 2,
    2, 1,
    2, 2,
]).unwrap();

// Configure and run k-modes clustering (parameter values are examples)
let kmodes = KModes::new(2)                    // number of clusters
    .init_method(InitMethod::Huang)
    .distance_metric(DistanceMetric::Matching) // choose distance metric
    .use_incremental_updates(true)             // enable performance optimization
    .max_iter(100)
    .n_init(10)
    .random_state(42);

let result = kmodes.fit(&data).unwrap();
println!("Labels: {:?}", result.labels);       // result fields shown are illustrative
println!("Centroids: {:?}", result.centroids);
println!("Cost: {}", result.cost);
```
### K-prototypes for Mixed Data

```rust
use kategorize::KPrototypes;
use ndarray::Array2;

// Create mixed categorical and numerical data (values are illustrative)
let data = Array2::from_shape_vec((4, 3), vec![
    0.0, 1.0, 2.5,
    0.0, 2.0, 3.1,
    1.0, 1.0, 8.4,
    1.0, 2.0, 9.0,
]).unwrap();

let kprototypes = KPrototypes::new(2, vec![0, 1]) // categorical: [0, 1], numerical: [2]
    .gamma(0.5)                                   // weight for numerical vs categorical features
    .random_state(42);

let result = kprototypes.fit(&data).unwrap();
println!("Labels: {:?}", result.labels);          // result fields shown are illustrative
```
## Algorithms

### K-modes
K-modes extends k-means clustering to categorical data by:
- Using simple matching dissimilarity instead of Euclidean distance
- Replacing means with modes (most frequent values) for centroid updates
- Supporting various initialization strategies for better convergence
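The first two ideas can be sketched in plain Rust. The helper functions below are illustrative stand-ins, not Kategorize's internal API: one computes the simple matching dissimilarity between two categorical rows, the other the column-wise mode used for centroid updates.

```rust
use std::collections::HashMap;

/// Simple matching dissimilarity: number of positions where two rows disagree.
fn matching_dissimilarity(a: &[u32], b: &[u32]) -> usize {
    a.iter().zip(b).filter(|(x, y)| x != y).count()
}

/// Column-wise mode: the most frequent category in each column
/// (this replaces the mean used by k-means centroid updates).
fn mode(rows: &[Vec<u32>]) -> Vec<u32> {
    let n_cols = rows[0].len();
    (0..n_cols)
        .map(|j| {
            let mut counts: HashMap<u32, usize> = HashMap::new();
            for row in rows {
                *counts.entry(row[j]).or_insert(0) += 1;
            }
            counts.into_iter().max_by_key(|&(_, c)| c).unwrap().0
        })
        .collect()
}

fn main() {
    // Rows differ only in position 1, so the dissimilarity is 1.
    assert_eq!(matching_dissimilarity(&[0, 1, 2], &[0, 2, 2]), 1);

    // Column 0 mode is 0 (appears twice), column 1 mode is 1 (appears twice).
    let cluster = vec![vec![0, 1], vec![0, 2], vec![1, 1]];
    assert_eq!(mode(&cluster), vec![0, 1]);
}
```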
### K-prototypes
K-prototypes combines k-modes and k-means to handle mixed data by:
- Computing separate distances for categorical and numerical features
- Using a γ (gamma) parameter to balance the two distance components
- Updating centroids using modes for categorical and means for numerical features
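The combined distance can be sketched as follows. This is an illustrative function, not Kategorize's internals: it uses squared Euclidean distance on the numerical part plus `gamma` times matching dissimilarity on the categorical part, following Huang (1998).

```rust
/// Sketch of the k-prototypes distance between a point and a prototype.
/// `gamma` weights the categorical component against the numerical one.
fn kprototypes_distance(
    num_a: &[f64], num_b: &[f64], // numerical features
    cat_a: &[u32], cat_b: &[u32], // categorical features
    gamma: f64,
) -> f64 {
    // Squared Euclidean distance over the numerical features.
    let numeric: f64 = num_a.iter().zip(num_b).map(|(x, y)| (x - y).powi(2)).sum();
    // Matching dissimilarity over the categorical features.
    let categorical = cat_a.iter().zip(cat_b).filter(|(x, y)| x != y).count() as f64;
    numeric + gamma * categorical
}

fn main() {
    // One numerical gap of 2.0 (squared: 4.0) plus one categorical mismatch
    // weighted by gamma = 0.5 gives 4.0 + 0.5 = 4.5.
    let d = kprototypes_distance(&[1.0], &[3.0], &[0, 1], &[0, 2], 0.5);
    assert!((d - 4.5).abs() < 1e-12);
}
```

Larger `gamma` values make categorical agreement dominate the clustering; smaller values favor the numerical features.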
## Initialization Methods
- Huang: Density-based initialization selecting most frequent data points
- Cao: Improved initialization maximizing dissimilarity between initial centroids
- Random: Randomly selected data points as initial centroids
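For intuition, the simplest of the three, random initialization, can be sketched as picking k distinct data rows as initial centroids. The code below is illustrative, not Kategorize's implementation; a tiny seeded generator stands in for a real RNG so the sketch needs no external crates.

```rust
/// Pick `k` distinct rows of `data` as initial centroids (illustrative sketch).
fn random_init(data: &[Vec<u32>], k: usize, seed: u64) -> Vec<Vec<u32>> {
    let mut state = seed;
    let mut indices: Vec<usize> = Vec::new();
    while indices.len() < k {
        // Minimal linear congruential generator, for illustration only.
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let i = (state >> 33) as usize % data.len();
        if !indices.contains(&i) {
            indices.push(i); // keep centroids distinct
        }
    }
    indices.into_iter().map(|i| data[i].clone()).collect()
}

fn main() {
    let data = vec![vec![0, 1], vec![1, 1], vec![2, 0], vec![0, 0]];
    let centroids = random_init(&data, 2, 42);
    assert_eq!(centroids.len(), 2);
    assert_ne!(centroids[0], centroids[1]); // all rows here are distinct
}
```

Huang and Cao initialization replace the uniform pick with frequency- and dissimilarity-driven selection, which typically converges in fewer iterations.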
## Distance Metrics
- Matching: Simple matching dissimilarity (0 for match, 1 for mismatch)
- Hamming: Normalized matching distance (proportion of mismatches)
- Jaccard: Set-based similarity, ideal for data with repeated values or tag-like categories
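As a rough sketch (stand-alone functions with assumed signatures, not Kategorize's API), the three metrics for a pair of categorical rows might look like this, with Jaccard treating each row as a set of values:

```rust
use std::collections::HashSet;

/// Matching: count of mismatching positions.
fn matching(a: &[u32], b: &[u32]) -> f64 {
    a.iter().zip(b).filter(|(x, y)| x != y).count() as f64
}

/// Hamming: matching distance normalized by the number of features.
fn hamming(a: &[u32], b: &[u32]) -> f64 {
    matching(a, b) / a.len() as f64
}

/// Jaccard distance: 1 - |A ∩ B| / |A ∪ B| over the sets of values in each row.
fn jaccard(a: &[u32], b: &[u32]) -> f64 {
    let sa: HashSet<_> = a.iter().collect();
    let sb: HashSet<_> = b.iter().collect();
    let inter = sa.intersection(&sb).count() as f64;
    let union = sa.union(&sb).count() as f64;
    1.0 - inter / union
}

fn main() {
    let (a, b) = ([0, 1, 2, 2], [0, 1, 3, 3]);
    assert_eq!(matching(&a, &b), 2.0); // positions 2 and 3 differ
    assert_eq!(hamming(&a, &b), 0.5);  // 2 mismatches out of 4 features
    // Sets {0,1,2} vs {0,1,3}: intersection 2, union 4 -> distance 0.5.
    assert!((jaccard(&a, &b) - 0.5).abs() < 1e-12);
}
```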
## Performance
Kategorize is designed for performance with:
- Incremental mode updates: Cached frequency tracking eliminates expensive mode recomputation (30-50% speedup)
- Parallel processing: Using Rayon for multiple initialization runs
- Efficient data structures: Minimizing memory allocation in hot paths
- Optimized distance calculations: Fast categorical data distance computation
- Early convergence detection: Smart stopping criteria reduce unnecessary iterations
### Incremental Mode Updates (New in v0.3.0)
By default, Kategorize uses incremental mode updates for significant performance improvements. This feature caches frequency counts for each cluster and updates them incrementally as points change assignments, avoiding expensive mode recomputation.
```rust
// Parameter values below are illustrative.
let kmodes = KModes::new(3)
    .use_incremental_updates(true)  // enabled by default
    .random_state(42);

// For comparison, disable to use the classic algorithm:
let classic_kmodes = KModes::new(3)
    .use_incremental_updates(false)
    .random_state(42);
```
Performance benefits vary by dataset size:
- Small datasets (100-1000 points): 25-40% speedup
- Medium datasets (1000-10000 points): 40-60% speedup
- Large datasets (10000+ points): 60%+ speedup
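The cached-frequency idea behind these numbers can be sketched as a small standalone structure (illustrative, not Kategorize's internal implementation): per-cluster, per-column counts are adjusted when a point joins or leaves, so the mode is read from the cache instead of rescanning the whole cluster.

```rust
use std::collections::HashMap;

/// Per-cluster frequency cache: counts[column][category] = member count.
struct FrequencyCache {
    counts: Vec<HashMap<u32, usize>>,
}

impl FrequencyCache {
    fn new(n_cols: usize) -> Self {
        Self { counts: vec![HashMap::new(); n_cols] }
    }

    /// O(n_features) update when a point is assigned to this cluster.
    fn add(&mut self, point: &[u32]) {
        for (j, &v) in point.iter().enumerate() {
            *self.counts[j].entry(v).or_insert(0) += 1;
        }
    }

    /// O(n_features) update when a point leaves this cluster.
    fn remove(&mut self, point: &[u32]) {
        for (j, &v) in point.iter().enumerate() {
            if let Some(c) = self.counts[j].get_mut(&v) {
                *c -= 1;
            }
        }
    }

    /// Current mode of each column, read straight from the cache.
    fn mode(&self) -> Vec<u32> {
        self.counts
            .iter()
            .map(|col| *col.iter().max_by_key(|&(_, c)| c).unwrap().0)
            .collect()
    }
}

fn main() {
    let mut cache = FrequencyCache::new(2);
    for p in [[0, 1], [0, 1], [1, 2], [1, 2], [1, 2]] {
        cache.add(&p);
    }
    assert_eq!(cache.mode(), vec![1, 2]);
    // Reassigning points updates the mode without a full rescan.
    cache.remove(&[1, 2]);
    cache.remove(&[1, 2]);
    assert_eq!(cache.mode(), vec![0, 1]);
}
```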
## Examples
Check out the examples directory for comprehensive usage patterns:
- `basic_kmodes.rs` - Basic k-modes clustering
- `kprototypes_mixed_data.rs` - Mixed data clustering
- `advanced_usage.rs` - Parameter tuning and optimization
- `jaccard_distance.rs` - Jaccard distance metric usage
Run examples with:

```sh
cargo run --example basic_kmodes
```
## Benchmarks

Run benchmarks to see performance characteristics:

```sh
cargo bench
```
This will run comprehensive benchmarks testing:
- Different data sizes and cluster counts
- Initialization method performance
- Scaling characteristics
- Parameter sensitivity
## API Documentation

For detailed API documentation, visit [docs.rs/kategorize](https://docs.rs/kategorize) or run:

```sh
cargo doc --open
```
## Comparison with Python kmodes
Kategorize provides similar functionality to the popular Python kmodes library but with:
- Better performance: Rust's zero-cost abstractions and memory safety
- Type safety: Compile-time guarantees about data types and dimensions
- Parallel processing: Built-in support for multi-threading
- Memory efficiency: Lower memory footprint and no GIL limitations
## Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Areas where contributions would be particularly valuable:
- Additional initialization methods
- More distance metrics for categorical data
- Incremental/online clustering algorithms
- Integration with other ML libraries
## License
This project is licensed under the MIT License.
## Citation
If you use Kategorize in your research, please cite:
## References
- Huang, Z. (1997). A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining. SIGMOD Record.
- Huang, Z. (1998). Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery.
- Cao, F., Liang, J., Bai, L. (2009). A new initialization method for categorical data clustering. Expert Systems with Applications.