# Kategorize

A fast, memory-efficient Rust implementation of the k-modes and k-prototypes clustering algorithms for categorical and mixed data.
## Features
- K-modes clustering: Designed specifically for categorical data
- K-prototypes clustering: Handles mixed categorical and numerical data
- Multiple initialization methods: Huang, Cao, and random initialization
- Multiple distance metrics: Matching, Hamming, and Jaccard distance support
- Parallel processing: Multi-threaded execution for better performance
- Memory efficient: Optimized data structures and algorithms
- Comprehensive: Full-featured with proper error handling and validation
- Well tested: Extensive unit tests, integration tests, and benchmarks
## Quick Start

Add this to your `Cargo.toml`:

```toml
[dependencies]
kategorize = "0.1"
```
### Basic K-modes Clustering

```rust
use kategorize::{KModes, InitMethod, DistanceMetric};
use ndarray::Array2;

// Create categorical data (4 samples, 2 features; values are category codes)
let data = Array2::from_shape_vec((4, 2), vec![0, 1, 0, 2, 1, 1, 1, 2]).unwrap();

// Configure and run k-modes clustering
let kmodes = KModes::new(2)
    .init_method(InitMethod::Huang)
    .distance_metric(DistanceMetric::Matching) // Choose distance metric
    .max_iter(100)
    .n_init(10)
    .random_state(42);

let result = kmodes.fit(&data).unwrap();
println!("Labels: {:?}", result.labels);
println!("Centroids: {:?}", result.centroids);
println!("Cost: {}", result.cost);
```
### K-prototypes for Mixed Data

```rust
use kategorize::KPrototypes;
use ndarray::Array2;

// Create mixed categorical and numerical data
let data = Array2::from_shape_vec(
    (4, 3),
    vec![
        0.0, 1.0, 2.5,
        1.0, 0.0, 3.1,
        0.0, 1.0, 2.7,
        1.0, 0.0, 3.4,
    ],
).unwrap();

let kprototypes = KPrototypes::new(2, vec![0, 1], vec![2]) // categorical: [0,1], numerical: [2]
    .gamma(0.5) // weight for numerical vs categorical features
    .random_state(42);

let result = kprototypes.fit(&data).unwrap();
println!("Labels: {:?}", result.labels);
```
## Algorithms

### K-modes
K-modes extends k-means clustering to categorical data by:
- Using simple matching dissimilarity instead of Euclidean distance
- Replacing means with modes (most frequent values) for centroid updates
- Supporting various initialization strategies for better convergence
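The two core operations, matching dissimilarity and the mode-based centroid update, can be sketched in a few lines of plain Rust (standalone illustrative helpers, not the crate's internals):

```rust
use std::collections::HashMap;

/// Simple matching dissimilarity: the number of positions where two
/// categorical rows disagree.
fn matching_dissimilarity(a: &[u32], b: &[u32]) -> usize {
    a.iter().zip(b).filter(|(x, y)| x != y).count()
}

/// Mode (most frequent value) of one categorical column, used in place
/// of the mean when updating a centroid.
fn mode(values: &[u32]) -> u32 {
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &v in values {
        *counts.entry(v).or_insert(0) += 1;
    }
    counts.into_iter().max_by_key(|&(_, c)| c).map(|(v, _)| v).unwrap()
}

fn main() {
    let a = [0, 1, 2];
    let b = [0, 2, 2];
    // Rows differ only in position 1, so the dissimilarity is 1.
    assert_eq!(matching_dissimilarity(&a, &b), 1);
    // Value 1 occurs three times, more than any other value.
    assert_eq!(mode(&[1, 1, 3, 1, 2]), 1);
    println!("dissimilarity = {}", matching_dissimilarity(&a, &b));
}
```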
### K-prototypes
K-prototypes combines k-modes and k-means to handle mixed data by:
- Computing separate distances for categorical and numerical features
- Using a γ (gamma) parameter to balance the two distance components
- Updating centroids using modes for categorical and means for numerical features
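The combined distance is the matching cost on the categorical part plus γ times the squared Euclidean cost on the numerical part. A minimal standalone sketch (hypothetical helper, not the crate's API):

```rust
/// K-prototypes distance between a point and a prototype:
/// categorical mismatch count + gamma * squared Euclidean distance.
fn kprototypes_distance(
    x_cat: &[u32], q_cat: &[u32], // categorical parts of point and prototype
    x_num: &[f64], q_num: &[f64], // numerical parts
    gamma: f64,                   // weight balancing the two components
) -> f64 {
    let cat = x_cat.iter().zip(q_cat).filter(|(a, b)| a != b).count();
    let num: f64 = x_num.iter().zip(q_num).map(|(a, b)| (a - b).powi(2)).sum();
    cat as f64 + gamma * num
}

fn main() {
    // One categorical mismatch (cost 1) plus 0.5 * (2.0 - 1.0)^2 = 0.5.
    let d = kprototypes_distance(&[0, 1], &[0, 2], &[2.0], &[1.0], 0.5);
    assert!((d - 1.5).abs() < 1e-12);
    println!("distance = {d}");
}
```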
## Initialization Methods
- Huang: Density-based initialization selecting most frequent data points
- Cao: Improved initialization maximizing dissimilarity between initial centroids
- Random: Randomly selected data points as initial centroids
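As a rough illustration of the dissimilarity-maximizing idea behind Cao-style initialization, here is a farthest-first sketch under the matching distance. It is simplified (the published method also weights candidates by density) and every name here is illustrative, not the crate's code:

```rust
/// Number of mismatching positions between two categorical rows.
fn matching(a: &[u32], b: &[u32]) -> usize {
    a.iter().zip(b).filter(|(x, y)| x != y).count()
}

/// Greedy farthest-first selection: after seeding with the first row,
/// each next centroid maximizes its distance to the closest chosen one.
fn farthest_first(data: &[Vec<u32>], k: usize) -> Vec<usize> {
    let mut chosen = vec![0usize]; // seed with the first row for simplicity
    while chosen.len() < k {
        let next = (0..data.len())
            .max_by_key(|&i| {
                chosen.iter().map(|&c| matching(&data[i], &data[c])).min().unwrap()
            })
            .unwrap();
        chosen.push(next);
    }
    chosen
}

fn main() {
    let data: Vec<Vec<u32>> = vec![
        vec![0, 0, 0],
        vec![0, 0, 1],
        vec![2, 2, 2],
        vec![2, 0, 1],
    ];
    let centers = farthest_first(&data, 2);
    // Row 2 disagrees with the seed row 0 in all three positions.
    assert_eq!(centers, vec![0, 2]);
    println!("chosen centroid rows: {:?}", centers);
}
```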
## Distance Metrics
- Matching: Simple matching dissimilarity (0 for match, 1 for mismatch)
- Hamming: Normalized matching distance (proportion of mismatches)
- Jaccard: Set-based similarity, ideal for data with repeated values or tag-like categories
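For two example rows, the three metrics can be computed by hand. The sketch below uses standalone helper functions (illustrative, not the crate's API); note that the Jaccard variant compares the *sets* of values, so repeated values in a row collapse:

```rust
use std::collections::HashSet;

/// Matching: count of mismatching positions.
fn matching(a: &[u32], b: &[u32]) -> f64 {
    a.iter().zip(b).filter(|(x, y)| x != y).count() as f64
}

/// Hamming: matching count normalized by row length.
fn hamming(a: &[u32], b: &[u32]) -> f64 {
    matching(a, b) / a.len() as f64
}

/// Jaccard distance: 1 minus |intersection| / |union| of the value sets.
fn jaccard(a: &[u32], b: &[u32]) -> f64 {
    let sa: HashSet<_> = a.iter().collect();
    let sb: HashSet<_> = b.iter().collect();
    let inter = sa.intersection(&sb).count() as f64;
    let union = sa.union(&sb).count() as f64;
    1.0 - inter / union
}

fn main() {
    let a = [1, 2, 2, 3];
    let b = [1, 2, 4, 4];
    assert_eq!(matching(&a, &b), 2.0); // positions 2 and 3 differ
    assert_eq!(hamming(&a, &b), 0.5);  // 2 mismatches out of 4 positions
    // Sets {1,2,3} vs {1,2,4}: intersection 2, union 4 -> distance 0.5.
    assert_eq!(jaccard(&a, &b), 0.5);
}
```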
## Performance
Kategorize is designed for performance with:
- Parallel processing using Rayon for multiple initialization runs
- Efficient data structures minimizing memory allocation
- Optimized distance calculations for categorical data
- Early convergence detection
## Examples

Check out the `examples/` directory for comprehensive usage patterns:

- `basic_kmodes.rs` - Basic k-modes clustering
- `kprototypes_mixed_data.rs` - Mixed data clustering
- `advanced_usage.rs` - Parameter tuning and optimization
- `jaccard_distance.rs` - Jaccard distance metric usage

Run examples with:

```bash
cargo run --example basic_kmodes
```
## Benchmarks

Run benchmarks to see performance characteristics:

```bash
cargo bench
```
This will run comprehensive benchmarks testing:
- Different data sizes and cluster counts
- Initialization method performance
- Scaling characteristics
- Parameter sensitivity
## API Documentation

For detailed API documentation, visit [docs.rs/kategorize](https://docs.rs/kategorize) or run:

```bash
cargo doc --open
```
## Comparison with Python kmodes
Kategorize provides similar functionality to the popular Python kmodes library but with:
- Better performance: compiled code with Rust's zero-cost abstractions
- Type safety: Compile-time guarantees about data types and dimensions
- Parallel processing: Built-in support for multi-threading
- Memory efficiency: Lower memory footprint and no GIL limitations
## Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Areas where contributions would be particularly valuable:
- Additional initialization methods
- More distance metrics for categorical data
- Incremental/online clustering algorithms
- Integration with other ML libraries
## License
This project is licensed under the MIT License.
## Citation
If you use Kategorize in your research, please cite:
## References
- Huang, Z. (1997). A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining. SIGMOD Record.
- Huang, Z. (1998). Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery.
- Cao, F., Liang, J., Bai, L. (2009). A new initialization method for categorical data clustering. Expert Systems with Applications.