Multithreaded String Clustering
This crate provides a scalable string clustering implementation.
Strings are aggregated into clusters based on pairwise Levenshtein distance. If the distance is below a set fraction of the shorter string's length, the strings are added to the same cluster.
Multithreading model
- The input strings are evenly paritioned across the set of allocated threads.
- Once each thread has clustered its associated input strings, result aggregation is started.
- Clusters are merged in pairs accross multiple threads in a manner that is similar to traversing a binary tree from the leaves up to the root. The root of the tree is the final clustering.
- Thus, if there are N threads allocated, there will be ceil(log2(N)) merge operations.
Optimisation
A key optimisation made to significantly improve the performance of the implementation is the use of transitivity in the clustering operation.
If string A is clustered with string B and string B clustered with string C, strings A and C will be clustered together. This optimisation often provides significant runtime performance benefits with negligible impact on clustering performance.
There are a number of instances in which this optimisation will result in poor clustering performance. As a result, if this property cannot be exploited on the desired input data, another implementation should be used.
Installation
Add this to your Cargo.toml
[]
= "0.1.1"
Examples
Basic usage:
# use Error;
#
#
Multiple threads:
# use Error;
#
#