1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
//! # Multithreaded String Clustering
//! This crate provides a scalable string clustering implementation.
//!
//! Strings are aggregated into clusters based on pairwise Levenshtein
//! distance. If the distance is below a set fraction of the shorter string's
//! length, the strings are added to the same cluster.
//!
//! # Multithreading model
//! * The input strings are evenly paritioned across the set of allocated
//! threads.
//! * Once each thread has clustered its associated input strings, result aggregation is
//! started.
//! * Clusters are merged in pairs accross multiple threads in a manner that is similar to
//! traversing a binary tree from the leaves up to the root. The root of the tree is the final
//! clustering.
//! * Thus, if there are N threads allocated, there will be ceil(log2(N)) merge operations.
//!
//! # Optimisation
//! A key optimisation made to significantly improve the performance of the implementation is the use
//! of transitivity in the clustering operation.
//!
//! If string A is clustered with string B and string B clustered with string C, strings A and C
//! will be clustered together. This optimisation often provides significant runtime performance benefits with
//! negligible impact on clustering performance.
//!
//! There are a number of instances in which this optimisation will result in poor clustering performance.
//! As a result, if this property cannot be exploited on the desired input data, another implementation
//! should be used.
//!
//! # Installation
//!
//! Add this to your Cargo.toml
//!
//! ```toml
//! [dependencies]
//! clustr = "0.1.2"
//! ```
//!
//! # Examples
//! Basic usage:
//! ```
//! # use std::error::Error;
//! #
//! # fn main() -> Result<(), clustr::ValueError> {
//! let inputs = vec!["aaaa", "aaax", "bbbb", "bbbz"];
//! let expected = vec![vec!["aaaa", "aaax"], vec!["bbbb", "bbbz"]];
//!
//! let clusters = clustr::cluster_strings(&inputs, 0.25, 1)?;
//!
//! assert_eq!(clusters, expected);
//! #
//! # Ok(())
//! # }
//! ```
//!
//! # Multiple threads:
//! ```
//! # use std::error::Error;
//! #
//! # fn main() -> Result<(), clustr::ValueError> {
//! let inputs = vec!["aa", "bb", "aa", "bb"];
//! let expected = vec![vec!["aa", "aa"], vec!["bb", "bb"]];
//!
//! let results = clustr::cluster_strings(&inputs, 0.0, 4)?;
//!
//! // Order of returned clusters nondeterministic
//! for e in expected {
//! assert!(results.contains(&e));
//! }
//! #
//! # Ok(())
//! # }
//! ```
use aggregate_results;
use form_clusters;
/// Validation errors. Errors associated with invalid function argument values.
/// Group similar input strings into clusters.
///
/// Strings will be grouped into a cluster if the Levenshtein distance between the
/// strings is below 'max_edit_frac' of the shorter string's length.
///
/// # Examples
/// Basic usage:
/// ```
/// # use std::error::Error;
/// #
/// # fn main() -> Result<(), clustr::ValueError> {
/// let inputs = vec!["aaaa", "aaax", "bbbb", "bbbz"];
/// let expected = vec![vec!["aaaa", "aaax"], vec!["bbbb", "bbbz"]];
///
/// let clusters = clustr::cluster_strings(&inputs, 0.25, 1)?;
///
/// assert_eq!(clusters, expected);
/// #
/// # Ok(())
/// # }
/// ```
///
/// # Multiple threads:
/// ```
/// # use std::error::Error;
/// #
/// # fn main() -> Result<(), clustr::ValueError> {
/// let inputs = vec!["aa", "bb", "aa", "bb"];
/// let expected = vec![vec!["aa", "aa"], vec!["bb", "bb"]];
///
/// let results = clustr::cluster_strings(&inputs, 0.0, 4)?;
///
/// // Order of returned clusters nondeterministic
/// for e in expected {
/// assert!(results.contains(&e));
/// }
/// #
/// # Ok(())
/// # }
/// ```