1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
// kbo: Spectral Burrows-Wheeler transform accelerated local alignment search
//
// Copyright 2024 Tommi Mäklin [tommi@maklin.fi].
// Copyrights in this project are retained by contributors. No copyright assignment
// is required to contribute to this project.
// Except as otherwise noted (below and/or in individual files), this
// project is licensed under the Apache License, Version 2.0
// <LICENSE-APACHE> or <http://www.apache.org/licenses/LICENSE-2.0> or
// the MIT license, <LICENSE-MIT> or <http://opensource.org/licenses/MIT>,
// at your option.
//
//! Derandomizing noisy _k_-bounded matching statistics.
use embed_doc_image;
/// Evaluates the CDF of _k_-bounded matching statistics random match distribution.
///
/// Computes the log-probability that a matching statistic with value
/// `t` or less that is the result of mapping a _k_-mer with
/// `alphabet_size` possible characters against an index containing
/// `n_kmers` _k_-mers was generated by chance.
///
/// This probability is given by the cumulative distribution function (CDF):
///
/// ![Formula: (1 - q^{t + 1})^n][latex_rmp],
///
/// where ![q = 1/s][latex_1os] with _s_ possible characters
/// `alphabet_size` and _n_ _k_-mers in the index
/// `n_kmers`. Derivation is given below.
///
/// # Distribution of prefix match lengths in random strings
///
/// Let _X_ be a random variable denoting the length _t_ of the
/// longest common prefix of two uniformly random and infinitely long
/// strings. If ![0 < p < 1][latex_p_bounds] is the probability that any pair of two
/// characters in the two strings are mismatched, then _X_ follows the
/// [geometric
/// distribution](https://en.wikipedia.org/wiki/Geometric_distribution)
/// with cumulative distribution function
///
/// ![P(X less or equal to t) = 1 - (1 - p)^{t + 1} with t a natural number][latex_geom_dist_cdf]
///
/// Suppose that an index contains _n_ uniformly random strings of
/// length _k_. Assuming that _k_ is large enough that the probability
/// that two strings from the index match by chance is neglible, the
/// longest common prefix of these two strings is reasonably
/// approximated by the distribution of the random variable _X_.
///
/// Now, let _M_ be a random variable denoting the length of
/// the longest common prefix between some string of length
/// _k_ and the entire index. Since the longest common
/// prefix of two pairs of strings is given by _X_,
/// _M_ is the maximum of _n_ independent random
/// variables with the same distribution as _X_ given by ![M
/// = max{X_1,...,X_n}][latex_M_max]. Because the variables
/// _X_ were assumed independent, the cumulative
/// distribution function of _M_ is their product
///
/// ![P(M less or equal to t) = P(X less or equal to t)^n = (1 - (1 - p)^{t + 1})^n][latex_M_cdf_1]
///
/// By noting that ![p = 1 - q][latex_p_to_q], we get the cumulative distribution function
///
/// ![P(M less or equal to t) = (1 - q^{t + 1})^n][latex_M_cdf_2]
///
/// Credit to [Jarno N. Alanko](https://jnalanko.net/) for deriving the random match distribution.
///
/// # Examples
/// ```rust
/// # use assert_approx_eq::assert_approx_eq;
/// use kbo::derandomize::log_rm_max_cdf;
///
/// let alphabet_size = 4;
/// let n_kmers = 20240921;
///
/// let res = log_rm_max_cdf(10, alphabet_size, n_kmers);
/// // `res` is -4.825812199808644
/// # assert_approx_eq!(res, -4.825812199808644, 1e-8);
/// ```
///
/// Determines a lower bound for non-random _k_-bounded matching statistic values.
///
/// Computes the probabilities that the possible values for the
/// _k_-bounded matching statistics (MS) of a _k_-mer with size `k`
/// mapped against an index with `n_kmers` total _k_-mers and
/// `alphabet_size` possible values at each character are random
/// matches. Computation terminates when the MS value that produces a
/// random match probability below `max_error_prob` is found and
/// returned.
///
/// If no MS value passes the check, the function returns `k` instead.
///
/// # Examples
/// ```rust
/// use kbo::derandomize::random_match_threshold;
///
/// let k = 31;
/// let n_kmers = 20240921;
/// let alphabet_size = 4;
/// let max_error_prob = 0.01_f64;
///
/// let threshold = random_match_threshold(k, n_kmers, alphabet_size, max_error_prob);
/// // `threshold` is 15
/// # assert_eq!(threshold, 15);
/// ```
/// Derandomizes a single noisy _k_-bounded matching statistic.
///
/// Derandomizes the `current_noisy_ms` matching statistic (MS) based
/// on the `next_derand_ms` value obtained from the output of this
/// function for the next noisy MS when read left-to-right, the
/// _k_-mer size `k`, and the `threshold` which specifies a lower
/// bound to consider the MS a non-random match.
///
/// Positive values of the output i64 value mean that i64 characters
/// from the beginning of the k-mer match the reference, ie. same as
/// the MS, while negative values denote distance from the last
/// character in the last _k_-mer that produced a match.
///
/// # Examples
/// ## Noisy MS has only matches
/// ```rust
/// use kbo::derandomize::derandomize_ms_val;
///
/// // Parameters : k = 3, threshold = 2
/// //
/// // Noisy MS : 1,2,3,3,3
/// // Derandomized MS : 1,2,3,3,3
/// // Testing this pos : |
///
/// let derand_ms = derandomize_ms_val(3, 3, 2, 3);
/// // `derand_ms` is 3
/// # assert_eq!(derand_ms, 3);
/// ```
///
/// ## Noisy MS has only noise
/// ```rust
/// use kbo::derandomize::derandomize_ms_val;
///
/// // Parameters : k = 3, threshold = 2
/// //
/// // Noisy MS : 0, 0, 2, 1,0
/// // Derandomized MS : -4,-3,-2,-1,0
/// // Testing this pos : |
///
/// let derand_ms = derandomize_ms_val(2, -1, 2, 3);
/// // `derand_ms` is -2
/// # assert_eq!(derand_ms, -2);
/// ```
///
/// ## Noisy MS is at beginning of a full _k_-mer match
/// ```rust
/// use kbo::derandomize::derandomize_ms_val;
///
/// // Parameters : k = 3, threshold = 2
/// //
/// // Noisy MS : 1,2,3, 1,2
/// // Derandomized MS : 1,2,3,-1,0
/// // Testing this pos : |
///
/// let derand_ms = derandomize_ms_val(3, -1, 2, 3);
/// // `derand_ms` is 3
/// # assert_eq!(derand_ms, 3);
/// ```
///
/// ## Noisy MS is at beginning of a partial _k_-mer match
/// ```rust
/// use kbo::derandomize::derandomize_ms_val;
///
/// // Parameters : k = 4, threshold = 2
/// //
/// // Noisy MS : 1,2,3,-1,0,1,2,3,4,4
/// // Derandomized MS : 1,2,3,-1,0,1,2,3,4,4
/// // Testing this pos : |
///
/// let derand_ms = derandomize_ms_val(3, -1, 2, 4);
/// // `derand_ms` is 3
/// # assert_eq!(derand_ms, 3);
/// ```
///
/// Derandomizes a sequence of noisy _k_-bounded matching statistics.
///
/// Iterates over a sequence of noisy _k_-bounded matching statistics
/// `ms` in reverse to identify values that are the result of random
/// matching between _k_-mers of size `k` and an index that the lower
/// bound `threshold` was calculated for.
///
/// # Examples
/// ```rust
/// use kbo::derandomize::derandomize_ms_vec;
///
/// let k = 3;
/// let threshold = 2;
/// let noisy_ms = vec![1,2,2,3,2,2,3,2,1,2,3,1,1,1,2,3,1,2];
///
/// let derand_ms = derandomize_ms_vec(&noisy_ms, k, threshold);
/// // `derand_ms` has [0,1,2,3,1,2,3,0,1,2,3,-1,0,1,2,3,-1,0]
/// # assert_eq!(derand_ms, vec![0,1,2,3,1,2,3,0,1,2,3,-1,0,1,2,3,-1,0]);
/// ```
///
////////////////////////////////////////////////////////////////////////////////
// Tests
//