1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
//! # Sassy: fast approximate string matching
//!
//! Sassy is a library for searching approximate matches of short patterns/queries in longer texts.
//! It supports ASCII and DNA, and works best for patterns of length up to 1000.
//!
//! The main entrypoint is the [`Searcher`] object.
//! This can be configured with the alphabet ([`profiles::Ascii`], [`profiles::Dna`] (may panic on `N`), or [`profiles::Iupac`]),
//! whether to search the reverse complement ([`Searcher::new_fwd`], [`Searcher::new_rc`]),
//! and optionally with an _overhang cost_ for IUPAC profiles ([`Searcher::new_fwd_with_overhang`]).
//!
//! Given a [`Searcher`], you can search call [`Searcher::search`] with a pattern, text, and maximum edit distance `k`.
//! This will return a vector of [`Match`] objects, that each contain the substring of text they match, the corresponding `cost`,
//! and the `cigar` string that describes the alignment.
//!
//! ## `search` vs `search_all`
//!
//! Sassy faithfully implements 'Approximate String Matching' as defined by Navarro (2001):
//! it finds all positions in the text where an alignment of the pattern with cost <= `k` ends,
//! and returns those positions, with a traceback/alignment for each. This is covered by the [`Searcher::search_all`] mode.
//!
//! In practice, this can return a lot of overlapping matches. For example,
//! searching `ABC` in `XXXABCXXX` aligns with cost `333210123` after every
//! character. If we have `k=1`, `search_all` will return 3 matches with `AB` (cost 1), `ABC` (cost 0), and `ABCX` (cost 1).
//! [`Searcher::search`] only returns a match at the _rightmost_ position of each local minimum, which in this case is only `ABC`.
//!
//! See the paper (linked on [GitHub](https://github.com/RagnarGrootKoerkamp/sassy)) for details.
//!
//! ## Reverse complements
//!
//! When searching reverse complements, by default Sassy searches the pattern in both the forward and reverse-complement text.
//! Only Sassy2 ([`Searcher::search_encoded_patterns`]) instead searches the reverse complement of the pattern in the forward text.
//!
//! This has implications for the set of matches that is returned. By default,
//! there is at most 1 match per text _end_ position of each RC match, but in v2,
//! there is at most 1 match per text _start_ position of each RC match.
//!
//! ## Traceback
//!
//! Sassy uses a traceback that greedily prefers matches and substitutions, followed by deletions, followed by insertions.
//! Note that this does not guarantee to find an alignment with the minimal number of substitutions.
//!
//! In case more precise alignments are needed, e.g. via affine-cost alignments, it is recommended to run sassy with [`Searcher::without_trace`].
//! Then, each match only contains the end position in the text and pattern, and the user can manually run a backwards extension-alignment from that point.
//! In case of a reverse-complement match, in v1, the match in against `RC(text[text_start..])`, while in v2 (`search_encoded_patterns`),
//! the match is against `RC(pattern[pattern_start..])`.
//!
//! ## Example
//!
//! Usage example:
//! ```
//! use sassy::{
//! CachedRev, Match, Searcher, Strand,
//! profiles::{Dna, Iupac},
//! };
//!
//! // --- Test data ---
//! let pattern = b"ATCG";
//! let text = b"CCCATCACCC";
//! let k = 1;
//!
//! // --- FWD only search ---
//! /*
//! CCCATCACCC (text)
//! |||x
//! ATCG (pattern)
//! */
//! let mut searcher = Searcher::<Dna>::new_fwd();
//! let matches = searcher.search(pattern, &text, k);
//!
//! assert_eq!(matches.len(), 1);
//! let fwd_match = matches[0].clone();
//! assert_eq!(fwd_match.pattern_start, 0);
//! assert_eq!(fwd_match.pattern_end, 4);
//! assert_eq!(fwd_match.text_start, 3);
//! assert_eq!(fwd_match.text_end, 7);
//! assert_eq!(fwd_match.cost, 1);
//! assert_eq!(fwd_match.strand, Strand::Fwd);
//! assert_eq!(fwd_match.cigar.to_string(), "3=1X");
//!
//! // --- FWD + RC search ---
//! /*
//! CCCATCACCC (text) GGGTGATGGG (text - rc)
//! |||x ||x|
//! ATCG (pattern) ATCG
//! */
//! let mut searcher = Searcher::<Dna>::new_rc();
//! // We can cache the reverse text if we do multiple pattern
//! // searches in `rc` mode
//! let cached_text = CachedRev::new(text, true);
//! let matches = searcher.search(pattern, &cached_text, k);
//!
//! // Gives two matches, of which one the previous forward match
//! assert_eq!(matches.len(), 2);
//! assert_eq!(matches[0], fwd_match);
//! let rc_match = matches[1].clone();
//! assert_eq!(rc_match.pattern_start, 0);
//! assert_eq!(rc_match.pattern_end, 4);
//! assert_eq!(rc_match.text_start, 1);
//! assert_eq!(rc_match.text_end, 5);
//! assert_eq!(rc_match.cost, 1);
//! assert_eq!(rc_match.strand, Strand::Rc);
//! assert_eq!(rc_match.cigar.to_string(), "2=1X1=");
//!
//! // --- FWD + RC search with overhang ---
//! /*
//! GTXXXNNN (text)
//! || |||
//! ACGT ACGT (pattern)
//! */
//! let pattern = b"ACGT";
//! let text = b"GTXXXNNN";
//! let alpha = 0.5;
//! let k = 1;
//! let mut searcher = Searcher::<Iupac>::new_fwd_with_overhang(alpha);
//! let matches = searcher.search(pattern, &text, k);
//!
//! assert_eq!(matches[0].pattern_start, 2);
//! assert_eq!(matches[0].pattern_end, 4);
//! assert_eq!(matches[0].text_start, 0);
//! assert_eq!(matches[0].text_end, 2);
//! assert_eq!(matches[0].cost, 1);
//! assert_eq!(matches[0].strand, Strand::Fwd);
//! assert_eq!(matches[0].cigar.to_string(), "2=");
//!
//! assert_eq!(matches[1].pattern_start, 0);
//! assert_eq!(matches[1].pattern_end, 3);
//! assert_eq!(matches[1].text_start, 5);
//! assert_eq!(matches[1].text_end, 8);
//! assert_eq!(matches[1].cost, 0);
//! assert_eq!(matches[1].strand, Strand::Fwd);
//! assert_eq!(matches[1].cigar.to_string(), "3=");
//! ```
// INTERNAL MODS
// (PARTIALLY) PUBLIC MODS
pub use EncodedPatterns;
pub use CachedRev;
pub use Match;
pub use RcSearchAble;
pub use Searcher;
pub use Strand;
// BINDINGS
// TYPEDEFS
const LANES: usize = 4;
type S = u64x4;
const LANES: usize = 8;
type S = u64x8;
/// Print info on CPU features and speed of searching.
/// Print throughput in GB/s for random pattern and text.