Expand description
§Sassy: fast approximate string matching
Sassy is a library for searching approximate matches of short patterns/queries in longer texts. It supports ASCII and DNA, and works best for patterns of length up to 1000.
The main entrypoint is the Searcher object.
This can be configured with the alphabet (profiles::Ascii, profiles::Dna (may panic on N), or profiles::Iupac),
whether to search the reverse complement (Searcher::new_fwd, Searcher::new_rc),
and optionally with an overhang cost for IUPAC profiles (Searcher::new_fwd_with_overhang).
Given a Searcher, you can search call Searcher::search with a pattern, text, and maximum edit distance k.
This will return a vector of Match objects, that each contain the substring of text they match, the corresponding cost,
and the cigar string that describes the alignment.
§search vs search_all
Sassy faithfully implements ‘Approximate String Matching’ as defined by Navarro (2001):
it finds all positions in the text where an alignment of the pattern with cost <= k ends,
and returns those positions, with a traceback/alignment for each. This is covered by the Searcher::search_all mode.
In practice, this can return a lot of overlapping matches. For example,
searching ABC in XXXABCXXX aligns with cost 333210123 after every
character. If we have k=1, search_all will return 3 matches with AB (cost 1), ABC (cost 0), and ABCX (cost 1).
Searcher::search only returns a match at the rightmost position of each local minimum, which in this case is only ABC.
See the paper (linked on GitHub) for details.
§Reverse complements
When searching reverse complements, by default Sassy searches the pattern in both the forward and reverse-complement text.
Only Sassy2 (Searcher::search_encoded_patterns) instead searches the reverse complement of the pattern in the forward text.
This has implications for the set of matches that is returned. By default, there is at most 1 match per text end position of each RC match, but in v2, there is at most 1 match per text start position of each RC match.
§Traceback
Sassy uses a traceback that greedily prefers matches and substitutions, followed by deletions, followed by insertions. Note that this does not guarantee to find an alignment with the minimal number of substitutions.
In case more precise alignments are needed, e.g. via affine-cost alignments, it is recommended to run sassy with Searcher::without_trace.
Then, each match only contains the end position in the text and pattern, and the user can manually run a backwards extension-alignment from that point.
In case of a reverse-complement match, in v1, the match in against RC(text[text_start..]), while in v2 (search_encoded_patterns),
the match is against RC(pattern[pattern_start..]).
§Example
Usage example:
use sassy::{
CachedRev, Match, Searcher, Strand,
profiles::{Dna, Iupac},
};
// --- Test data ---
let pattern = b"ATCG";
let text = b"CCCATCACCC";
let k = 1;
// --- FWD only search ---
/*
CCCATCACCC (text)
|||x
ATCG (pattern)
*/
let mut searcher = Searcher::<Dna>::new_fwd();
let matches = searcher.search(pattern, &text, k);
assert_eq!(matches.len(), 1);
let fwd_match = matches[0].clone();
assert_eq!(fwd_match.pattern_start, 0);
assert_eq!(fwd_match.pattern_end, 4);
assert_eq!(fwd_match.text_start, 3);
assert_eq!(fwd_match.text_end, 7);
assert_eq!(fwd_match.cost, 1);
assert_eq!(fwd_match.strand, Strand::Fwd);
assert_eq!(fwd_match.cigar.to_string(), "3=1X");
// --- FWD + RC search ---
/*
CCCATCACCC (text) GGGTGATGGG (text - rc)
|||x ||x|
ATCG (pattern) ATCG
*/
let mut searcher = Searcher::<Dna>::new_rc();
// We can cache the reverse text if we do multiple pattern
// searches in `rc` mode
let cached_text = CachedRev::new(text, true);
let matches = searcher.search(pattern, &cached_text, k);
// Gives two matches, of which one the previous forward match
assert_eq!(matches.len(), 2);
assert_eq!(matches[0], fwd_match);
let rc_match = matches[1].clone();
assert_eq!(rc_match.pattern_start, 0);
assert_eq!(rc_match.pattern_end, 4);
assert_eq!(rc_match.text_start, 1);
assert_eq!(rc_match.text_end, 5);
assert_eq!(rc_match.cost, 1);
assert_eq!(rc_match.strand, Strand::Rc);
assert_eq!(rc_match.cigar.to_string(), "2=1X1=");
// --- FWD + RC search with overhang ---
/*
GTXXXNNN (text)
|| |||
ACGT ACGT (pattern)
*/
let pattern = b"ACGT";
let text = b"GTXXXNNN";
let alpha = 0.5;
let k = 1;
let mut searcher = Searcher::<Iupac>::new_fwd_with_overhang(alpha);
let matches = searcher.search(pattern, &text, k);
assert_eq!(matches[0].pattern_start, 2);
assert_eq!(matches[0].pattern_end, 4);
assert_eq!(matches[0].text_start, 0);
assert_eq!(matches[0].text_end, 2);
assert_eq!(matches[0].cost, 1);
assert_eq!(matches[0].strand, Strand::Fwd);
assert_eq!(matches[0].cigar.to_string(), "2=");
assert_eq!(matches[1].pattern_start, 0);
assert_eq!(matches[1].pattern_end, 3);
assert_eq!(matches[1].text_start, 5);
assert_eq!(matches[1].text_end, 8);
assert_eq!(matches[1].cost, 0);
assert_eq!(matches[1].strand, Strand::Fwd);
assert_eq!(matches[1].cigar.to_string(), "3=");Modules§
Structs§
- Cached
Rev - Struct that computes the reverse once on construction.
- Match
- A match of the pattern against the text.
- Searcher
- The main entry point for searching.
Enums§
- Encoded
Patterns - Strand
- Strand of a match. If Rc, pattern matches the reverse complement text.
Traits§
- RcSearch
Able - A trait for sequences that can cache their reverse.
Functions§
- test_
cpu_ features - Print info on CPU features and speed of searching.
- test_
throughput - Print throughput in GB/s for random pattern and text.