Expand description
§Sassy: fast approximate string matching
Sassy is a library for searching approximate matches of short patterns/queries in longer texts. It supports ASCII and DNA, and works best for patterns of length up to 1000.
The main entrypoint is the Searcher
object.
This can be configured with the alphabet (profiles::Ascii
, profiles::Dna
, or profiles::Iupac
),
whether to search the reverse complement (Searcher::new_fwd
, Searcher::new_rc
),
and optionally with an overhang cost for IUPAC profiles (Searcher::new_fwd_with_overhang
).
Given a Searcher
, you can search call Searcher::search
with a pattern, text, and maximum edit distance k
.
This will return a vector of Match
objects, that each contain the substring of text they match, the corresponding cost
,
and the cigar
string that describes the alignment.
§search
vs search_all
By default, Searcher::search
will only return matches that end in a rightmost local minimum.
Searcher::search_all
, on the other hand, always returns matches in all end-positions with cost <= k
.
See the paper (linked on GitHub) for details.
§Example
Usage example:
use sassy::{
CachedRev, Match, Searcher, Strand,
profiles::{Dna, Iupac},
};
// --- Test data ---
let pattern = b"ATCG";
let text = b"CCCATCACCC";
let k = 1;
// --- FWD only search ---
/*
CCCATCACCC (text)
|||x
ATCG (pattern)
*/
let mut searcher = Searcher::<Dna>::new_fwd();
let matches = searcher.search(pattern, &text, k);
assert_eq!(matches.len(), 1);
let fwd_match = matches[0].clone();
assert_eq!(fwd_match.pattern_start, 0);
assert_eq!(fwd_match.pattern_end, 4);
assert_eq!(fwd_match.text_start, 3);
assert_eq!(fwd_match.text_end, 7);
assert_eq!(fwd_match.cost, 1);
assert_eq!(fwd_match.strand, Strand::Fwd);
assert_eq!(fwd_match.cigar.to_string(), "3=X");
// --- FWD + RC search ---
/*
CCCATCACCC (text) GGGTGATGGG (text - rc)
|||x ||x|
ATCG (pattern) ATCG
*/
let mut searcher = Searcher::<Dna>::new_rc();
// We can cache the reverse text if we do multiple pattern
// searches in `rc` mode
let cached_text = CachedRev::new(text, true);
let matches = searcher.search(pattern, &cached_text, k);
// Gives two matches, of which one the previous forward match
assert_eq!(matches.len(), 2);
assert_eq!(matches[0], fwd_match);
let rc_match = matches[1].clone();
assert_eq!(rc_match.pattern_start, 0);
assert_eq!(rc_match.pattern_end, 4);
assert_eq!(rc_match.text_start, 1);
assert_eq!(rc_match.text_end, 5);
assert_eq!(rc_match.cost, 1);
assert_eq!(rc_match.strand, Strand::Rc);
assert_eq!(rc_match.cigar.to_string(), "2=X=");
// --- FWD + RC search with overhang ---
/*
GTXXXNNN (text)
|| |||
ACGT ACGT (pattern)
*/
let pattern = b"ACGT";
let text = b"GTXXXNNN";
let alpha = 0.5;
let k = 1;
let mut searcher = Searcher::<Iupac>::new_fwd_with_overhang(alpha);
let matches = searcher.search(pattern, &text, k);
assert_eq!(matches[0].pattern_start, 2);
assert_eq!(matches[0].pattern_end, 4);
assert_eq!(matches[0].text_start, 0);
assert_eq!(matches[0].text_end, 2);
assert_eq!(matches[0].cost, 1);
assert_eq!(matches[0].strand, Strand::Fwd);
assert_eq!(matches[0].cigar.to_string(), "2=");
assert_eq!(matches[1].pattern_start, 0);
assert_eq!(matches[1].pattern_end, 3);
assert_eq!(matches[1].text_start, 5);
assert_eq!(matches[1].text_end, 8);
assert_eq!(matches[1].cost, 0);
assert_eq!(matches[1].strand, Strand::Fwd);
assert_eq!(matches[1].cigar.to_string(), "3=");
Modules§
Structs§
- Cached
Rev - Struct that computes the reverse once on construction.
- Match
- A match of the pattern against the text.
- Searcher
- The main entry point for searching.
Enums§
- Strand
- Strand of a match. If Rc, pattern matches the reverse complement text.
Traits§
- RcSearch
Able - A trait for sequences that can cache their reverse.
Functions§
- test_
cpu_ features - Print info on CPU features and speed of searching.
- test_
throughput - Print throughput in GB/s for random pattern and text.