Crate sassy

Source
Expand description

§Sassy: fast approximate string matching

Sassy is a library for searching approximate matches of short patterns/queries in longer texts. It supports ASCII and DNA, and works best for patterns of length up to 1000.

The main entrypoint is the Searcher object. This can be configured with the alphabet (profiles::Ascii, profiles::Dna, or profiles::Iupac), whether to search the reverse complement (Searcher::new_fwd, Searcher::new_rc), and optionally with an overhang cost for IUPAC profiles (Searcher::new_fwd_with_overhang).

Given a Searcher, you can search call Searcher::search with a pattern, text, and maximum edit distance k. This will return a vector of Match objects, that each contain the substring of text they match, the corresponding cost, and the cigar string that describes the alignment.

§search vs search_all

By default, Searcher::search will only return matches that end in a rightmost local minimum. Searcher::search_all, on the other hand, always returns matches in all end-positions with cost <= k.

See the paper (linked on GitHub) for details.

§Example

Usage example:

use sassy::{
    CachedRev, Match, Searcher, Strand,
    profiles::{Dna, Iupac},
};

// --- Test data ---
let pattern = b"ATCG";
let text = b"CCCATCACCC";
let k = 1;

// --- FWD only search ---
/*
    CCCATCACCC (text)
       |||x
       ATCG    (pattern)
*/
let mut searcher = Searcher::<Dna>::new_fwd();
let matches = searcher.search(pattern, &text, k);

assert_eq!(matches.len(), 1);
let fwd_match = matches[0].clone();
assert_eq!(fwd_match.pattern_start, 0);
assert_eq!(fwd_match.pattern_end, 4);
assert_eq!(fwd_match.text_start, 3);
assert_eq!(fwd_match.text_end, 7);
assert_eq!(fwd_match.cost, 1);
assert_eq!(fwd_match.strand, Strand::Fwd);
assert_eq!(fwd_match.cigar.to_string(), "3=X");

// --- FWD + RC search ---
/*
    CCCATCACCC (text)       GGGTGATGGG (text - rc)
       |||x                      ||x|
       ATCG    (pattern)         ATCG
*/
let mut searcher = Searcher::<Dna>::new_rc();
// We can cache the reverse text if we do multiple pattern
// searches in `rc` mode
let cached_text = CachedRev::new(text, true);
let matches = searcher.search(pattern, &cached_text, k);

// Gives two matches, of which one the previous forward match
assert_eq!(matches.len(), 2);
assert_eq!(matches[0], fwd_match);
let rc_match = matches[1].clone();
assert_eq!(rc_match.pattern_start, 0);
assert_eq!(rc_match.pattern_end, 4);
assert_eq!(rc_match.text_start, 1);
assert_eq!(rc_match.text_end, 5);
assert_eq!(rc_match.cost, 1);
assert_eq!(rc_match.strand, Strand::Rc);
assert_eq!(rc_match.cigar.to_string(), "2=X=");

// --- FWD + RC search with overhang ---
/*
              GTXXXNNN     (text)
              ||   |||
            ACGT   ACGT    (pattern)
*/
let pattern = b"ACGT";
let text = b"GTXXXNNN";
let alpha = 0.5;
let k = 1;
let mut searcher = Searcher::<Iupac>::new_fwd_with_overhang(alpha);
let matches = searcher.search(pattern, &text, k);

assert_eq!(matches[0].pattern_start, 2);
assert_eq!(matches[0].pattern_end, 4);
assert_eq!(matches[0].text_start, 0);
assert_eq!(matches[0].text_end, 2);
assert_eq!(matches[0].cost, 1);
assert_eq!(matches[0].strand, Strand::Fwd);
assert_eq!(matches[0].cigar.to_string(), "2=");

assert_eq!(matches[1].pattern_start, 0);
assert_eq!(matches[1].pattern_end, 3);
assert_eq!(matches[1].text_start, 5);
assert_eq!(matches[1].text_end, 8);
assert_eq!(matches[1].cost, 0);
assert_eq!(matches[1].strand, Strand::Fwd);
assert_eq!(matches[1].cigar.to_string(), "3=");

Modules§

profiles

Structs§

CachedRev
Struct that computes the reverse once on construction.
Match
A match of the pattern against the text.
Searcher
The main entry point for searching.

Enums§

Strand
Strand of a match. If Rc, pattern matches the reverse complement text.

Traits§

RcSearchAble
A trait for sequences that can cache their reverse.

Functions§

test_cpu_features
Print info on CPU features and speed of searching.
test_throughput
Print throughput in GB/s for random pattern and text.