Sassy: SIMD-accelerated Approximate String Matching
Sassy is a library and tool for searching short strings in texts, a problem that goes by many names:
- approximate string matching,
- pattern matching,
- fuzzy searching.
The motivating application is searching short (length 20 to 100) DNA sequences in a human genome or e.g. in a set of reads. Sassy generally works well for patterns/queries up to length 1000, and supports both ASCII and DNA.
Highlights:
- Sassy uses bitpacking and SIMD. Its main novelty is tiling these in the text direction.
- Support for overhang alignments where the pattern extends beyond the text.
- Support for (case-insensitive) ASCII, DNA (
ACGT), and IUPAC (=ACGT+NYR...) alphabets. - Rust library (
cargo add sassy), binary (cargo install sassy), Python bindings (pip install sassy-rs), and C bindings (see below).
The paper can be found here, and evals are in evals/.
The main limitation is that currently AVX2 and BMI2 are required.
Usage
0. Rust library
A larger example can be found in src/lib.rs.
use ;
let pattern = b"ATCG";
let text = b"AAAATTGAAA";
let k = 1;
let mut searcher = new_fwd;
let matches = searcher.search;
assert_eq!;
assert_eq!;
assert_eq!;
assert_eq!;
assert_eq!;
assert_eq!;
1. Command-line interface (CLI)
Build and install using cargo:
Search a pattern ATGAGCA in text.fasta with ≤1 edit:
or search all records of a fasta file with --pattern-fasta <fasta-file> instead of --pattern.
For the alphabets see supported alphabets
CRISPR off-target search for guides in guides.txt:
Allows <= k edits in the sgRNA, and the PAM has to match exactly, unless
--allow-pam-edits is given.
2. Python bindings
PyPI wheels can be installed with:
= b
= b
= # ascii / dna / iupac
=
See python/README.md for more details.
3. C library
See c/README.md for details. Quick example:
int