Sassy: SIMD-accelerated Approximate String Matching
Sassy is a library and tool for searching short strings in texts, a problem that goes by many names:
- approximate string matching,
- pattern matching,
- fuzzy searching.
The motivating application is searching short (length 20 to 100) DNA sequences in a human genome or e.g. in a set of reads. Sassy generally works well for patterns/queries up to length 1000, and supports both ASCII and DNA.
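To make the problem statement concrete, here is a minimal reference sketch of approximate string matching using the classic quadratic dynamic program. It is not how Sassy works internally (Sassy uses bitpacking and SIMD); the function name `naive_search` and its return type are made up for illustration.

```rust
/// Naive approximate string matching (hypothetical helper, not part of Sassy).
/// Returns `(end, cost)` for every text prefix `text[..end]` that has a suffix
/// within edit distance `k` of `pattern`.
fn naive_search(pattern: &[u8], text: &[u8], k: usize) -> Vec<(usize, usize)> {
    let m = pattern.len();
    // col[i] = minimal cost of aligning pattern[..i] to a substring of the
    // text that ends at the current text position.
    let mut col: Vec<usize> = (0..=m).collect();
    let mut hits = Vec::new();
    for (j, &t) in text.iter().enumerate() {
        // col[0] stays 0 because a match may start anywhere in the text.
        let mut diag = col[0]; // value of row i-1 in the previous column
        for i in 1..=m {
            let substitute = diag + usize::from(pattern[i - 1] != t);
            let skip_text_char = col[i] + 1; // previous column, same row
            let skip_pattern_char = col[i - 1] + 1; // current column, row above
            diag = col[i];
            col[i] = substitute.min(skip_text_char).min(skip_pattern_char);
        }
        if col[m] <= k {
            hits.push((j + 1, col[m]));
        }
    }
    hits
}
```

Each text character costs time proportional to the pattern length here, which is the work that bit-parallel methods spread over the bits of machine words and SIMD lanes.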
Highlights:
- Sassy uses bitpacking and SIMD. Its main novelty is tiling these in the text direction.
- Support for overhang alignments where the pattern extends beyond the text.
- Support for (case-insensitive) ASCII, DNA (ACGT), and IUPAC (=ACGT+NYR...) alphabets.
- Rust library (cargo add sassy), binary (cargo install sassy), Python bindings (pip install sassy-rs), and C bindings (see below).
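As an illustration of what matching over the IUPAC alphabet can mean, the sketch below encodes each IUPAC code as a 4-bit set over A, C, G, T and treats two characters as matching when their sets overlap. This overlap rule is a common convention and is assumed here; the names `iupac_mask` and `iupac_match` are hypothetical and not Sassy's API.

```rust
/// Maps an IUPAC nucleotide code (case-insensitive) to a 4-bit set over
/// A, C, G, T. Returns 0 for bytes outside the IUPAC alphabet.
fn iupac_mask(c: u8) -> u8 {
    const A: u8 = 1;
    const C: u8 = 2;
    const G: u8 = 4;
    const T: u8 = 8;
    match c.to_ascii_uppercase() {
        b'A' => A,
        b'C' => C,
        b'G' => G,
        b'T' | b'U' => T,
        b'R' => A | G,
        b'Y' => C | T,
        b'S' => C | G,
        b'W' => A | T,
        b'K' => G | T,
        b'M' => A | C,
        b'B' => C | G | T,
        b'D' => A | G | T,
        b'H' => A | C | T,
        b'V' => A | C | G,
        b'N' => A | C | G | T,
        _ => 0,
    }
}

/// Two characters match if the base sets they denote share at least one base
/// (assumed convention, see the text above).
fn iupac_match(a: u8, b: u8) -> bool {
    (iupac_mask(a) & iupac_mask(b)) != 0
}
```

With this encoding, N matches every base, while R (A or G) and Y (C or T) never match each other.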
See the paper below, and corresponding evals in evals/.
Rick Beeloo and Ragnar Groot Koerkamp.
Sassy: Searching Short DNA Strings in the 2020s.
bioRxiv, July 2025. https://doi.org/10.1101/2025.07.22.666207.
The main limitation is that currently AVX2 and BMI2 are required.
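If you want to check up front whether a machine meets this requirement, the Rust standard library offers runtime CPU feature detection. The sketch below is one way to do that from your own code; the function name `has_avx2_and_bmi2` is made up, and this is not necessarily how Sassy itself checks for these features.

```rust
/// Returns true if the CPU running this program supports both AVX2 and BMI2.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn has_avx2_and_bmi2() -> bool {
    std::arch::is_x86_feature_detected!("avx2")
        && std::arch::is_x86_feature_detected!("bmi2")
}

/// On non-x86 targets neither instruction set exists.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn has_avx2_and_bmi2() -> bool {
    false
}
```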
Usage
0. Rust library
A larger example can be found in src/lib.rs.
use ;
let pattern = b"ATCG";
let text = b"AAAATTGAAA";
let k = 1;
let mut searcher = new_fwd;
let matches = searcher.search;
assert_eq!;
assert_eq!;
assert_eq!;
assert_eq!;
assert_eq!;
assert_eq!;
1. Command-line interface (CLI)
Build and install using cargo:
cargo install sassy
Search a pattern ATGAGCA in text.fasta with ≤1 edit:
or search all records of a fasta file with --pattern-fasta <fasta-file> instead of --pattern.
For the available alphabets, see the supported alphabets (ASCII, DNA, and IUPAC) listed under Highlights.
CRISPR off-target search for one or more guides in guides.txt:
This allows at most k edits in the sgRNA; the PAM (the last 3 characters of each guide) has to match exactly, unless --allow-pam-edits is given.
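The guide layout described above can be sketched as follows: the last 3 characters form the PAM, and everything before it is the sgRNA part that may contain edits. The helpers `split_guide` and `pam_matches` are hypothetical. Treating N in the guide's PAM as a wildcard and comparing case-insensitively are assumptions made for this sketch, not documented behaviour of the crispr command.

```rust
/// Splits a sequence into (everything before the PAM, PAM), where the PAM is
/// the last 3 characters. Returns None if the sequence is shorter than 3.
fn split_guide(seq: &[u8]) -> Option<(&[u8], &[u8])> {
    if seq.len() < 3 {
        return None;
    }
    Some(seq.split_at(seq.len() - 3))
}

/// Checks whether the last 3 characters of `hit` match the guide's PAM.
/// Assumption: 'N' in the guide matches any character; case is ignored.
fn pam_matches(guide: &[u8], hit: &[u8]) -> bool {
    match (split_guide(guide), split_guide(hit)) {
        (Some((_, pam)), Some((_, hit_pam))) => pam
            .iter()
            .zip(hit_pam)
            .all(|(g, h)| g.eq_ignore_ascii_case(&b'N') || g.eq_ignore_ascii_case(h)),
        _ => false,
    }
}
```

For a guide ending in NGG, a hit ending in AGG or cgg passes this check, while a hit ending in TGA does not.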
Output of the crispr command is a tab-delimited file with one row per hit, e.g.:
guide text_id cost strand start end match_region cigar
GAGTCCGAGCAGAAGAAGAANGG chr21 5 + 5024135 5024154 GAGGCCACAGAGAAGAGGG 3=1X2=1D1=1D3=1D5=1D4=
GAGTCCGAGCAGAAGAAGAANGG chr21 3 + 21087337 21087359 gagaccgaggagaagaaaaagg 3=1X5=1X7=1D5=
GAGTCCGAGCAGAAGAAGAANGG chr21 3 - 9701297 9701320 GACTCGAGCATGAAGAAGAAAGG 2=1X1=1D6=1I12=
GAGTCCGAGCAGAAGAAGAANGG chr21 5 - 46396975 46396998 CAGTCCCAGCAGACGACGGACGG 1X5=1X6=1X2=1X1=1X4=
The start and end are 0-based and half-open (the start is inclusive, the end is exclusive), and start is always less than end, regardless of the strand. The match_region reported is the sequence from the target file when strand is +, or the reverse complement of that sequence when strand is -, so that it matches the guide sequence. The cigar is always oriented to read left-to-right with the provided guide and match_region sequences.
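To sanity-check a row of this output, one can summarize its cigar and compare the result with the cost column and with end - start. The sketch below assumes that = and X consume one character of both the guide and the match_region, that D consumes only a guide character, and that I consumes only a match_region character. This orientation is inferred from the example rows above rather than taken from a specification, and `summarize_cigar` is a hypothetical helper.

```rust
#[derive(Debug, PartialEq)]
struct CigarSummary {
    cost: usize,       // number of X, D, and I operations
    guide_len: usize,  // guide characters consumed (=, X, D)
    region_len: usize, // match_region characters consumed (=, X, I)
}

/// Parses a cigar such as "3=1X2=1D" and sums up its operations.
/// Returns None on malformed input.
fn summarize_cigar(cigar: &str) -> Option<CigarSummary> {
    let mut s = CigarSummary { cost: 0, guide_len: 0, region_len: 0 };
    let mut count = 0usize;
    let mut has_digits = false;
    for ch in cigar.chars() {
        if let Some(d) = ch.to_digit(10) {
            count = count.checked_mul(10)?.checked_add(d as usize)?;
            has_digits = true;
            continue;
        }
        if !has_digits {
            return None; // operation without a preceding count
        }
        match ch {
            '=' => { s.guide_len += count; s.region_len += count; }
            'X' => { s.cost += count; s.guide_len += count; s.region_len += count; }
            'D' => { s.cost += count; s.guide_len += count; }
            'I' => { s.cost += count; s.region_len += count; }
            _ => return None,
        }
        count = 0;
        has_digits = false;
    }
    // A trailing count without an operation is malformed.
    if has_digits { None } else { Some(s) }
}
```

For the first example row, the summary gives a cost of 5, a guide length of 23, and a region length of 19, which agrees with the cost column and with end - start = 5024154 - 5024135.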
Note that this searches for approximate occurrences of the guide sequence itself, and not for reverse-complement binding sites. If binding sites are to be found, please reverse-complement the input or output manually.
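For the manual reverse-complementing mentioned above, a small helper along these lines may be enough. It is a sketch with a made-up name (`reverse_complement`), it preserves case, it complements the IUPAC ambiguity codes, and it passes unknown bytes through unchanged.

```rust
/// Complements a single nucleotide or IUPAC code, preserving case.
fn complement(c: u8) -> u8 {
    let comp = match c.to_ascii_uppercase() {
        b'A' => b'T',
        b'T' => b'A',
        b'C' => b'G',
        b'G' => b'C',
        b'R' => b'Y',
        b'Y' => b'R',
        b'K' => b'M',
        b'M' => b'K',
        b'B' => b'V',
        b'V' => b'B',
        b'D' => b'H',
        b'H' => b'D',
        // S, W, and N are their own complements; unknown bytes pass through.
        _ => return c,
    };
    if c.is_ascii_lowercase() {
        comp.to_ascii_lowercase()
    } else {
        comp
    }
}

/// Returns the reverse complement of a sequence.
fn reverse_complement(seq: &[u8]) -> Vec<u8> {
    seq.iter().rev().map(|&c| complement(c)).collect()
}
```

Applying it to each guide before the search, or to each reported match_region afterwards, covers the two options named in the note.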
2. Python bindings
PyPI wheels can be installed with:
pip install sassy-rs
= b
= b
= # ascii / dna / iupac
=
See python/README.md for more details.
3. C library
See c/README.md for details. Quick example:
int