sassy 0.1.7

Approximate string matching using SIMD

Sassy: SIMD-accelerated Approximate String Matching

Sassy is a library and tool for searching short strings in texts, a problem that goes by many names:

  • approximate string matching,
  • pattern matching,
  • fuzzy searching.

The motivating application is searching short (length 20 to 100) DNA sequences in a human genome or e.g. in a set of reads. Sassy generally works well for patterns/queries up to length 1000, and supports both ASCII and DNA.
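
To make the problem statement concrete, here is a minimal scalar reference sketch. It is the textbook column-by-column dynamic program for approximate matching, not Sassy's bitpacked SIMD algorithm, and the name `approx_match_ends` and its return type are hypothetical and only used for illustration. It reports every end position in the text where the pattern matches with at most k edits.

```rust
/// Scalar reference for approximate string matching (NOT Sassy's SIMD code).
/// Returns `(end, cost)` for every exclusive end position `end` in `text`
/// where `pattern` matches some substring ending there with `cost <= k` edits.
fn approx_match_ends(pattern: &[u8], text: &[u8], k: usize) -> Vec<(usize, usize)> {
    let m = pattern.len();
    // col[i] = minimal edits to align pattern[..i] with a substring of the
    // text that ends at the current text position.
    let mut col: Vec<usize> = (0..=m).collect();
    let mut hits = Vec::new();
    if col[m] <= k {
        hits.push((0, col[m]));
    }
    for (j, &c) in text.iter().enumerate() {
        // `diag` holds the previous column's value at row i-1.
        // col[0] stays 0 because a match may start at any text position.
        let mut diag = col[0];
        for i in 1..=m {
            let sub = diag + usize::from(pattern[i - 1] != c); // match or mismatch
            let del = col[i - 1] + 1; // pattern character missing from the text
            let ins = col[i] + 1; // extra character in the text
            diag = col[i];
            col[i] = sub.min(del).min(ins);
        }
        if col[m] <= k {
            hits.push((j + 1, col[m]));
        }
    }
    hits
}
```

For the pattern ATCG, the text AAAATTGAAA, and k = 1, this returns the single pair (7, 1): one match ending at text position 7 with cost 1.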

Highlights:

  • Sassy uses bitpacking and SIMD (both AVX2 and NEON supported). Its main novelty is tiling these in the text direction.
  • Support for overhang alignments where the pattern extends beyond the text.
  • Support for (case-insensitive) ASCII, DNA (ACGT), and IUPAC (=ACGT+NYR...) alphabets.
  • Rust library (cargo add sassy), binary (cargo install sassy), Python bindings (pip install sassy-rs), and C bindings (see below).
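
As an illustration of the IUPAC alphabet mentioned above, the following sketch encodes each IUPAC code as a 4-bit set over A, C, G, T and treats two characters as matching when their sets overlap. The overlap rule and the names `iupac_mask` and `iupac_match` are assumptions made for this example. The code table itself is the standard IUPAC nucleotide notation, but this is not taken from Sassy's source.

```rust
/// Map an IUPAC nucleotide code (case-insensitive) to a bitmask over A, C, G, T.
/// Returns `None` for characters outside the IUPAC alphabet.
fn iupac_mask(c: u8) -> Option<u8> {
    const A: u8 = 1;
    const C: u8 = 2;
    const G: u8 = 4;
    const T: u8 = 8;
    Some(match c.to_ascii_uppercase() {
        b'A' => A,
        b'C' => C,
        b'G' => G,
        b'T' | b'U' => T,
        b'R' => A | G,
        b'Y' => C | T,
        b'S' => C | G,
        b'W' => A | T,
        b'K' => G | T,
        b'M' => A | C,
        b'B' => C | G | T,
        b'D' => A | G | T,
        b'H' => A | C | T,
        b'V' => A | C | G,
        b'N' => A | C | G | T,
        _ => return None,
    })
}

/// Assumed matching rule: two codes match if they share at least one base.
fn iupac_match(pattern_char: u8, text_char: u8) -> bool {
    match (iupac_mask(pattern_char), iupac_mask(text_char)) {
        (Some(p), Some(t)) => p & t != 0,
        _ => false,
    }
}
```

Under this rule, N matches every base, R (A or G) matches A but Y (C or T) does not.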

See the paper below, and corresponding evals in evals/.

Rick Beeloo and Ragnar Groot Koerkamp.
Sassy: Searching Short DNA Strings in the 2020s.
bioRxiv, July 2025. https://doi.org/10.1101/2025.07.22.666207.

Installation

Prebuilt binaries can be found in the latest release.

Building a binary with SIMD instructions

Sassy uses AVX2 and NEON instructions for performance reasons. Unfortunately, by default cargo install will not use these instructions for portability reasons, even though your system is very likely to support them. Thus, you will need to manually instruct cargo to use the instruction sets available on your architecture:

RUSTFLAGS="-C target-cpu=native" cargo install sassy

Alternatively, enable the scalar feature (-F scalar) to fall back to a scalar implementation with reduced performance:

cargo install sassy -F scalar

When using the sassy library in a larger project, the same restrictions apply: you will either need to build/compile the final binary with target-cpu=native, or pass the scalar feature to the sassy dependency.
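
When debugging such a build, it can help to check which SIMD features the final binary was actually compiled with. The sketch below uses only the Rust standard library. The function names `compiled_simd` and `cpu_supports_simd` are hypothetical helpers and not part of Sassy.

```rust
/// SIMD instruction set enabled at compile time for this binary
/// (e.g. via `-C target-cpu=native`), as one of "avx2", "neon", or "none".
fn compiled_simd() -> &'static str {
    if cfg!(all(target_arch = "x86_64", target_feature = "avx2")) {
        "avx2"
    } else if cfg!(all(target_arch = "aarch64", target_feature = "neon")) {
        "neon"
    } else {
        "none"
    }
}

/// Whether the CPU running this binary supports AVX2 (x86_64) or NEON (aarch64).
#[cfg(target_arch = "x86_64")]
fn cpu_supports_simd() -> bool {
    std::is_x86_feature_detected!("avx2")
}

#[cfg(target_arch = "aarch64")]
fn cpu_supports_simd() -> bool {
    std::arch::is_aarch64_feature_detected!("neon")
}

#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
fn cpu_supports_simd() -> bool {
    false
}
```

If `compiled_simd()` returns "none" while `cpu_supports_simd()` returns true, the binary was most likely built without target-cpu=native.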

Rust version

Sassy uses some recently stabilized Rust features, so you will need at least Rust 1.91, typically obtained via rustup update. If your system-wide Rust installation is older, consider switching to rustup: https://rustup.rs/.

Usage

Sassy can be used via the CLI, or as a Rust, Python, or C library.

0. Rust library

The library can be used to search for ASCII or DNA strings. A larger example can be found in src/lib.rs.

// cargo add sassy
use sassy::{Searcher, Match, profiles::{Dna}, Strand};

let pattern = b"ATCG";
let text = b"AAAATTGAAA";
let k = 1;

let mut searcher = Searcher::<Dna>::new_fwd();
let matches = searcher.search(pattern, &text, k);

assert_eq!(matches.len(), 1);

assert_eq!(matches[0].text_start, 3);
assert_eq!(matches[0].text_end, 7);
assert_eq!(matches[0].cost, 1);
assert_eq!(matches[0].strand, Strand::Fwd);
assert_eq!(matches[0].cigar.to_string(), "2=1X1=");
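
The CIGAR string in the last assertion can be read as a sequence of counted operations. The sketch below is a standalone helper (hypothetical, not part of the sassy crate) that sums up the cost and the number of pattern and text characters covered. It assumes the convention of the grep colour legend: I is an extra character in the text, D is a pattern character missing from the text.

```rust
/// Summary of a CIGAR string such as "2=1X1=".
#[derive(Debug, PartialEq)]
struct CigarStats {
    cost: usize,        // number of edits (X, I, D)
    pattern_len: usize, // pattern characters covered (=, X, D)
    text_len: usize,    // text characters covered (=, X, I)
}

/// Parse a CIGAR string; returns `None` if it is malformed.
fn cigar_stats(cigar: &str) -> Option<CigarStats> {
    let mut stats = CigarStats { cost: 0, pattern_len: 0, text_len: 0 };
    let mut count: Option<usize> = None;
    for ch in cigar.chars() {
        if let Some(d) = ch.to_digit(10) {
            // Accumulate a (possibly multi-digit) count.
            count = Some(count.unwrap_or(0).checked_mul(10)?.checked_add(d as usize)?);
            continue;
        }
        let n = count.take()?; // an operation without a count is malformed
        match ch {
            '=' => { stats.pattern_len += n; stats.text_len += n; }
            'X' => { stats.cost += n; stats.pattern_len += n; stats.text_len += n; }
            'I' => { stats.cost += n; stats.text_len += n; }
            'D' => { stats.cost += n; stats.pattern_len += n; }
            _ => return None,
        }
    }
    // A trailing count without an operation is malformed as well.
    if count.is_some() { None } else { Some(stats) }
}
```

For "2=1X1=" this gives cost 1 and 4 characters of both pattern and text, which agrees with cost == 1 and text_end - text_start == 4 in the example above.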

1. Command-line interface (CLI)

The CLI can be used via:

  1. sassy grep: to show nicely coloured output.
  2. sassy search: to write a .tsv of matching locations.
  3. sassy filter: to write a .fasta/.fastq of (non)-matching records.
  4. sassy crispr: to search for CRISPR guides.

grep, search, and filter all take the same arguments, and are implemented by forwarding to grep. Thus, they can all be combined via e.g.

sassy grep -p ACGTCAAACCTA -k 3 --matches matches.tsv --output filtered.fastq reads.fastq.gz

1.1: Grep for a pattern

Search a pattern ATGAGCA in text.fasta with ≤1 edit:

sassy grep --pattern ATGAGCA -k 1 text.fasta

To search for every record of a FASTA file as a pattern, use --pattern-fasta <fasta-file> instead of --pattern.

The grep output is coloured:

  • green shows matching characters,
  • orange shows mismatches,
  • red shows deleted characters (in pattern but not in text),
  • blue shows inserted characters (in text but not in pattern).

[Screenshot of sassy grep output]

1.2: TSV output for matches

sassy search -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fa > matches.tsv
# or
sassy search -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fa --matches matches.tsv
# or
sassy grep   -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fa --matches matches.tsv

gives .tsv output like this:

pat_id	text_id	cost	strand	start	end	match_region	cigar
pattern	AC_000001.1__1_1	0	+	6	48	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_35	0	+	897	939	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_49	1	+	866	908	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCGCGCG	37=1X4=
pattern	AC_000001.1__1_64	0	-	1267	1309	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_67	0	+	600	642	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_68	0	-	1826	1868	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_78	3	-	4381	4425	GTACAGAAACGAGCGGATGGAAAATAGTAGTGAGCGGCCTCGCG	23=1X1I10=1I8=
pattern	AC_000001.1__1_92	0	-	6554	6596	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_94	0	-	6413	6455	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_115	2	+	2091	2131	GTACAGAAACGAGCATGGAAAGAGTAGTGAGCGCCTCGCG	14=2D26=
pattern	AC_000001.1__1_118	0	-	3062	3104	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_123	0	+	1416	1458	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_127	0	+	27	69	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
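
For downstream processing, a row of this output can be split on tabs into typed fields. The sketch below assumes exactly the eight columns shown in the header above. The struct `MatchRow` and the function `parse_match_row` are hypothetical names for this example and are not provided by Sassy.

```rust
/// One row of the matches .tsv, with the eight columns shown above.
#[derive(Debug, PartialEq)]
struct MatchRow {
    pat_id: String,
    text_id: String,
    cost: usize,
    reverse: bool, // true for strand "-", false for "+"
    start: usize,
    end: usize,
    match_region: String,
    cigar: String,
}

/// Parse one tab-separated line; returns `None` for the header line
/// or for any line that does not have the expected shape.
fn parse_match_row(line: &str) -> Option<MatchRow> {
    let fields: Vec<&str> = line.trim_end().split('\t').collect();
    if fields.len() != 8 {
        return None;
    }
    let reverse = match fields[3] {
        "+" => false,
        "-" => true,
        _ => return None,
    };
    Some(MatchRow {
        pat_id: fields[0].to_string(),
        text_id: fields[1].to_string(),
        cost: fields[2].parse().ok()?, // fails on the header's "cost"
        reverse,
        start: fields[4].parse().ok()?,
        end: fields[5].parse().ok()?,
        match_region: fields[6].to_string(),
        cigar: fields[7].to_string(),
    })
}
```

In the rows above, end - start equals the length of match_region (for example 908 - 866 = 42), which is a cheap sanity check after parsing.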

1.3: Filter matching records

sassy filter -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fq > filtered.fq
# or
sassy filter -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fq -o filtered.fq
# or
sassy grep   -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fq -o filtered.fq

This writes a file containing only the matching records. Use --invert to write only the non-matching records instead.

1.4: CRISPR off-target search

Search for one or more guides in guides.txt:

sassy crispr --threads 8 --guide guides.txt --k 5 --max-n-frac 0.1 --output hits.tsv hg38.fasta

This allows ≤ k edits in the sgRNA. The PAM (the last 3 characters of each guide) has to match exactly, unless --allow-pam-edits is given.

Output of the crispr command is a tab-delimited file with one row per hit, e.g.:

guide                    text_id  cost  strand  start     end       match_region             cigar
GAGTCCGAGCAGAAGAAGAANGG  chr21    5     +       5024135   5024154   GAGGCCACAGAGAAGAGGG      3=1X2=1D1=1D3=1D5=1D4=
GAGTCCGAGCAGAAGAAGAANGG  chr21    3     +       21087337  21087359  gagaccgaggagaagaaaaagg   3=1X5=1X7=1D5=
GAGTCCGAGCAGAAGAAGAANGG  chr21    3     -       9701297   9701320   GACTCGAGCATGAAGAAGAAAGG  2=1X1=1D6=1I12=
GAGTCCGAGCAGAAGAAGAANGG  chr21    5     -       46396975  46396998  CAGTCCCAGCAGACGACGGACGG  1X5=1X6=1X2=1X1=1X4=

The start and end are 0-based and half-open (i.e. inclusive of the start, but exclusive of the end), and start is always less than end (regardless of the strand). The match_region reported will be the sequence from the target file when strand is +, or the reverse complement of the sequence from the target file when strand is -, so that it matches the guide sequence. The cigar is always oriented to read left-to-right with the provided guide and match_region sequences.
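
The relation between start, end, strand, and match_region can be sketched as follows. This is a standalone illustration with hypothetical helper names (`reverse_complement`, `match_region`). It only complements A, C, G, T in either case and leaves all other characters (such as N) unchanged, which is a simplifying assumption.

```rust
/// Reverse complement of a DNA sequence. Only A, C, G, T (upper or lower
/// case) are complemented; any other byte is kept as-is (an assumption).
fn reverse_complement(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|&b| match b {
            b'A' => b'T',
            b'C' => b'G',
            b'G' => b'C',
            b'T' => b'A',
            b'a' => b't',
            b'c' => b'g',
            b'g' => b'c',
            b't' => b'a',
            other => other,
        })
        .collect()
}

/// Recover the reported match region from the target sequence.
/// `start..end` is 0-based and half-open; `minus_strand` is true for "-".
/// Returns `None` if the range does not fit inside `text`.
fn match_region(text: &[u8], start: usize, end: usize, minus_strand: bool) -> Option<Vec<u8>> {
    let slice = text.get(start..end)?;
    Some(if minus_strand {
        reverse_complement(slice)
    } else {
        slice.to_vec()
    })
}
```

For the text AACCGGTT with start 1 and end 4, the + strand gives ACC and the - strand gives GGT.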

Note that this searches for approximate occurrences of the guide sequence itself, and not for reverse-complement binding sites. If binding sites are to be found, please reverse-complement the input or output manually.

2. Python bindings

PyPI wheels can be installed with:

pip install sassy-rs

A minimal example:

import sassy

pattern = b"ACTG"
text    = b"ACGGCTACGCAGCATCATCAGCAT"

searcher = sassy.Searcher("dna") # ascii / dna / iupac
matches  = searcher.search(pattern, text, k=1)

for m in matches:
    print(m)

See python/README.md for more details.

3. C library

See c/README.md for details. Quick example:

#include <math.h>    // NAN
#include <stdbool.h> // true
#include <string.h>  // strlen

#include "sassy.h"

int main() {
    const char* pattern = "ACTG";
    const char* text    = "ACGGCTACGCAGCATCATCAGCAT";

    // DNA alphabet, with reverse complement, without overhang.
    sassy_SearcherType* searcher = sassy_searcher("dna", true, NAN);
    sassy_Match* out_matches = NULL;
    size_t n_matches = search(searcher,
                              pattern, strlen(pattern),
                              text, strlen(text),
                              1, // k=1
                              &out_matches);

    sassy_matches_free(out_matches, n_matches);
    sassy_searcher_free(searcher);
}