skesa-rs 0.2.1

Rust port of NCBI's SKESA genome assembler

skesa-rs

Pure Rust port of NCBI's SKESA (Strategic K-mer Extension for Scrupulous Assemblies) — a de-novo sequence read assembler for microbial genomes.

Based on SKESA v2.4.0 / SAUTE v1.3.0 (commit 27caba2, 2024-10-11)

(SAUTE is not yet covered by this translation)

  • 2026-05-02: Exposed internals better for library use
  • 2026-04-27: This crate has passed only a bare minimum of testing. Current local SKESA benchmarks with --cores 4 are roughly on par with the original. Use at your own risk.

Below is a blurb that we should add to all our crates; it is the latest version

This is an LLM-mediated faithful (hopefully) translation, not the original code!

Most users should probably first see if the existing original code works for them, unless they have reason otherwise. The original source may have newer features and it has had more love in terms of fixing bugs. In fact, we aim to replicate bugs if they are present, for the sake of reproducibility! (but then we might have added a few more in the process)

There are however cases when you might prefer this Rust version. We generally agree with this manifesto but more specifically:

  • We have had many issues with ensuring that our software works using existing containers (Docker, Podman, Singularity). One size does not fit all, and trying to keep up with every way of delivering software eats our resources
  • Common package managers do not work well. It was great when we had a few Linux distributions with stable procedures, but now there are just too many ecosystems (Homebrew, Conda). Conda's dependency resolution is NP-complete and does not scale. Homebrew is only so stable. And our dependencies in Python still break. These can no longer be considered serious professional options. Meanwhile, Cargo enables multiple versions of packages to be available, even within the same program(!)
  • The future is the web. We deploy software in the web browser, and until now that has meant JavaScript. This is a language where even the == operator is broken. TypeScript is one step up, but a game changer is the ability to compile Rust code into WebAssembly, enabling performance and sharing of code with the backend. Translating code to Rust enables new ways of deployment, and running code in the browser has particular benefits for science: researchers do not have deep pockets to run servers, so pushing compute to the user enables deployment that would otherwise be impossible
  • Old CLI-based utilities are bad for the environment(!). A large amount of compute resources is spent creating and communicating via small files, which we can bypass by using code as libraries. Even better, we can avoid frequent reloading of databases by hoisting this stage, with up to 100x speedups in some cases. Less compute means faster compute and less electricity wasted
  • LLM-mediated translations may actually be safer to use than the original code. This article shows that running the same code on different operating systems can give somewhat different answers. This is a gap that Rust+Cargo can reduce. Type-safe interfaces also reduce coding mistakes and the error handling needed, compared with typical command-line scripting

But:

  • This approach should still be considered experimental. The LLM technology is immature and has sharp corners. But there are opportunities to reap, and the genie is not going back into the bottle. This translation is aimed as much at learning how to improve the technology as at getting feedback on the results.
  • Translations are not endorsed by the original authors unless otherwise noted. Do not send bug reports to the original developers. Use our GitHub issues page instead.
  • Do not trust the benchmarks on this page. They are used to help evaluate the translation. If you want improved performance, you generally have to use this code as a library, and use the additional tricks it offers. We generally accept performance losses in order to reduce our dependency issues.
  • Check the original GitHub pages for information about the package. This README is kept sparse on purpose. It is not meant to be the primary source of information.
  • If you are the author of the original code and wish to move to Rust, you can obtain ownership of this repository and crate. Until then, our commitment is to offer an as-faithful-as-possible translation of a snapshot of your code. If we find serious bugs, we will report them to you. Otherwise we will just replicate them, to ensure comparability across studies that claim to use package XYZ v.666. Think of this like a fancy Ubuntu .deb package of your software; that is how we treat it.

This blurb might be out of date. Go to this page for the latest version and for more on how we approach translation.

Missing Features

  • Full parity for saute, saute-prot, and gfa-connector is not yet implemented.
  • SRA input is not supported in the Rust port; --sra_run / --sra-run are rejected with a clear error.

Building

# Library only (default; no clap dependency)
cargo build --release

# Library + CLI
cargo build --release --features cli

# Native CPU optimizations (recommended for benchmarking)
RUSTFLAGS="-C target-cpu=native" cargo build --release --features cli

Usage

K-mer counting

skesa-rs kmercounter --reads input.fasta --kmer 21 --text-out kmers.txt --hist histogram.txt
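
To illustrate what a k-mer count table and its histogram contain, here is a minimal sketch in plain Rust. It is not the skesa-rs implementation and does not claim to reproduce the file formats written by --text-out or --hist. The function names are hypothetical, and merging each k-mer with its reverse complement (canonical k-mers) is an assumption made for the sketch.

```rust
use std::collections::{BTreeMap, HashMap};

/// Reverse complement of a DNA sequence; returns None on any non-ACGT byte.
fn reverse_complement(seq: &[u8]) -> Option<Vec<u8>> {
    seq.iter()
        .rev()
        .map(|&b| match b {
            b'A' => Some(b'T'),
            b'C' => Some(b'G'),
            b'G' => Some(b'C'),
            b'T' => Some(b'A'),
            _ => None,
        })
        .collect()
}

/// Count canonical k-mers (the lexicographically smaller of a k-mer and its
/// reverse complement) over all reads. Windows with non-ACGT bytes are skipped.
pub fn count_kmers(reads: &[&str], k: usize) -> HashMap<Vec<u8>, u32> {
    let mut counts = HashMap::new();
    if k == 0 {
        return counts; // windows(0) would panic
    }
    for read in reads {
        for window in read.as_bytes().windows(k) {
            if let Some(rc) = reverse_complement(window) {
                let canonical = if rc.as_slice() < window { rc } else { window.to_vec() };
                *counts.entry(canonical).or_insert(0) += 1;
            }
        }
    }
    counts
}

/// Histogram: for each observed count, the number of distinct k-mers with that count.
pub fn histogram(counts: &HashMap<Vec<u8>, u32>) -> BTreeMap<u32, usize> {
    let mut hist = BTreeMap::new();
    for &c in counts.values() {
        *hist.entry(c).or_insert(0) += 1;
    }
    hist
}
```

A hash map is used here only for brevity; the Architecture section lists a sorted counting pipeline for the real thing.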

Assembly

# Basic assembly
skesa-rs skesa --reads input.fasta --contigs-out contigs.fasta

# With options
skesa-rs skesa --reads input.fasta --cores 4 --kmer 21 --min-contig 200 --contigs-out contigs.fasta

# GFA output
skesa-rs skesa --reads input.fasta --contigs-out contigs.fasta --gfa-out graph.gfa

Library usage

use skesa_rs::reads_getter::ReadsGetter;
use skesa_rs::sorted_counter;
use skesa_rs::graph_digger::{self, DiggerParams};

// Load reads from file
let rg = ReadsGetter::new(&["reads.fasta".to_string()], false).unwrap();

// Count k-mers
let mut kmers = sorted_counter::count_kmers_sorted(
    rg.reads(), 21, 2, true, 32,
);
sorted_counter::get_branches(&mut kmers, 21);

// Assemble contigs
let bins = sorted_counter::get_bins(&kmers);
let contigs = graph_digger::assemble_contigs(
    &mut kmers, &bins, 21, true, &DiggerParams::default(),
);

for contig in &contigs {
    println!("{}", contig.primary_sequence());
}
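
If you would rather write the sequences to a FASTA file than print them, a small helper along these lines may be enough. This is a sketch, not part of skesa-rs: `write_fasta` is a hypothetical name, the record names are made up and do not reproduce the CLI's header format, and it assumes you have first collected the sequences into `String`s (for example with `to_string()` on whatever `primary_sequence()` returns).

```rust
use std::io::{self, Write};

/// Write sequences as FASTA records, wrapping sequence lines at `line_width` bytes.
/// Record names (`contig_1`, `contig_2`, ...) are invented for this sketch.
pub fn write_fasta<W: Write>(out: &mut W, seqs: &[String], line_width: usize) -> io::Result<()> {
    let width = line_width.max(1); // chunks(0) would panic
    for (i, seq) in seqs.iter().enumerate() {
        writeln!(out, ">contig_{}", i + 1)?;
        for chunk in seq.as_bytes().chunks(width) {
            out.write_all(chunk)?;
            out.write_all(b"\n")?;
        }
    }
    Ok(())
}
```

Any `Write` target works, so the same function can fill a `Vec<u8>` in tests or a buffered `File` in real use.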

Testing

cargo test                          # Run unit and integration tests
cargo test kmer                     # Run k-mer related tests
cargo test assembler                # Run assembler tests
cargo bench --bench kmer_bench      # Run criterion benchmarks

For optional real-world coverage without vendoring public reads into the repo, download the external fixture(s) first and then run the ignored parity test:

scripts/download-real-world-fixtures.py \
  --dest /somewhere/external

SKESA_EXTERNAL_DATA_DIR=/somewhere/external \
TMPDIR=/somewhere/skesa \
cargo test --test integration_tests cpp_kmercounter_real_world_mgenitalium_hist_matches_rust -- --ignored

The initial external fixture is ENA run ERR486835, a small public paired-end Mycoplasmoides genitalium WGS dataset. The downloader keeps only the first 5k read pairs, written as subset_1.fastq / subset_2.fastq, and the ignored test currently compares the --hist output of the Rust and bundled C++ kmercounter on subset_1.fastq. Larger external assembly parity fixtures should be added only after the remaining multi-step assembly gaps are closed.
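
The trimming step described above amounts to copying the first N four-line FASTQ records from each mate file. A minimal Rust sketch of that step follows. It is not a port of scripts/download-real-world-fixtures.py, the function name `take_fastq_records` is hypothetical, and it assumes unwrapped four-line records.

```rust
use std::io::{self, BufRead, Write};

/// Copy at most `max_records` FASTQ records (4 lines each) from `input` to `output`.
/// Returns the number of complete records copied; a truncated final record is dropped.
pub fn take_fastq_records<R: BufRead, W: Write>(
    input: R,
    mut output: W,
    max_records: usize,
) -> io::Result<usize> {
    let mut lines = input.lines();
    let mut copied = 0;
    while copied < max_records {
        // Collect the next four lines: header, sequence, separator, qualities.
        let mut record = Vec::with_capacity(4);
        for _ in 0..4 {
            match lines.next() {
                Some(line) => record.push(line?),
                None => break,
            }
        }
        if record.len() < 4 {
            break; // end of input, or an incomplete record
        }
        for line in &record {
            writeln!(output, "{}", line)?;
        }
        copied += 1;
    }
    Ok(copied)
}
```

Calling it with the same `max_records` on both mate files keeps the pairs in step, provided the two inputs list their reads in the same order.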

Benchmarking

Use tools/benchmark_command.py for parity/performance measurements so Rust and bundled C++ commands record the same metadata. Put large temporary inputs, outputs, and JSON results under /somewhere when that path is writable in the current environment.

Current local diagnostic snapshot (SRR1703350_10x, paired FASTQ.gz, --cores 4, --kmer 21 --min-count 2 --vector-percent 1.0 --estimated-kmers 100 --min-contig 200, contigs-only output):

| implementation | wall time | CPU time | max RSS | output SHA-256 |
| --- | --- | --- | --- | --- |
| Rust target/release/skesa-rs | 1:23.80 | 211.13s | 785,596 KB | 37788987fae96213428bb0782826823c188c3cac58ab0c1903f8bbbe28893060 |
| bundled C++ SKESA/skesa | 1:41.16 | 225.57s | 781,940 KB | 37788987fae96213428bb0782826823c188c3cac58ab0c1903f8bbbe28893060 |

Treat this as a local diagnostic run, not a portable performance claim. The Rust and C++ output files were compared byte-for-byte for this snapshot.
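
A byte-for-byte comparison of two output files needs nothing beyond the standard library. The sketch below shows one way to do it; the function names are hypothetical and this is not the project's own comparison tooling. It does not compute SHA-256, which is not available in the Rust standard library.

```rust
use std::fs::File;
use std::io::{self, BufReader, Read};
use std::path::Path;

/// Return true if both readers yield exactly the same byte sequence.
pub fn same_bytes<A: Read, B: Read>(a: A, b: B) -> io::Result<bool> {
    let mut a = BufReader::new(a).bytes();
    let mut b = BufReader::new(b).bytes();
    loop {
        match (a.next(), b.next()) {
            (None, None) => return Ok(true), // both ended together
            (Some(x), Some(y)) => {
                if x? != y? {
                    return Ok(false); // first differing byte
                }
            }
            (Some(x), None) | (None, Some(x)) => {
                x?; // surface an I/O error if there was one
                return Ok(false); // one stream is longer than the other
            }
        }
    }
}

/// Convenience wrapper for two files on disk.
pub fn same_file_contents(a: &Path, b: &Path) -> io::Result<bool> {
    same_bytes(File::open(a)?, File::open(b)?)
}
```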

python3 tools/benchmark_command.py \
  --output /somewhere/rust-small.json \
  --label rust-small --repeat 3 -- \
  target/release/skesa-rs skesa --reads tests/data/small_test.fasta --contigs-out /tmp/rust-contigs.fasta

python3 tools/benchmark_command.py \
  --output /somewhere/cpp-small.json \
  --label cpp-small --repeat 3 -- \
  SKESA/skesa --reads tests/data/small_test.fasta --contigs_out /tmp/cpp-contigs.fasta

Record the command, platform, wall time, CPU time, max RSS, return code, and stderr tail with the correctness fixture notes. Do not treat benchmark numbers as parity evidence unless the corresponding outputs have already been compared. See BENCHMARKS.md for recorded local snapshots.
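
For readers who want to see what such a record involves, here is a minimal Rust sketch that runs a command and captures part of the metadata listed above. It is not a replacement for tools/benchmark_command.py, and all names in it are hypothetical. CPU time and max RSS are left out because the Rust standard library does not expose them.

```rust
use std::io;
use std::process::Command;
use std::time::{Duration, Instant};

/// Metadata captured by this sketch for a single run.
#[derive(Debug)]
pub struct RunRecord {
    pub command: Vec<String>,
    pub wall_time: Duration,
    pub return_code: Option<i32>, // None if the process was killed by a signal
    pub stderr_tail: String,
}

/// Last `n` lines of `text`, joined with newlines.
pub fn tail_lines(text: &str, n: usize) -> String {
    let lines: Vec<&str> = text.lines().collect();
    let start = lines.len().saturating_sub(n);
    lines[start..].join("\n")
}

/// Run `program` with `args`, wait for it to finish, and record basic metadata.
pub fn run_and_record(program: &str, args: &[&str]) -> io::Result<RunRecord> {
    let started = Instant::now();
    let output = Command::new(program).args(args).output()?;
    let wall_time = started.elapsed();

    let mut command = vec![program.to_string()];
    command.extend(args.iter().map(|a| a.to_string()));

    Ok(RunRecord {
        command,
        wall_time,
        return_code: output.status.code(),
        stderr_tail: tail_lines(&String::from_utf8_lossy(&output.stderr), 20),
    })
}
```

Because `output()` buffers all of stdout and stderr in memory, this approach suits commands that write their results to files, as the examples above do.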

For a deterministic local smoke input, generate synthetic reads with tools/generate_synthetic_medium_reads.py. The default medium profile has a recorded Rust/C++ parity benchmark. The high-error and high-repeat profiles currently expose assembly-output divergences and should be used as correctness probes, not performance benchmarks.

Architecture

| Module | Description |
| --- | --- |
| large_int, kmer | Multi-precision k-mer representation (1-512bp) |
| read_holder | 2-bit packed DNA storage with zero-alloc k-mer iteration |
| reads_getter | FASTA/FASTQ/gzip reader (noodles + flate2) |
| bloom_filter, concurrent_hash | Concurrent k-mer counting structures |
| counter, sorted_counter, flat_counter | Sorted k-mer counting pipeline |
| graph_digger | De Bruijn graph traversal with fork resolution |
| assembler | Iterative multi-k assembly orchestration |
| linked_contig | ConnectFragments link chain walking |
| snp_discovery | SNP/indel detection at fork points |
| clean_reads | Read-to-contig mapping and filtering |
| paired_reads | Paired-end connection + insert size estimation |
| guided_path, guided_graph | Target-guided assembly (SAUTE) |
| spider_graph | Multi-path enumeration for GFA Connector |
| glb_align | Needleman-Wunsch, Smith-Waterman, BLOSUM62 |
| genetic_code | 25 NCBI genetic code tables |
| gfa | GFA 1.0 format output |
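
As background for the 2-bit packed storage mentioned in the module list, the sketch below packs a short k-mer into a single `u64` and computes its reverse complement on the packed form. It is an illustration only: the function names and the A=0, C=1, G=2, T=3 encoding are assumptions, and the crate's own types cover k-mers up to 512 bp with multi-precision integers, whereas this sketch stops at 32 bases.

```rust
/// Map a nucleotide to a 2-bit code. The A=0, C=1, G=2, T=3 encoding is an
/// assumption of this sketch, not necessarily what skesa-rs uses.
fn encode_base(b: u8) -> Option<u64> {
    match b {
        b'A' | b'a' => Some(0),
        b'C' | b'c' => Some(1),
        b'G' | b'g' => Some(2),
        b'T' | b't' => Some(3),
        _ => None,
    }
}

/// Pack a k-mer of at most 32 bases into a u64; the last base ends up in the
/// lowest two bits. Returns None for longer inputs or non-ACGT bytes.
pub fn pack_kmer(kmer: &[u8]) -> Option<u64> {
    if kmer.len() > 32 {
        return None;
    }
    let mut packed = 0u64;
    for &b in kmer {
        packed = (packed << 2) | encode_base(b)?;
    }
    Some(packed)
}

/// Reverse complement of a packed k-mer of length `k` (1..=32).
pub fn reverse_complement(packed: u64, k: usize) -> u64 {
    let mut rc = 0u64;
    let mut rest = packed;
    for _ in 0..k {
        rc = (rc << 2) | (3 - (rest & 3)); // complement under this encoding: A<->T, C<->G
        rest >>= 2;
    }
    rc
}

/// Decode a packed k-mer of length `k` (1..=32) back into text.
pub fn unpack_kmer(packed: u64, k: usize) -> String {
    (0..k)
        .rev()
        .map(|i| b"ACGT"[((packed >> (2 * i)) & 3) as usize] as char)
        .collect()
}
```

With this encoding the complement of a base is simply `3 - code`, which is why the reverse complement needs no lookup table.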

Citation

This is a port of SKESA. Please cite the original work:

Alexandre Souvorov, Richa Agarwala and David J. Lipman. SKESA: strategic k-mer extension for scrupulous assemblies. Genome Biology 2018 19:153. doi.org/10.1186/s13059-018-1540-z

License

The original SKESA is public domain (US Government Work). This port follows the same terms (Unlicense).