skesa-rs
Pure Rust port of NCBI's SKESA (Strategic K-mer Extension for Scrupulous Assemblies) — a de-novo sequence read assembler for microbial genomes.
Based on SKESA v2.4.0 / SAUTE v1.3.0 (commit 27caba2, 2024-10-11)
(SAUTE is not yet covered by this translation)
- 2026-05-02: Exposed internals better for library use
- 2026-04-27: This crate has only passed a bare minimum of testing. Current local --cores 4 SKESA benchmarks are roughly on par with the original. Use at your own risk
Below is a blurb that we should add to all our crates; it is the latest version
This is an LLM-mediated faithful (hopefully) translation, not the original code!
Most users should probably first see if the existing original code works for them, unless they have reason otherwise. The original source may have newer features and it has had more love in terms of fixing bugs. In fact, we aim to replicate bugs if they are present, for the sake of reproducibility! (but then we might have added a few more in the process)
There are however cases when you might prefer this Rust version. We generally agree with this manifesto but more specifically:
- We have had many issues with ensuring that our software works using existing containers (Docker, Podman, Singularity). One size does not fit all, and it eats our resources trying to keep up with every way of delivering software
- Common package managers do not work well. It was great when we had a few Linux distributions with stable procedures, but now there are just too many ecosystems (Homebrew, Conda). Conda has an NP-complete resolver which does not scale. Homebrew is only so-stable. And our dependencies in Python still break. These can no longer be considered professional serious options. Meanwhile, Cargo enables multiple versions of packages to be available, even within the same program(!)
- The future is the web. We deploy software in the web browser, and until now that has meant JavaScript. This is a language where even the == operator is broken. TypeScript is one step up, but a game changer is the ability to compile Rust code into WebAssembly, enabling performance and sharing of code with the backend. Translating code to Rust enables new ways of deployment, and running code in the browser has particular benefits for science: researchers do not have deep pockets to run servers, so pushing compute to the user enables deployment that otherwise would be impossible
- Old CLI-based utilities are bad for the environment(!). A large amount of compute resources are spent creating and communicating via small files, which we can bypass by using code as libraries. Even better, we can avoid frequent reloading of databases by hoisting this stage, with up to 100x speedups in some cases. Less compute means faster compute and less electricity wasted
- LLM-mediated translations may actually be safer to use than the original code. This article shows that running the same code on different operating systems can give somewhat different answers. This is a gap that Rust+Cargo can reduce. Type-safe interfaces also reduce coding mistakes and simplify error handling, compared with typical command-line scripting
But:
- This approach should still be considered experimental. The LLM technology is immature and has sharp corners. But there are opportunities to reap, and the genie is not going back into the bottle. This translation is also aimed at learning how to improve the technology and at getting feedback on the results.
- Translations are not endorsed by the original authors unless otherwise noted. Do not send bug reports to the original developers. Use our GitHub issues page instead.
- Do not trust the benchmarks on this page. They are used to help evaluate the translation. If you want improved performance, you generally have to use this code as a library, and use the additional tricks it offers. We generally accept performance losses in order to reduce our dependency issues
- Check the original GitHub pages for information about the package. This README is kept sparse on purpose. It is not meant to be the primary source of information
- If you are the author of the original code and wish to move to Rust, you can obtain ownership of this repository and crate. Until then, our commitment is to offer an as-faithful-as-possible translation of a snapshot of your code. If we find serious bugs, we will report them to you. Otherwise we will just replicate them, to ensure comparability across studies that claim to use package XYZ v.666. Think of this like a fancy Ubuntu .deb-package of your software - that is how we treat it
This blurb might be out of date. Go to this page for the latest version and for further information about how we approach translation
Missing Features
- saute, saute-prot, and gfa-connector full parity
- SRA input is not supported in the Rust port; --sra_run/--sra-run are rejected with a clear error.
Building
# Library only (default; no clap dependency)
# Library + CLI
# Native CPU optimizations (recommended for benchmarking)
RUSTFLAGS="-C target-cpu=native"
Usage
K-mer counting
Assembly
# Basic assembly
# With options
# GFA output
Library usage
use ReadsGetter;
use sorted_counter;
use ;
// Load reads from file
let rg = new.unwrap;
// Count k-mers
let mut kmers = count_kmers_sorted;
get_branches;
// Assemble contigs
let bins = get_bins;
let contigs = assemble_contigs;
for contig in &contigs
Testing
For optional real-world coverage without vendoring public reads into the repo, download the external fixture(s) first and then run the ignored parity test:
SKESA_EXTERNAL_DATA_DIR=/somewhere/external \
TMPDIR=/somewhere/skesa \
The initial external fixture is ENA run ERR486835, a small public paired-end
Mycoplasmoides genitalium WGS dataset. The downloader trims the first 5k read
pairs into subset_1.fastq / subset_2.fastq, and the ignored test currently
compares Rust and bundled C++ kmercounter --hist output on subset_1.fastq.
Larger external assembly parity fixtures should be added only after the
remaining multi-step assembly gaps are closed.
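The trimming step described above can be sketched in Rust as follows. This is only an illustration of taking the first N records from a FASTQ stream, not the downloader's actual code; the function name `take_fastq_records` is hypothetical, and the sketch assumes strict four-line FASTQ records.

```rust
use std::io::{self, BufRead, Write};

/// Copy at most `max_records` FASTQ records from `input` to `output`.
/// Assumption: every record is exactly four lines (header, sequence,
/// '+' separator, qualities); multi-line FASTQ is not handled.
/// Returns the number of complete records copied.
pub fn take_fastq_records<R: BufRead, W: Write>(
    input: R,
    mut output: W,
    max_records: usize,
) -> io::Result<usize> {
    let mut lines = input.lines();
    let mut copied = 0;
    while copied < max_records {
        // Collect one four-line record; stop at end of input.
        let mut record = Vec::with_capacity(4);
        for _ in 0..4 {
            match lines.next() {
                Some(line) => record.push(line?),
                None => break,
            }
        }
        if record.len() < 4 {
            break; // end of input, or a truncated trailing record that is dropped
        }
        if !record[0].starts_with('@') {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "FASTQ header must start with '@'",
            ));
        }
        for line in &record {
            writeln!(output, "{line}")?;
        }
        copied += 1;
    }
    Ok(copied)
}
```

Applying the same record limit to both mate files keeps the pairs in sync, provided the two inputs list their reads in the same order.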
Benchmarking
Use tools/benchmark_command.py for parity/performance measurements so Rust and
bundled C++ commands record the same metadata. Put large temporary inputs,
outputs, and JSON results under /somewhere when that
path is writable in the current environment.
Current local diagnostic snapshot (SRR1703350_10x, paired FASTQ.gz, --cores 4,
--kmer 21 --min-count 2 --vector-percent 1.0 --estimated-kmers 100 --min-contig 200, contigs-only output):
| implementation | wall time | CPU time | max RSS | output SHA-256 |
|---|---|---|---|---|
| Rust target/release/skesa-rs | 1:23.80 | 211.13s | 785,596 KB | 37788987fae96213428bb0782826823c188c3cac58ab0c1903f8bbbe28893060 |
| bundled C++ SKESA/skesa | 1:41.16 | 225.57s | 781,940 KB | 37788987fae96213428bb0782826823c188c3cac58ab0c1903f8bbbe28893060 |
Treat this as a local diagnostic run, not a portable performance claim. The Rust and C++ output files were compared byte-for-byte for this snapshot.
Record the command, platform, wall time, CPU time, max RSS, return code, and
stderr tail with the correctness fixture notes. Do not treat benchmark numbers as
parity evidence unless the corresponding outputs have already been compared. See
BENCHMARKS.md for recorded local snapshots.
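A byte-for-byte output comparison of the kind mentioned above can be done with the standard library alone. The following is a minimal sketch, not part of tools/benchmark_command.py or this crate; the function names `streams_identical` and `files_identical` are hypothetical.

```rust
use std::fs::File;
use std::io::{self, BufReader, Read};
use std::path::Path;

/// Return true when both readers yield exactly the same byte stream.
pub fn streams_identical<A: Read, B: Read>(a: A, b: B) -> io::Result<bool> {
    let mut a = BufReader::new(a).bytes();
    let mut b = BufReader::new(b).bytes();
    loop {
        match (a.next(), b.next()) {
            // Both streams ended at the same position: identical.
            (None, None) => return Ok(true),
            (Some(x), Some(y)) => {
                if x? != y? {
                    return Ok(false);
                }
            }
            // One stream ended early; still surface any read error.
            (Some(x), None) | (None, Some(x)) => {
                x?;
                return Ok(false);
            }
        }
    }
}

/// Compare two files on disk, e.g. the Rust and C++ contig outputs.
pub fn files_identical(left: &Path, right: &Path) -> io::Result<bool> {
    streams_identical(File::open(left)?, File::open(right)?)
}
```

Comparing streams directly avoids a hashing dependency; a recorded SHA-256, as in the table above, is still useful when the two outputs are produced on different machines.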
For a deterministic local smoke input, generate synthetic reads with
tools/generate_synthetic_medium_reads.py. The default medium profile has a
recorded Rust/C++ parity benchmark. The high-error and high-repeat profiles
currently expose assembly-output divergences and should be used as correctness
probes, not performance benchmarks.
Architecture
| Module | Description |
|---|---|
| large_int, kmer | Multi-precision k-mer representation (1-512bp) |
| read_holder | 2-bit packed DNA storage with zero-alloc k-mer iteration |
| reads_getter | FASTA/FASTQ/gzip reader (noodles + flate2) |
| bloom_filter, concurrent_hash | Concurrent k-mer counting structures |
| counter, sorted_counter, flat_counter | Sorted k-mer counting pipeline |
| graph_digger | De Bruijn graph traversal with fork resolution |
| assembler | Iterative multi-k assembly orchestration |
| linked_contig | ConnectFragments link chain walking |
| snp_discovery | SNP/indel detection at fork points |
| clean_reads | Read-to-contig mapping and filtering |
| paired_reads | Paired-end connection + insert size estimation |
| guided_path, guided_graph | Target-guided assembly (SAUTE) |
| spider_graph | Multi-path enumeration for GFA Connector |
| glb_align | Needleman-Wunsch, Smith-Waterman, BLOSUM62 |
| genetic_code | 25 NCBI genetic code tables |
| gfa | GFA 1.0 format output |
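To give a feel for two of the ideas in the table (2-bit packed k-mers and sort-based counting), here is a small standalone sketch. It is not the crate's implementation: the names `encode_base`, `packed_kmers`, and `count_sorted` are hypothetical, k is limited to 32 so that a k-mer fits in one u64 (the real modules support longer k-mers), and reverse complements are not merged.

```rust
/// Map a nucleotide to its 2-bit code; None for anything else (e.g. 'N').
fn encode_base(base: u8) -> Option<u64> {
    match base.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Pack every k-mer of `seq` (1 <= k <= 32) into a u64, two bits per base.
/// Windows containing a non-ACGT base are skipped.
pub fn packed_kmers(seq: &[u8], k: usize) -> Vec<u64> {
    assert!((1..=32).contains(&k), "this sketch packs a k-mer into one u64");
    let mask = if k == 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 };
    let mut kmers = Vec::new();
    let mut current = 0u64;
    let mut valid = 0usize; // consecutive valid bases seen so far
    for &base in seq {
        match encode_base(base) {
            Some(code) => {
                // Shift in the new base and drop the oldest one.
                current = ((current << 2) | code) & mask;
                valid += 1;
                if valid >= k {
                    kmers.push(current);
                }
            }
            None => {
                current = 0;
                valid = 0;
            }
        }
    }
    kmers
}

/// Count k-mers by sorting and run-length encoding, keeping those seen
/// at least `min_count` times. Output is ordered by packed k-mer value.
pub fn count_sorted(mut kmers: Vec<u64>, min_count: u32) -> Vec<(u64, u32)> {
    kmers.sort_unstable();
    let mut counts: Vec<(u64, u32)> = Vec::new();
    for kmer in kmers {
        if let Some((last, n)) = counts.last_mut() {
            if *last == kmer {
                *n += 1;
                continue;
            }
        }
        counts.push((kmer, 1));
    }
    counts.retain(|&(_, n)| n >= min_count);
    counts
}
```

For example, `count_sorted(packed_kmers(b"ACGTACGT", 3), 2)` keeps only ACG and CGT, the two 3-mers that occur twice. Sorting keeps memory access sequential, which is one common reason to prefer it over a hash map for large k-mer sets.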
Citation
This is a port of SKESA. Please cite the original work:
Alexandre Souvorov, Richa Agarwala and David J. Lipman. SKESA: strategic k-mer extension for scrupulous assemblies. Genome Biology 2018 19:153. doi.org/10.1186/s13059-018-1540-z
License
The original SKESA is public domain (US Government Work). This port follows the same terms (Unlicense).