rsomics-seqstats
Layer-A foundation primitives for sequence statistics, shared by the
rsomics-*-stats tools (rsomics-fasta-stats, rsomics-fastq-stats, …).
Library-only — no binary, no upstream to bench, perf-exempt by nature.
It is pure, format-agnostic math: feed it a Vec<u64> of record lengths and a
byte slice and it returns the seqkit-compatible length distribution and base
composition.
API
LengthStats::new(Vec<u64>)→q1()/q2()/q3()(length quartiles),n50_l50()→(N50, L50)whereL50is seqkit'sN50_num(unique-length-bucket count, not record count).count_any_of(haystack, needles)— deduped multi-byte count (GC bases, N runs, gap letters).classify(sample)/SeqType— seqkit's alphabet guess (DNA / RNA / Protein / Unlimit),Serializewith seqkit's exact JSON names. seqkit guesses from the first record only, scanning at mostDEFAULT_ALPHABET_GUESS_LENbytes — callers pass the first record's sequence prefix, never a cross-record accumulation.
Origin
This crate is a Rust port of the length-distribution math in
shenwei356/bio util/length-stats.go
and the alphabet guess in seqkit
seq.GuessAlphabet, so rsomics-*-stats --all --tabular byte-agrees with
seqkit stats -a -T. seqkit computes L50 over unique-length buckets rather
than records; that quirk is reproduced deliberately for compatibility.
Both upstreams are MIT, so the source was read and is cited directly (the clean-room rule applies only to GPL upstreams).
License: MIT OR Apache-2.0. Upstream credit: shenwei356/bio, shenwei356/seqkit (MIT).