rsomics-seqstats

Layer-A foundation primitives for sequence statistics, shared by the rsomics-*-stats tools (rsomics-fasta-stats, rsomics-fastq-stats, …). Library-only — no binary, no upstream to bench, perf-exempt by nature.

It is pure, format-agnostic math: feed it a Vec<u64> of record lengths and a byte slice and it returns the seqkit-compatible length distribution and base composition.

API

LengthStats::new(Vec<u64>) → q1() / q2() / q3() (length quartiles), n50_l50() → (N50, L50) where L50 is seqkit's N50_num (unique-length-bucket count, not record count).
count_any_of(haystack, needles) — deduped multi-byte count (GC bases, N runs, gap letters).
classify(sample) / SeqType — seqkit's alphabet guess (DNA / RNA / Protein / Unlimit), Serialize with seqkit's exact JSON names. seqkit guesses from the first record only, scanning at most DEFAULT_ALPHABET_GUESS_LEN bytes — callers pass the first record's sequence prefix, never a cross-record accumulation.

Origin

This crate is a Rust port of the length-distribution math in shenwei356/bio util/length-stats.go and the alphabet guess in seqkit seq.GuessAlphabet, so rsomics-*-stats --all --tabular byte-agrees with seqkit stats -a -T. seqkit computes L50 over unique-length buckets rather than records; that quirk is reproduced deliberately for compatibility.

Both upstreams are MIT, so the source was read and is cited directly (the clean-room rule applies only to GPL upstreams).

License: MIT OR Apache-2.0. Upstream credit: shenwei356/bio, shenwei356/seqkit (MIT).

rsomics-seqstats 0.2.0

rsomics-seqstats

API

Origin