rsomics-seqstats 0.2.0

Format-agnostic sequence statistics primitives (length distribution: N50/L50/Nx and quartiles, base composition, alphabet guess). A Layer-A crate, shared verbatim by the rsomics-*-stats tools so that their output byte-agrees with `seqkit stats -a`.
Repository: Bengerthelorf/rsomics-world
Owner: Bengerthelorf

rsomics-seqstats

Layer-A foundation primitives for sequence statistics, shared by the rsomics-*-stats tools (rsomics-fasta-stats, rsomics-fastq-stats, …). Library-only — no binary, no upstream to bench, perf-exempt by nature.

It is pure, format-agnostic math: feed it a Vec<u64> of record lengths and a byte slice and it returns the seqkit-compatible length distribution and base composition.

API

  • LengthStats::new(Vec<u64>), then q1() / q2() / q3() (length quartiles) and n50_l50(), which returns (N50, L50), where L50 is seqkit's N50_num (unique-length-bucket count, not record count).
  • count_any_of(haystack, needles) — deduped multi-byte count (GC bases, N runs, gap letters).
  • classify(sample) / SeqType: seqkit's alphabet guess (DNA / RNA / Protein / Unlimit); SeqType implements Serialize with seqkit's exact JSON names. seqkit guesses from the first record only, scanning at most DEFAULT_ALPHABET_GUESS_LEN bytes, so callers pass the first record's sequence prefix, never a cross-record accumulation.
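
To make the "deduped multi-byte count" idea in the count_any_of entry concrete, here is a minimal standalone sketch. It does not call the crate: the function name count_any_of_sketch, its signature, and the lookup-table approach are assumptions chosen for illustration, and the crate's real implementation may differ.

```rust
/// Hypothetical stand-in for a deduplicated multi-byte count.
/// Not the crate's code; the name and signature are assumptions.
fn count_any_of_sketch(haystack: &[u8], needles: &[u8]) -> usize {
    // A 256-entry membership table dedupes the needles for free:
    // listing b'G' twice still sets a single slot, so no byte of the
    // haystack can be counted more than once.
    let mut wanted = [false; 256];
    for &n in needles {
        wanted[usize::from(n)] = true;
    }
    haystack.iter().filter(|&&b| wanted[usize::from(b)]).count()
}

fn main() {
    let seq = b"ACGTNNgc--";

    // GC bases, matching both cases by listing both.
    println!("GC   = {}", count_any_of_sketch(seq, b"GCgc")); // prints "GC   = 4"
    // Duplicate needles do not inflate the count.
    println!("GC'  = {}", count_any_of_sketch(seq, b"GGCCgc")); // prints "GC'  = 4"
    // Ambiguous bases and gap letters use the same routine.
    println!("N    = {}", count_any_of_sketch(seq, b"Nn")); // prints "N    = 2"
    println!("gaps = {}", count_any_of_sketch(seq, b"-.")); // prints "gaps = 2"
}
```

A single pass over the haystack with a table lookup keeps the cost independent of how many needles are requested, which is one plausible reason to share one routine across GC, N, and gap counts.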

Origin

This crate is a Rust port of the length-distribution math in shenwei356/bio util/length-stats.go and the alphabet guess in seqkit seq.GuessAlphabet, so rsomics-*-stats --all --tabular byte-agrees with seqkit stats -a -T. seqkit computes L50 over unique-length buckets rather than records; that quirk is reproduced deliberately for compatibility.
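
The difference between a bucket-based and a record-based L50 is easiest to see on a small input. The sketch below is standalone and does not use the crate's API: the function names n50_and_bucket_l50 and record_l50 are hypothetical, and the "cumulative length reaches at least half the total" threshold is an assumption about tie handling rather than a confirmed detail of either upstream.

```rust
use std::collections::BTreeMap;

/// Illustrative only: returns (N50, L50 counted in unique-length buckets).
/// Walks distinct lengths from longest to shortest and stops once the
/// cumulative length reaches at least half of the total (assumed threshold).
fn n50_and_bucket_l50(lengths: &[u64]) -> Option<(u64, usize)> {
    let total: u128 = lengths.iter().map(|&l| u128::from(l)).sum();

    // length -> number of records with that length
    let mut buckets: BTreeMap<u64, u64> = BTreeMap::new();
    for &len in lengths {
        *buckets.entry(len).or_insert(0) += 1;
    }

    let mut acc: u128 = 0;
    for (i, (&len, &count)) in buckets.iter().rev().enumerate() {
        acc += u128::from(len) * u128::from(count);
        if 2 * acc >= total {
            return Some((len, i + 1)); // i + 1 buckets visited
        }
    }
    None // only reached for an empty input
}

/// Conventional L50 for comparison: number of records, longest first,
/// needed to reach at least half of the total length.
fn record_l50(lengths: &[u64]) -> usize {
    let total: u128 = lengths.iter().map(|&l| u128::from(l)).sum();
    let mut sorted = lengths.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a));

    let mut acc: u128 = 0;
    for (i, &len) in sorted.iter().enumerate() {
        acc += u128::from(len);
        if 2 * acc >= total {
            return i + 1;
        }
    }
    0 // empty input
}

fn main() {
    let lengths = [100, 100, 100, 50, 50, 20];
    // Total is 420; the three 100-long records form a single bucket.
    println!("{:?}", n50_and_bucket_l50(&lengths)); // prints "Some((100, 1))"
    println!("{}", record_l50(&lengths)); // prints "3"
}
```

Here both approaches agree on N50 = 100, but the bucket count is 1 while the record count is 3, which is the kind of divergence a byte-for-byte comparison against seqkit output would expose.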

Both upstreams are MIT, so the source was read and is cited directly (the clean-room rule applies only to GPL upstreams).

License: MIT OR Apache-2.0. Upstream credit: shenwei356/bio, shenwei356/seqkit (MIT).