
Crate rsomics_seqstats


Format-agnostic sequence-statistics primitives shared by the rsomics-*-stats tools. The N50/L50/quartile math is a port of shenwei356/bio util/length-stats.go; the alphabet guess mirrors seqkit’s seq.GuessAlphabet. Sharing this code unchanged across formats (rather than re-deriving it per format) is what lets --all --tabular output agree byte for byte with seqkit stats -a -T for both FASTA and FASTQ.

Structs§

LengthStats
Port of bio/util/length-stats.go. seqkit’s L50 counts unique-length buckets, not records — reproduced so --tabular --all agrees with seqkit.
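The difference between bucket-counted and record-counted L50 is easy to miss, so here is a minimal sketch of the bucket-counted variant. It is not the crate's LengthStats API: the function name, the BTreeMap representation, and the "at least half of the total bases" threshold are assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper (not part of rsomics_seqstats): returns
/// `(N50, L50)` where L50 counts unique-length buckets, not records.
fn n50_l50_by_bucket(lengths: &[u64]) -> Option<(u64, usize)> {
    // Group records into buckets: sequence length -> number of records.
    let mut buckets: BTreeMap<u64, u64> = BTreeMap::new();
    for &len in lengths {
        *buckets.entry(len).or_insert(0) += 1;
    }

    let total: u64 = lengths.iter().sum();
    let mut acc = 0u64;

    // Walk the buckets from the longest length to the shortest.
    for (visited, (&len, &count)) in buckets.iter().rev().enumerate() {
        acc += len * count;
        // Assumption: the cut-off is "at least half of all bases".
        if 2 * acc >= total {
            // `visited + 1` is the number of distinct lengths consumed.
            return Some((len, visited + 1));
        }
    }
    None // only reached when `lengths` is empty
}
```

For the lengths 10, 10, 10, 5, 5, 2 (42 bases in total) the single bucket of length 10 already covers 30 bases, so this sketch returns N50 = 10 and L50 = 1. Counting records instead would give L50 = 3, which is why the two conventions cannot be mixed when comparing against seqkit.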

Enums§

SeqType

Constants§

DEFAULT_ALPHABET_GUESS_LEN
seqkit guesses the sequence type from the first record only, scanning at most this many bytes of it (seq.AlphabetGuessSeqLengthThreshold, seqkit’s --alphabet-guess-seq-length default). Callers pass the first record’s sequence truncated to this length; accumulating bytes across several records diverges from seqkit’s result.
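The truncation a caller is expected to perform can be sketched as follows. The helper name and the constant's value are assumptions for illustration; real code should read the value from DEFAULT_ALPHABET_GUESS_LEN rather than hard-coding it.

```rust
/// Assumed value, for illustration only. Use the crate's
/// DEFAULT_ALPHABET_GUESS_LEN constant in real code.
const GUESS_LEN: usize = 10_000;

/// Hypothetical helper (not part of rsomics_seqstats): the slice of the
/// FIRST record's sequence that should be handed to the alphabet guess.
/// Later records are never appended, even if the first one is short.
fn guess_prefix(first_record_seq: &[u8], limit: usize) -> &[u8] {
    // Clamp so that records shorter than `limit` are passed whole.
    &first_record_seq[..first_record_seq.len().min(limit)]
}
```

A record shorter than the limit is passed through unchanged; the prefix is never topped up from the second record.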

Functions§

classify
seqkit’s alphabet guess over one sequence (pass the first record’s prefix — see DEFAULT_ALPHABET_GUESS_LEN): any protein-only residue ⇒ Protein; else U-without-T ⇒ RNA; else DNA. An empty or non-bio sequence ⇒ Unlimit. Ambiguity codes and gaps do not decide the type.
count_any_of
Count every byte of haystack equal to any byte in needles, deduping needles so overlapping classes (e.g. b"GCgc") are not double-counted.
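The two functions above can be approximated by the sketch below. It is written from the descriptions only, so everything in it is an assumption rather than the crate's code: the signatures, the `_sketch` names, the stand-in enum, the set of protein-only residues (EFILPQZ), and the definition of "non-bio" as any byte that is neither an ASCII letter nor one of `-`, `.`, `*`.

```rust
/// Stand-in for the crate's SeqType enum (variant names assumed).
#[derive(Debug, PartialEq, Eq)]
enum SeqTypeSketch {
    Dna,
    Rna,
    Protein,
    Unlimit,
}

/// Count bytes of `haystack` that match any byte in `needles`.
/// A 256-entry lookup table dedupes `needles` implicitly, so a needle
/// listed twice still counts each haystack byte only once.
fn count_any_of_sketch(haystack: &[u8], needles: &[u8]) -> usize {
    let mut wanted = [false; 256];
    for &b in needles {
        wanted[b as usize] = true;
    }
    haystack.iter().filter(|&&b| wanted[b as usize]).count()
}

/// Guess the alphabet of one sequence prefix, following the documented
/// order: protein-only residue, then U-without-T, then DNA.
fn classify_sketch(prefix: &[u8]) -> SeqTypeSketch {
    // Assumption: characters treated as gaps or terminators.
    const GAPS: &[u8] = b"-.*";
    // Assumption: residues that occur in proteins but in no
    // nucleotide alphabet, IUPAC ambiguity codes included.
    const PROTEIN_ONLY: &[u8] = b"EFILPQZefilpqz";

    let is_bio = |b: &u8| b.is_ascii_alphabetic() || GAPS.contains(b);
    if prefix.is_empty() || !prefix.iter().all(is_bio) {
        return SeqTypeSketch::Unlimit;
    }
    if count_any_of_sketch(prefix, PROTEIN_ONLY) > 0 {
        return SeqTypeSketch::Protein;
    }
    let has_u = count_any_of_sketch(prefix, b"Uu") > 0;
    let has_t = count_any_of_sketch(prefix, b"Tt") > 0;
    if has_u && !has_t {
        SeqTypeSketch::Rna
    } else {
        SeqTypeSketch::Dna
    }
}
```

Because ambiguity codes and gaps never trigger a branch of their own, a prefix such as ACGTN- falls through to DNA, while a single U next to a T is not enough to call the sequence RNA.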