Format-agnostic sequence-statistics primitives shared by the
rsomics-*-stats tools. The N50/L50/quartile math is a port of
shenwei356/bio util/length-stats.go; the alphabet guess mirrors
seqkit’s seq.GuessAlphabet. Sharing this verbatim (rather than
re-deriving per format) is what lets --all --tabular byte-agree with
seqkit stats -a -T for both FASTA and FASTQ.
Structs§
- LengthStats
  Port of bio/util/length-stats.go. seqkit’s L50 counts unique-length buckets, not records; this is reproduced so that --tabular --all agrees with seqkit.
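The bucket-based L50 described above can be sketched as follows. This is a minimal illustration, not the crate's code: the function name `n50_l50_sketch` and its signature are hypothetical, and it assumes N50 is the largest length at which the cumulative bases, summed from the longest lengths down, reach half of the total.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper (not part of this crate's API): returns (N50, L50),
/// where L50 counts distinct-length buckets rather than records.
fn n50_l50_sketch(lengths: &[u64]) -> Option<(u64, usize)> {
    if lengths.is_empty() {
        return None;
    }

    // Group records by length: length -> number of records with that length.
    let mut buckets: BTreeMap<u64, u64> = BTreeMap::new();
    for &len in lengths {
        *buckets.entry(len).or_insert(0) += 1;
    }

    // Plain u64 arithmetic; a sketch that ignores overflow on huge inputs.
    let total: u64 = lengths.iter().sum();
    let mut acc: u64 = 0;

    // Walk the buckets from the longest length to the shortest.
    for (i, (&len, &count)) in buckets.iter().rev().enumerate() {
        acc += len * count;
        // Integer form of "acc >= total / 2".
        if 2 * acc >= total {
            // i + 1 is the number of buckets consumed, not the number of records.
            return Some((len, i + 1));
        }
    }

    // Not reached for non-empty input; kept to satisfy the type checker.
    None
}
```

For lengths 10, 10, 5, 5, 5 this sketch yields N50 = 10 and L50 = 1, whereas counting records would give L50 = 2.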
Enums§
Constants§
- DEFAULT_ALPHABET_GUESS_LEN
  seqkit guesses the sequence type from the first record only, scanning at most this many bytes of it (seq.AlphabetGuessSeqLengthThreshold, seqkit’s --alphabet-guess-seq-length default). Callers pass the first record’s sequence truncated to this length; accumulating bytes across records diverges from seqkit’s result.
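As a rough illustration of the truncation callers are expected to perform, the sketch below takes a prefix of the first record's sequence. The helper name `guess_prefix` and the `max_len` parameter are hypothetical; in real use the limit would be DEFAULT_ALPHABET_GUESS_LEN, whose value is not shown here.

```rust
/// Hypothetical helper: returns at most `max_len` leading bytes of the
/// first record's sequence, without copying.
fn guess_prefix(first_record_seq: &[u8], max_len: usize) -> &[u8] {
    // `min` keeps the slice in bounds when the record is shorter than the limit.
    let end = first_record_seq.len().min(max_len);
    &first_record_seq[..end]
}
```

Only the first record is sliced this way; later records are never appended to the guess input.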
Functions§
- classify
  seqkit’s alphabet guess over one sequence (pass the first record’s prefix; see DEFAULT_ALPHABET_GUESS_LEN): any protein-only residue ⇒ Protein; else U-without-T ⇒ RNA; else DNA. An empty or non-bio sequence ⇒ Unlimit. Ambiguity codes and gaps do not decide the type.
- count_any_of
  Count every byte of haystack equal to any byte in needles, deduping needles so overlapping classes (e.g. b"GCgc") are not double-counted.
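The two functions above can be approximated by the sketch below. It is not the crate's implementation: the names `GuessedAlphabet`, `classify_sketch`, and `count_any_of_sketch` are hypothetical, the real signatures are not shown on this page, and both the protein-only residue set (E, F, I, L, P, Q) and the gap characters (`-`, `.`) are assumptions chosen for illustration.

```rust
/// Hypothetical result type mirroring the four outcomes named in the docs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GuessedAlphabet {
    Dna,
    Rna,
    Protein,
    Unlimit,
}

/// Sketch of the decision order: protein-only residue, then U-without-T, then DNA.
fn classify_sketch(seq: &[u8]) -> GuessedAlphabet {
    if seq.is_empty() {
        return GuessedAlphabet::Unlimit;
    }
    let (mut has_t, mut has_u, mut has_protein_only) = (false, false, false);
    for &b in seq {
        match b.to_ascii_uppercase() {
            b'T' => has_t = true,
            b'U' => has_u = true,
            // Assumption: letters with no IUPAC nucleotide meaning.
            b'E' | b'F' | b'I' | b'L' | b'P' | b'Q' => has_protein_only = true,
            // Other letters (bases, ambiguity codes) and assumed gap bytes decide nothing.
            b'A'..=b'Z' | b'-' | b'.' => {}
            // Assumption: any other byte makes the sequence "non-bio".
            _ => return GuessedAlphabet::Unlimit,
        }
    }
    if has_protein_only {
        GuessedAlphabet::Protein
    } else if has_u && !has_t {
        GuessedAlphabet::Rna
    } else {
        GuessedAlphabet::Dna
    }
}

/// Sketch of needle deduplication via a 256-entry membership table:
/// a repeated needle sets the same slot twice, so no byte is counted twice.
fn count_any_of_sketch(haystack: &[u8], needles: &[u8]) -> usize {
    let mut wanted = [false; 256];
    for &n in needles {
        wanted[n as usize] = true;
    }
    haystack.iter().filter(|&&b| wanted[b as usize]).count()
}
```

With this sketch, `classify_sketch(b"ACGU")` gives `Rna` while `classify_sketch(b"ACGUT")` gives `Dna`, and `count_any_of_sketch(b"GCAT", b"GGCC")` counts 2 bytes rather than 4.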