Skip to main content

Crate rsomics_bam_calmd

Crate rsomics_bam_calmd 

Source
Expand description

samtools calmd port: recompute the MD and NM aux tags of every alignment against a reference FASTA, then re-emit the BAM.

The MD/NM walk mirrors bam_fillmd1_core in samtools bam_md.c exactly, including its nibble-level base comparison (htslib seq_nt16_table / bam_seqi), its run-length MD string with ^-prefixed deletions, and its “only rewrite the tag when the value differs” guard so unchanged records stay byte-for-byte identical (matching aux ordering and integer subtype).

Reading uses the shared rsomics_bamio reader (libdeflate BGZF, parallel at workers >= 2); records are processed on the raw BAM byte level — seq/qual/cigar are never decoded into noodles types. MD/NM are written directly into the raw aux tail via RawRecord::set_aux. This avoids the full RecordBuf decode+re-encode round-trip (the former bottleneck at -t4 and above, accounting for 67% of wall time). Output goes through the bamio work-stealing BGZF writer — at workers >= 2 via the WsBatchBamWriter, which frames each batch on its own thread so the main thread’s per-record write cost no longer caps multi-thread throughput.

The reference is read from an indexed FASTA (.fai); each contig is fetched once on first use and cached for the run. Coordinate-sorted input therefore touches each contig once and never re-reads it — the common calmd case.

At workers >= 2 the MD/NM computation is parallelised with rayon: records are collected into a batch, the needed contigs are fetched serially into a shared read-only map (Arc<Vec<u8>> per contig), then par_iter_mut runs the raw MD/NM pass on every record simultaneously. Output is written in original batch order so the byte stream is identical to the serial path.

Structs§

CalmdOpts
CalmdStats

Functions§

calmd