// fgumi_consensus/base_builder.rs
//! # Consensus Base Calling
//!
//! This module provides the core likelihood-based algorithm for calling a consensus base at a
//! single position from multiple observations. This is the fundamental building block used by
//! all consensus callers to determine the most likely base and assign an accurate quality score.
//!
//! ## Overview
//!
//! Given multiple observations of a position (base + quality score pairs), the consensus base
//! builder uses Bayesian inference to determine:
//! 1. Which base (A, C, G, or T) is most likely to be the true base
//! 2. What quality score to assign to that consensus call
//!
//! This differs from simple majority voting by:
//! - Incorporating quality scores (low-quality observations contribute less)
//! - Using a probabilistic model that can properly handle ambiguous cases
//! - Generating quality scores that accurately reflect confidence in the call
//!
//! ## Likelihood-Based Model
//!
//! For each candidate base B ∈ {A, C, G, T}, we calculate the likelihood of observing all the
//! data given that B is the true base:
//!
//! ```text
//! L(B | observations) = ∏ P(obs_i | B)
//! ```
//!
//! Where for each observation:
//! - **If the observed base matches B**: `P(obs | B) = 1 - error_rate`
//! - **If the observed base differs from B**: `P(obs | B) = error_rate / 3`
//!
//! The division by 3 assumes equal probability of the three incorrect bases.
//!
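//! For example, with two Q30 observations of 'A' (error_rate = 10^-3; the post-UMI
//! adjustment is ignored here for simplicity):
//!
//! ```text
//! L(A) = (1 - 10^-3)^2     ≈ 0.998
//! L(C) = (10^-3 / 3)^2     ≈ 1.1 × 10^-7    (likewise for G and T)
//! ```
//!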
//! ## Quality Score Calculation
//!
//! After computing likelihoods for all four bases:
//!
//! 1. **Normalize** to get posterior probabilities:
//!    ```text
//!    P(B | data) = L(B) / (L(A) + L(C) + L(G) + L(T))
//!    ```
//!
//! 2. **Select** the base with the maximum posterior probability
//!
//! 3. **Calculate the error probability**:
//!    ```text
//!    P_error = 1 - P(best_base | data)
//!    ```
//!
//! 4. **Convert to the Phred scale**:
//!    ```text
//!    Q = -10 * log10(P_error)
//!    ```
//!
//! This Phred score represents the probability that the consensus call is wrong, properly
//! accounting for all the evidence (both supporting and contradicting).
//!
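//! For instance, if the winning base ends up with posterior probability 0.999:
//!
//! ```text
//! P_error = 1 - 0.999 = 0.001   →   Q = -10 * log10(0.001) = 30
//! ```
//!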
//! ## Error Rate Priors
//!
//! The model incorporates two sources of error:
//!
//! ### Pre-UMI Error Rate
//!
//! Errors present in the original molecule before PCR amplification. These errors appear
//! in all reads from that molecule, so they cannot be corrected by consensus calling. This
//! rate places a ceiling on the achievable consensus quality.
//!
//! **Typical value**: Q45 (error rate ~3 × 10^-5)
//!
//! ### Post-UMI Error Rate
//!
//! Errors introduced during PCR amplification or sequencing. These are independent across
//! reads and can be corrected by consensus calling.
//!
//! **Typical value**: Q40 (error rate ~1 × 10^-4)
//!
//! ### Combined Error Model
//!
//! For each observation with quality Q, the adjusted error probability accounts for both
//! post-UMI errors (from amplification and sequencing) and the quality score:
//!
//! ```text
//! P_adjusted_error = P_post_umi_error + (1 - P_post_umi_error) * P_sequencing_error
//! ```
//!
//! This adjusted probability is used in the likelihood calculation.
//!
//! After calling the consensus, the pre-UMI error rate is incorporated:
//!
//! ```text
//! P_final_error = P_pre_umi_error + (1 - P_pre_umi_error) * P_consensus_error
//! ```
//!
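//! For example, with a Q40 post-UMI error rate and a Q30 base quality:
//!
//! ```text
//! P_adjusted_error = 10^-4 + (1 - 10^-4) × 10^-3 ≈ 1.1 × 10^-3    (≈ Q29.6)
//! ```
//!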
//! ## Mathematical Details
//!
//! The implementation uses **log-space arithmetic** to avoid numerical underflow:
//!
//! - All probabilities are stored as natural logarithms
//! - Products become sums: `log(a * b) = log(a) + log(b)`
//! - Sums use the log-sum-exp trick: `log(e^a + e^b) = log_sum_exp(a, b)`
//!
//! This allows accurate computation even when individual likelihoods are extremely small
//! (e.g., 10^-100).
//!
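//! The log-sum-exp trick subtracts the maximum log value before exponentiating, which keeps
//! every exponential in a representable range:
//!
//! ```text
//! log_sum_exp(a, b) = m + ln(e^(a - m) + e^(b - m)),   where m = max(a, b)
//! ```
//!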
//! ## Usage in Consensus Callers
//!
//! Consensus callers typically use `ConsensusBaseBuilder` as follows:
//!
//! ```rust,ignore
//! use fgumi_lib::consensus::base_builder::ConsensusBaseBuilder;
//!
//! // Create a builder with error rates
//! let mut builder = ConsensusBaseBuilder::new(
//!     45, // Pre-UMI error rate (Q45)
//!     40, // Post-UMI error rate (Q40)
//! );
//!
//! // For each position in the read:
//! for position in 0..read_length {
//!     builder.reset(); // Clear the previous position
//!
//!     // Add observations from all reads
//!     for read in &reads {
//!         let base = read.sequence()[position];
//!         let qual = read.quality_scores()[position];
//!         builder.add(base, qual);
//!     }
//!
//!     // Call consensus for this position
//!     let (consensus_base, consensus_qual) = builder.call();
//!
//!     // Check depth
//!     let depth = builder.contributions();
//! }
//! ```
//!
//! ## Handling Edge Cases
//!
//! ### No Observations
//! Returns `(N, 2)`: a no-call at the minimum Phred score (`MIN_PHRED`).
//!
//! ### N Bases
//! N (no-call) bases in input reads are ignored; they contribute no evidence for any
//! particular base.
//!
//! ### Ties
//! If multiple bases have exactly equal posterior probability, returns `(N, 2)` to indicate
//! ambiguity. In practice this is rare, due to quality score differences.
//!
//! ### Low Depth
//! With only 1-2 observations, the consensus quality will be similar to the input qualities.
//! The model handles this naturally: more observations lead to higher confidence.
//!
//! ## Quality Score Bounds
//!
//! Consensus Phred scores are clamped to a fixed range:
//! - **Minimum**: Q2 (error rate ~0.63)
//! - **Maximum**: Q93 (error rate ~5 × 10^-10)
//!
//! These bounds ensure quality scores fit in standard BAM format and remain numerically stable.
//!
//! ## Performance Characteristics
//!
//! - **Time complexity**: O(n), where n = number of observations at a position
//! - **Space complexity**: O(1), using fixed-size arrays for the four bases
//! - **Caching**: Pre-computes error tables for all possible quality scores (Q0-Q93) to avoid
//!   repeated calculations
//!
//! ## Example Quality Progression
//!
//! Consider calling consensus from multiple reads that all observe a Q30 'A' (illustrative
//! values; the exact numbers depend on the error-rate priors):
//!
//! | Reads | Observations | Consensus Quality |
//! |-------|--------------|-------------------|
//! | 1     | A            | ~Q30              |
//! | 2     | AA           | ~Q40              |
//! | 5     | AAAAA        | ~Q50              |
//! | 10    | AAAAAAAAAA   | ~Q60              |
//! | 20    | (20x A)      | ~Q70              |
//!
//! The consensus quality grows with depth, reflecting the compounding evidence that all
//! observations support the same base, until it saturates at the ceiling imposed by the
//! pre-UMI error rate.
//!
//! ## See Also
//!
//! - `vanilla_consensus_caller`: Uses this builder to call consensus across entire reads
//! - `duplex_caller`: Uses this builder for single-strand consensus before duplex combination
//! - `phred`: Utility functions for Phred score and probability conversions

use crate::phred::{
    LN_ONE, LogProbability, MAX_PHRED, MIN_PHRED, NO_CALL_BASE, PhredScore,
    ln_error_prob_two_trials, ln_normalize, ln_not, ln_prob_to_phred, ln_sum_exp_array,
    phred_to_ln_error_prob,
};
use approx::abs_diff_eq;
use std::cmp::Ordering;
use wide::f64x4;

/// The four DNA bases, in the order used throughout consensus calling
const DNA_BASES: [u8; 4] = [b'A', b'C', b'G', b'T'];
const DNA_BASE_COUNT: usize = 4;

/// Lookup table for converting an ASCII base to an index (0-3 for A, C, G, T; 255 for invalid).
/// This avoids a loop + compare for every base observation.
const BASE_TO_INDEX: [u8; 256] = {
    let mut table = [255u8; 256];
    table[b'A' as usize] = 0;
    table[b'a' as usize] = 0;
    table[b'C' as usize] = 1;
    table[b'c' as usize] = 1;
    table[b'G' as usize] = 2;
    table[b'g' as usize] = 2;
    table[b'T' as usize] = 3;
    table[b't' as usize] = 3;
    table
};

/// Builder for calling consensus at a single base position
///
/// Accumulates observations (base + quality) and uses a likelihood model to
/// call the consensus base and quality.
///
/// Uses the Kahan summation algorithm with SIMD vectorization to maintain numeric stability
/// when accumulating log-likelihoods across many observations. The 4 DNA bases map
/// perfectly to f64x4 (256-bit AVX2), enabling parallel computation of all 4 likelihoods.
pub struct ConsensusBaseBuilder {
    /// Log-likelihoods for each of the four bases (A, C, G, T), SIMD vectorized
    likelihoods: f64x4,

    /// Kahan summation compensation terms for each base, SIMD vectorized
    compensations: f64x4,

    /// Count of observations for each base
    observations: [u16; DNA_BASE_COUNT],

    /// Pre-computed correct probabilities adjusted for post-UMI errors
    adjusted_correct_table: Vec<LogProbability>,

    /// Pre-computed 1/3 of the error probability (for wrong-base likelihoods)
    adjusted_error_per_alt: Vec<LogProbability>,

    /// Pre-UMI error rate (log probability)
    ln_error_pre_umi: LogProbability,
}

impl ConsensusBaseBuilder {
    /// Creates a new consensus base builder
    ///
    /// # Arguments
    /// * `error_rate_pre_umi` - Phred-scaled error rate prior to UMI integration
    /// * `error_rate_post_umi` - Phred-scaled error rate after UMI integration
    #[must_use]
    pub fn new(error_rate_pre_umi: PhredScore, error_rate_post_umi: PhredScore) -> Self {
        let ln_error_post = phred_to_ln_error_prob(error_rate_post_umi);

        // Pre-compute adjusted probabilities for all possible quality scores.
        // Must start from 0 since we index by Phred score directly.
        let mut adjusted_correct_table = Vec::with_capacity(MAX_PHRED as usize + 1);
        let mut adjusted_error_per_alt = Vec::with_capacity(MAX_PHRED as usize + 1);

        for q in 0..=MAX_PHRED {
            let ln_error_seq = phred_to_ln_error_prob(q);
            let adjusted_error = ln_error_prob_two_trials(ln_error_post, ln_error_seq);

            adjusted_correct_table.push(ln_not(adjusted_error));
            // The error for any specific alternate base is 1/3 of the total error:
            // ln(p/3) = ln(p) - ln(3)
            adjusted_error_per_alt.push(adjusted_error - 3.0_f64.ln());
        }

        Self {
            likelihoods: f64x4::splat(LN_ONE),
            compensations: f64x4::splat(0.0),
            observations: [0; DNA_BASE_COUNT],
            adjusted_correct_table,
            adjusted_error_per_alt,
            ln_error_pre_umi: phred_to_ln_error_prob(error_rate_pre_umi),
        }
    }

    /// Resets the builder to process a new position
    pub fn reset(&mut self) {
        self.likelihoods = f64x4::splat(LN_ONE);
        self.compensations = f64x4::splat(0.0);
        self.observations.fill(0);
    }

    /// Adds an observation (base + quality) to the consensus
    ///
    /// Uses SIMD-vectorized Kahan summation for numeric stability when accumulating
    /// log-likelihoods. All 4 base likelihoods are updated in parallel using `f64x4`.
    ///
    /// # Arguments
    /// * `base` - The observed base (A, C, G, T, or N)
    /// * `qual` - The base quality score
    pub fn add(&mut self, base: u8, qual: PhredScore) {
        // Use the lookup table to convert the base to an index (handles upper and lower case)
        let matching_idx = BASE_TO_INDEX[base as usize];

        // Ignore N bases and other invalid bases (index 255)
        if matching_idx == 255 {
            return;
        }

        let matching_idx = matching_idx as usize;

        // Get the adjusted probabilities for this quality
        let qual_idx = qual.min(MAX_PHRED) as usize;
        let ln_correct = self.adjusted_correct_table[qual_idx];
        let ln_error_per_base = self.adjusted_error_per_alt[qual_idx];

        // Build the values array: error for all bases, except ln_correct at the matching position
        let mut values = [ln_error_per_base; 4];
        values[matching_idx] = ln_correct;
        let values = f64x4::from(values);

        // SIMD-vectorized Kahan summation for numeric stability
        // y = value - compensation (recovers the lost low-order bits)
        let y = values - self.compensations;
        // t = sum + y (the new sum, but low-order bits of y may be lost)
        let t = self.likelihoods + y;
        // compensation = (t - sum) - y (algebraically zero, but captures the lost bits)
        self.compensations = (t - self.likelihoods) - y;
        // Update the sum
        self.likelihoods = t;

        self.observations[matching_idx] += 1;
    }

    /// Fast path for unanimous consensus when only one base is observed.
    ///
    /// Returns `Some((base, quality))` if all observations are the same base,
    /// `None` otherwise (requiring full probabilistic consensus).
    ///
    /// This optimization avoids the expensive log-sum-exp calculation when
    /// all reads agree and the likelihood difference is large enough to
    /// guarantee high-quality output.
    #[inline]
    fn try_unanimous_fast_path(&self) -> Option<(u8, PhredScore)> {
        // Count how many bases have observations
        let mut observed_base_idx = None;
        let mut num_bases_observed = 0;

        for (i, &count) in self.observations.iter().enumerate() {
            if count > 0 {
                num_bases_observed += 1;
                observed_base_idx = Some(i);
                if num_bases_observed > 1 {
                    // Multiple bases observed - need full consensus
                    return None;
                }
            }
        }

        // Unanimous: only one base observed
        let base_idx = observed_base_idx?;

        // Check whether the likelihood difference is large enough to skip the full calculation.
        // When ln(L_winner) - ln(L_loser) > threshold, the posterior is essentially 1.
        // threshold = 23 corresponds to a likelihood ratio > 10^10, giving Q > 90.
        #[expect(
            clippy::items_after_statements,
            reason = "const is logically scoped to this fast-path check"
        )]
        const FAST_PATH_THRESHOLD: f64 = 23.0;

        let likelihoods = self.likelihoods.as_array();
        let winner_ll = likelihoods[base_idx];
        let loser_ll = likelihoods[(base_idx + 1) % 4]; // Any loser will do (all the same)

        if winner_ll - loser_ll > FAST_PATH_THRESHOLD {
            // Very high confidence - return the quality capped by the pre-UMI error rate.
            // The pre-UMI error rate limits the maximum achievable quality.
            let max_qual = ln_prob_to_phred(self.ln_error_pre_umi);
            return Some((DNA_BASES[base_idx], max_qual));
        }

        // Likelihood difference not large enough - fall through to the full calculation
        None
    }

    /// Calls the consensus base and quality
    ///
    /// Returns (`consensus_base`, `consensus_quality`).
    /// Returns (`N`, `MIN_PHRED`) if there are no observations or multiple equally likely bases.
    ///
    /// # Panics
    ///
    /// Panics if there are observations but no maximum-likelihood base is found
    /// (should not occur in practice, since we check for ties).
    #[must_use]
    pub fn call(&self) -> (u8, PhredScore) {
        if self.contributions() == 0 {
            return (NO_CALL_BASE, MIN_PHRED);
        }

        // Fast path: check for unanimous consensus (only one base observed).
        // This is common in high-quality data and avoids the expensive log-sum-exp.
        if let Some((base, qual)) = self.try_unanimous_fast_path() {
            return (base, qual);
        }

        // Extract the likelihoods as an array for iteration and function calls
        let likelihoods = self.likelihoods.as_array();

        // Compute the sum of likelihoods for normalization
        let ln_sum = ln_sum_exp_array(likelihoods);

        // Find the maximum likelihood and check whether it is unique
        let mut max_likelihood = f64::NEG_INFINITY;
        let mut max_index = None;
        let mut tie = false;

        for (i, &ll) in likelihoods.iter().enumerate() {
            match ll.partial_cmp(&max_likelihood) {
                Some(Ordering::Greater) => {
                    max_likelihood = ll;
                    max_index = Some(i);
                    tie = false;
                }
                Some(Ordering::Equal) => {
                    tie = true;
                }
                Some(Ordering::Less) => {
                    // Check for an epsilon-level tie (within machine precision)
                    if abs_diff_eq!(ll, max_likelihood, epsilon = f64::EPSILON) {
                        tie = true;
                    }
                }
                None => {}
            }
        }

        // If there's a tie, return a no-call
        if tie || max_index.is_none() {
            return (NO_CALL_BASE, MIN_PHRED);
        }

        let max_idx = max_index.unwrap();
        let consensus_base = DNA_BASES[max_idx];

        // Calculate the posterior probability
        let ln_posterior = ln_normalize(max_likelihood, ln_sum);

        // Convert to an error probability
        let ln_consensus_error = ln_not(ln_posterior);

        // Factor in the pre-UMI error rate:
        //   P(final_error) = P(pre_UMI_error AND correct_consensus)
        //                  + P(no_pre_UMI_error AND consensus_error)
        //                  + P(pre_UMI_error AND consensus_error that don't cancel)
        // Approximation: just combine the two error rates.
        let ln_final_error = ln_error_prob_two_trials(self.ln_error_pre_umi, ln_consensus_error);

        // Convert to a Phred score (capped at the maximum)
        let qual = ln_prob_to_phred(ln_final_error);

        (consensus_base, qual)
    }

    /// Returns the total number of contributing observations
    ///
    /// This is the sum of observations across all four bases.
    #[must_use]
    pub fn contributions(&self) -> u16 {
        self.observations.iter().sum()
    }

    /// Returns the number of observations for a specific base
    ///
    /// # Arguments
    /// * `base` - The base to query (A, C, G, or T)
    ///
    /// Returns 0 for invalid bases.
    #[must_use]
    pub fn observations_for_base(&self, base: u8) -> u16 {
        let idx = BASE_TO_INDEX[base as usize];
        if idx == 255 { 0 } else { self.observations[idx as usize] }
    }

    /// Returns the observations for all bases as [A, C, G, T]
    #[must_use]
    pub fn all_observations(&self) -> [u16; DNA_BASE_COUNT] {
        self.observations
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_single_base_perfect() {
        let mut builder = ConsensusBaseBuilder::new(45, 40);

        // Add 10 perfect-quality A's
        for _ in 0..10 {
            builder.add(b'A', 40);
        }

        let (base, qual) = builder.call();
        assert_eq!(base, b'A');
        assert!(qual >= 40); // Should have high quality
        assert_eq!(builder.contributions(), 10);
    }

    #[test]
    fn test_mixed_bases() {
        let mut builder = ConsensusBaseBuilder::new(45, 40);

        // Add 8 A's and 2 C's
        for _ in 0..8 {
            builder.add(b'A', 30);
        }
        for _ in 0..2 {
            builder.add(b'C', 30);
        }

        let (base, _qual) = builder.call();
        assert_eq!(base, b'A'); // The majority should win
        assert_eq!(builder.contributions(), 10);
    }

    #[test]
    fn test_no_observations() {
        let builder = ConsensusBaseBuilder::new(45, 40);

        let (base, qual) = builder.call();
        assert_eq!(base, b'N');
        assert_eq!(qual, 2); // fgbio's PhredScore.MinValue
        assert_eq!(builder.contributions(), 0);
    }

    #[test]
    fn test_ignore_n_bases() {
        let mut builder = ConsensusBaseBuilder::new(45, 40);

        builder.add(b'A', 40);
        builder.add(b'N', 40);
        builder.add(b'A', 40);

        let (base, _qual) = builder.call();
        assert_eq!(base, b'A');
        assert_eq!(builder.contributions(), 2); // N should not be counted
    }

    #[test]
    fn test_case_insensitive() {
        let mut builder = ConsensusBaseBuilder::new(45, 40);

        builder.add(b'a', 40);
        builder.add(b'A', 40);
        builder.add(b'a', 40);

        let (base, _qual) = builder.call();
        assert_eq!(base, b'A');
        assert_eq!(builder.contributions(), 3);
    }

    #[test]
    fn test_reset() {
        let mut builder = ConsensusBaseBuilder::new(45, 40);

        builder.add(b'A', 40);
        builder.add(b'A', 40);
        assert_eq!(builder.contributions(), 2);

        builder.reset();
        assert_eq!(builder.contributions(), 0);

        builder.add(b'C', 40);
        let (base, _qual) = builder.call();
        assert_eq!(base, b'C');
    }

    #[test]
    fn test_observations_for_base() {
        let mut builder = ConsensusBaseBuilder::new(45, 40);

        builder.add(b'A', 40);
        builder.add(b'A', 40);
        builder.add(b'C', 40);
        builder.add(b'G', 40);

        assert_eq!(builder.observations_for_base(b'A'), 2);
        assert_eq!(builder.observations_for_base(b'C'), 1);
        assert_eq!(builder.observations_for_base(b'G'), 1);
        assert_eq!(builder.observations_for_base(b'T'), 0);
    }

    #[test]
    fn test_quality_variation() {
        let mut builder = ConsensusBaseBuilder::new(45, 40);

        // Add high-quality A's
        for _ in 0..5 {
            builder.add(b'A', 40);
        }

        // Add one low-quality C (should not outweigh the A's)
        builder.add(b'C', 10);

        let (base, _qual) = builder.call();
        assert_eq!(base, b'A');
    }

    #[test]
    fn test_tie_returns_no_call() {
        let mut builder = ConsensusBaseBuilder::new(45, 40);

        // It is hard to create a perfect tie, but we can test the no-call logic
        // by having very few observations.
        builder.add(b'A', 10);
        builder.add(b'C', 10);

        // With low quality and equal counts, we might get different results.
        // Just verify it doesn't crash and returns a valid base (A, C, or N).
        let (base, _qual) = builder.call();
        assert!(base == b'A' || base == b'C' || base == b'N');
    }

    // Port of fgbio test: "produce a no-call if two bases are of equal likelihood"
    #[test]
    fn test_equal_likelihood_produces_no_call() {
        // Use MAX_PHRED for the error rates to effectively disable error rate adjustment
        let mut builder = ConsensusBaseBuilder::new(MAX_PHRED, MAX_PHRED);

        // An empty builder should return a no-call with quality 2
        let (base, qual) = builder.call();
        assert_eq!(base, b'N');
        assert_eq!(qual, 2);

        // Equal evidence should also return a no-call
        builder.add(b'A', 20);
        builder.add(b'C', 20);
        let (base, qual) = builder.call();
        assert_eq!(base, b'N');
        assert_eq!(qual, 2);
    }

    // Port of fgbio test: "calculate consensus base and quality given a massive pileup"
    #[test]
    fn test_massive_pileup() {
        let mut builder = ConsensusBaseBuilder::new(50, 50);

        // Add 1000 Q20 C's
        for _ in 0..1000 {
            builder.add(b'C', 20);
        }

        let (base, qual) = builder.call();
        assert_eq!(base, b'C');
        assert_eq!(qual, 50); // Quality capped by the pre-UMI error rate
        assert_eq!(builder.contributions(), 1000);
        assert_eq!(builder.observations_for_base(b'A'), 0);
        assert_eq!(builder.observations_for_base(b'C'), 1000);
        assert_eq!(builder.observations_for_base(b'G'), 0);
        assert_eq!(builder.observations_for_base(b'T'), 0);

        // Add 10 T's - should still call C
        for _ in 0..10 {
            builder.add(b'T', 20);
        }

        let (base, qual) = builder.call();
        assert_eq!(base, b'C');
        assert_eq!(qual, 50);
        assert_eq!(builder.contributions(), 1010);
        assert_eq!(builder.observations_for_base(b'T'), 10);
    }

    // Port of fgbio test: "calculate consensus base and quality given conflicting evidence"
    #[test]
    fn test_conflicting_evidence() {
        let mut builder = ConsensusBaseBuilder::new(50, 50);

        builder.add(b'A', 30);
        builder.add(b'C', 28);

        let (base, qual) = builder.call();
        assert_eq!(base, b'A');
        assert!(qual <= 5, "Quality should be low due to conflicting evidence, got {qual}");
    }

    // Port of fgbio test: "support calling multiple pileups from the same builder"
    #[test]
    fn test_reset_and_reuse_fgbio() {
        let mut builder = ConsensusBaseBuilder::new(50, 50);

        builder.add(b'A', 20);
        let (base, qual) = builder.call();
        assert_eq!(base, b'A');
        assert_eq!(qual, 20);
        assert_eq!(builder.contributions(), 1);

        builder.reset();

        builder.add(b'C', 20);
        let (base, qual) = builder.call();
        assert_eq!(base, b'C');
        assert_eq!(qual, 20);
        assert_eq!(builder.contributions(), 1);
    }

    // Test that Kahan summation produces consistent results regardless of addition order.
    // This tests the numeric stability improvement from PR #1120.
    #[test]
    fn test_kahan_summation_order_independence() {
        // Create two builders with the same error rates
        let mut builder1 = ConsensusBaseBuilder::new(45, 40);
        let mut builder2 = ConsensusBaseBuilder::new(45, 40);

        // Add bases in different orders with varying qualities.
        // Builder 1: A's first, then C's
        for q in [10, 20, 30, 40, 50] {
            builder1.add(b'A', q);
        }
        for q in [15, 25, 35, 45] {
            builder1.add(b'C', q);
        }

        // Builder 2: interleaved
        builder2.add(b'A', 10);
        builder2.add(b'C', 15);
        builder2.add(b'A', 20);
        builder2.add(b'C', 25);
        builder2.add(b'A', 30);
        builder2.add(b'C', 35);
        builder2.add(b'A', 40);
        builder2.add(b'C', 45);
        builder2.add(b'A', 50);

        // Both should produce the same result
        let (base1, qual1) = builder1.call();
        let (base2, qual2) = builder2.call();

        assert_eq!(base1, base2, "Consensus base should be order-independent");
        assert_eq!(qual1, qual2, "Consensus quality should be order-independent");
        assert_eq!(base1, b'A', "Should call A (5 vs 4 observations)");
    }

    // Test that Kahan summation handles extreme quality differences without precision loss
    #[test]
    fn test_kahan_summation_extreme_quality_range() {
        let mut builder = ConsensusBaseBuilder::new(45, 40);

        // Add many low-quality bases followed by a few high-quality bases.
        // Q93 provides MUCH more evidence than Q2 (error rates: ~5e-10 vs ~0.63),
        // so 10 Q93 observations easily outweigh 100 Q2 observations.
        for _ in 0..100 {
            builder.add(b'A', 2); // Very low quality (error ~63%)
        }
        for _ in 0..10 {
            builder.add(b'C', 93); // Maximum quality (error ~5e-10)
        }

        // The high-quality C's should win - this tests that extreme values are handled correctly
        let (base, _qual) = builder.call();
        assert_eq!(base, b'C', "High-quality observations provide much stronger evidence");
    }

    // Port of fgbio test: "scale base qualities using the post-umi error rate"
    // Tests that ConsensusBaseBuilder properly adjusts qualities based on the post-UMI error
    // rate. With errorRatePostLabeling=10 (Q10), input qualities should be reduced:
    // Q20 → Q9, Q15 → Q8, Q10 → Q7, Q5 → Q4
    #[test]
    fn test_scale_base_qualities_post_umi_error_rate() {
        // errorRatePreLabeling=MAX_PHRED (no pre-UMI adjustment), errorRatePostLabeling=10.
        // This matches fgbio's test setup.
        let input_quals: [u8; 4] = [20, 15, 10, 5];
        let expected_output: [u8; 4] = [9, 8, 7, 4];

        for (input_q, expected_q) in input_quals.iter().zip(expected_output.iter()) {
            let mut builder = ConsensusBaseBuilder::new(MAX_PHRED, 10);
            builder.add(b'A', *input_q);
            let (_base, qual) = builder.call();

            // Allow for small differences due to implementation details,
            // but the general pattern should hold: qualities are reduced.
            assert!(
                qual <= *input_q,
                "Q{input_q} should be reduced by post-UMI error rate, got Q{qual}"
            );
            assert!(
                (i16::from(qual) - i16::from(*expected_q)).abs() <= 1,
                "Q{input_q} should become ~Q{expected_q}, got Q{qual}"
            );
        }
    }
}
789}