fgumi_consensus/
base_builder.rs

1//! # Consensus Base Calling
2//!
3//! This module provides the core likelihood-based algorithm for calling a consensus base at a
4//! single position from multiple observations. This is the fundamental building block used by
5//! all consensus callers to determine the most likely base and assign an accurate quality score.
6//!
7//! ## Overview
8//!
9//! Given multiple observations of a position (base + quality score pairs), the consensus base
10//! builder uses Bayesian inference to determine:
11//! 1. Which base (A, C, G, or T) is most likely to be the true base
12//! 2. What quality score to assign to that consensus call
13//!
14//! This differs from simple majority voting by:
15//! - Incorporating quality scores (low-quality observations contribute less)
16//! - Using a probabilistic model that can properly handle ambiguous cases
17//! - Generating quality scores that accurately reflect confidence in the call
18//!
19//! ## Likelihood-Based Model
20//!
21//! For each candidate base B ∈ {A, C, G, T}, we calculate the likelihood of observing all the
22//! data given that B is the true base:
23//!
24//! ```text
25//! L(B | observations) = ∏ P(obs_i | B)
26//! ```
27//!
28//! Where for each observation:
29//! - **If observed base matches B**: `P(obs | B) = 1 - error_rate`
30//! - **If observed base differs from B**: `P(obs | B) = error_rate / 3`
31//!
32//! The division by 3 assumes equal probability of the three incorrect bases.
33//!
34//! ## Quality Score Calculation
35//!
36//! After computing likelihoods for all four bases:
37//!
38//! 1. **Normalize** to get posterior probabilities:
39//!    ```text
40//!    P(B | data) = L(B) / (L(A) + L(C) + L(G) + L(T))
41//!    ```
42//!
43//! 2. **Select** the base with maximum posterior probability
44//!
45//! 3. **Calculate error probability**:
46//!    ```text
47//!    P_error = 1 - P(best_base | data)
48//!    ```
49//!
50//! 4. **Convert to Phred scale**:
51//!    ```text
52//!    Q = -10 * log10(P_error)
53//!    ```
54//!
55//! This Phred score represents the probability that the consensus call is wrong, properly
56//! accounting for all the evidence (both supporting and contradicting).
57//!
58//! ## Error Rate Priors
59//!
60//! The model incorporates two sources of error:
61//!
62//! ### Pre-UMI Error Rate
63//!
64//! Errors present in the original molecule before PCR amplification. These errors will appear
65//! in all reads from that molecule, so they cannot be corrected by consensus calling. This
66//! rate sets a floor on the achievable consensus quality.
67//!
68//! **Typical value**: Q45 (error rate ~3 × 10^-5)
69//!
70//! ### Post-UMI Error Rate
71//!
72//! Errors introduced during sequencing or PCR amplification. These are independent across reads
73//! and can be corrected by consensus calling.
74//!
75//! **Typical value**: Q40 (error rate ~1 × 10^-4)
76//!
77//! ### Combined Error Model
78//!
79//! For each observation with quality Q, the adjusted error probability accounts for both
80//! post-UMI errors (from sequencing) and the quality score:
81//!
82//! ```text
83//! P_adjusted_error = P_post_umi_error + (1 - P_post_umi_error) * P_sequencing_error
84//! ```
85//!
86//! This adjusted probability is used in the likelihood calculation.
87//!
88//! After calling the consensus, the pre-UMI error rate is incorporated:
89//! ```text
90//! P_final_error = P_pre_umi_error + (1 - P_pre_umi_error) * P_consensus_error
91//! ```
92//!
93//! ## Mathematical Details
94//!
95//! The implementation uses **log-space arithmetic** to avoid numerical underflow:
96//!
97//! - All probabilities are stored as natural logarithms
98//! - Products become sums: `log(a * b) = log(a) + log(b)`
99//! - Sums use log-sum-exp trick: `log(e^a + e^b) = log_sum_exp(a, b)`
100//!
101//! This allows accurate computation even when individual likelihoods are extremely small
102//! (e.g., 10^-100).
103//!
104//! ## Usage in Consensus Callers
105//!
106//! Consensus callers typically use `ConsensusBaseBuilder` as follows:
107//!
108//! ```rust,ignore
109//! use fgumi_lib::consensus::base_builder::ConsensusBaseBuilder;
110//!
111//! // Create builder with error rates
112//! let mut builder = ConsensusBaseBuilder::new(
113//!     45,  // Pre-UMI error rate (Q45)
114//!     40,  // Post-UMI error rate (Q40)
115//! );
116//!
117//! // For each position in the read:
118//! for position in 0..read_length {
119//!     builder.reset(); // Clear previous position
120//!
121//!     // Add observations from all reads
122//!     for read in &reads {
123//!         let base = read.sequence()[position];
124//!         let qual = read.quality_scores()[position];
125//!         builder.add(base, qual);
126//!     }
127//!
128//!     // Call consensus for this position
129//!     let (consensus_base, consensus_qual) = builder.call();
130//!
131//!     // Check depth
132//!     let depth = builder.contributions();
133//! }
134//! ```
135//!
136//! ## Handling Edge Cases
137//!
138//! ### No Observations
139//! Returns `(N, 0)` - no-call with minimum quality.
140//!
141//! ### N Bases
142//! N (no-call) bases in input reads are ignored. They don't contribute evidence for any
143//! particular base.
144//!
145//! ### Ties
146//! If multiple bases have exactly equal posterior probability, returns `(N, 0)` to indicate
147//! ambiguity. In practice, this is rare due to quality score differences.
148//!
149//! ### Low Depth
150//! With only 1-2 observations, the consensus quality will be similar to the input qualities.
151//! The model naturally handles this - more observations lead to higher confidence.
152//!
153//! ## Quality Score Bounds
154//!
155//! Phred scores are capped at Q93 (error rate ~5 × 10^-10):
156//! - **Minimum**: Q2 (error rate ~0.63)
157//! - **Maximum**: Q93 (error rate ~5 × 10^-10)
158//!
159//! These bounds ensure quality scores fit in standard BAM format and remain numerically stable.
160//!
161//! ## Performance Characteristics
162//!
163//! - **Time complexity**: O(n) where n = number of observations at a position
164//! - **Space complexity**: O(1) - fixed-size arrays for four bases
165//! - **Caching**: Pre-computes error tables for all possible quality scores (Q2-Q93) to avoid
166//!   repeated calculations
167//!
168//! ## Example Quality Progression
169//!
170//! Consider calling consensus from multiple reads observing 'A':
171//!
172//! | Reads | All Q30 'A' | Consensus Quality |
173//! |-------|-------------|-------------------|
174//! | 1     | A           | ~Q30              |
175//! | 2     | AA          | ~Q40              |
176//! | 5     | AAAAA       | ~Q50              |
177//! | 10    | AAAAAAAAAA  | ~Q60              |
178//! | 20    | (20x A)     | ~Q70              |
179//!
180//! The consensus quality increases logarithmically with depth, reflecting the compounding
181//! evidence that all observations support the same base.
182//!
183//! ## See Also
184//!
185//! - `vanilla_consensus_caller`: Uses this builder to call consensus across entire reads
186//! - `duplex_caller`: Uses this builder for single-strand consensus before duplex combination
187//! - `phred`: Utility functions for Phred score and probability conversions
188
189use crate::phred::{
190    LN_ONE, LogProbability, MAX_PHRED, MIN_PHRED, NO_CALL_BASE, PhredScore,
191    ln_error_prob_two_trials, ln_normalize, ln_not, ln_prob_to_phred, ln_sum_exp_array,
192    phred_to_ln_error_prob,
193};
194use approx::abs_diff_eq;
195use std::cmp::Ordering;
196use wide::f64x4;
197
198/// The four DNA bases in the order used throughout consensus calling
199const DNA_BASES: [u8; 4] = [b'A', b'C', b'G', b'T'];
200const DNA_BASE_COUNT: usize = 4;
201
202/// Lookup table for converting ASCII base to index (0-3 for A,C,G,T, 255 for invalid)
203/// This avoids a loop+compare for every base observation
204const BASE_TO_INDEX: [u8; 256] = {
205    let mut table = [255u8; 256];
206    table[b'A' as usize] = 0;
207    table[b'a' as usize] = 0;
208    table[b'C' as usize] = 1;
209    table[b'c' as usize] = 1;
210    table[b'G' as usize] = 2;
211    table[b'g' as usize] = 2;
212    table[b'T' as usize] = 3;
213    table[b't' as usize] = 3;
214    table
215};
216
217/// Builder for calling consensus at a single base position
218///
219/// Accumulates observations (base + quality) and uses a likelihood model to
220/// call the consensus base and quality.
221///
222/// Uses Kahan summation algorithm with SIMD vectorization to maintain numeric stability
223/// when accumulating log-likelihoods across many observations. The 4 DNA bases map
224/// perfectly to f64x4 (256-bit AVX2), enabling parallel computation of all 4 likelihoods.
225pub struct ConsensusBaseBuilder {
226    /// Log-likelihoods for each of the four bases (A, C, G, T) - SIMD vectorized
227    likelihoods: f64x4,
228
229    /// Kahan summation compensation terms for each base - SIMD vectorized
230    compensations: f64x4,
231
232    /// Count of observations for each base
233    observations: [u16; DNA_BASE_COUNT],
234
235    /// Pre-computed correct probabilities adjusted for post-UMI errors
236    adjusted_correct_table: Vec<LogProbability>,
237
238    /// Pre-computed 1/3 of error probability (for wrong base likelihoods)
239    adjusted_error_per_alt: Vec<LogProbability>,
240
241    /// Pre-UMI error rate (log probability)
242    ln_error_pre_umi: LogProbability,
243}
244
245impl ConsensusBaseBuilder {
246    /// Creates a new consensus base builder
247    ///
248    /// # Arguments
249    /// * `error_rate_pre_umi` - Phred-scaled error rate prior to UMI integration
250    /// * `error_rate_post_umi` - Phred-scaled error rate after UMI integration
251    #[must_use]
252    pub fn new(error_rate_pre_umi: PhredScore, error_rate_post_umi: PhredScore) -> Self {
253        let ln_error_post = phred_to_ln_error_prob(error_rate_post_umi);
254
255        // Pre-compute adjusted probabilities for all possible quality scores
256        // Must start from 0 since we index by phred score directly
257        let mut adjusted_correct_table = Vec::with_capacity(MAX_PHRED as usize + 1);
258        let mut adjusted_error_per_alt = Vec::with_capacity(MAX_PHRED as usize + 1);
259
260        for q in 0..=MAX_PHRED {
261            let ln_error_seq = phred_to_ln_error_prob(q);
262            let adjusted_error = ln_error_prob_two_trials(ln_error_post, ln_error_seq);
263
264            adjusted_correct_table.push(ln_not(adjusted_error));
265            // Error for any specific alternate base is 1/3 of total error
266            // ln(p/3) = ln(p) - ln(3)
267            adjusted_error_per_alt.push(adjusted_error - 3.0_f64.ln());
268        }
269
270        Self {
271            likelihoods: f64x4::splat(LN_ONE),
272            compensations: f64x4::splat(0.0),
273            observations: [0; DNA_BASE_COUNT],
274            adjusted_correct_table,
275            adjusted_error_per_alt,
276            ln_error_pre_umi: phred_to_ln_error_prob(error_rate_pre_umi),
277        }
278    }
279
280    /// Resets the builder to process a new position
281    pub fn reset(&mut self) {
282        self.likelihoods = f64x4::splat(LN_ONE);
283        self.compensations = f64x4::splat(0.0);
284        self.observations.fill(0);
285    }
286
287    /// Adds an observation (base + quality) to the consensus
288    ///
289    /// Uses SIMD-vectorized Kahan summation for numeric stability when accumulating
290    /// log-likelihoods. All 4 base likelihoods are updated in parallel using f64x4.
291    ///
292    /// # Arguments
293    /// * `base` - The observed base (A, C, G, T, or N)
294    /// * `qual` - The base quality score
295    pub fn add(&mut self, base: u8, qual: PhredScore) {
296        // Use lookup table to convert base to index (handles both upper and lower case)
297        let matching_idx = BASE_TO_INDEX[base as usize];
298
299        // Ignore N bases and other invalid bases (index 255)
300        if matching_idx == 255 {
301            return;
302        }
303
304        let matching_idx = matching_idx as usize;
305
306        // Get adjusted probabilities for this quality
307        let qual_idx = qual.min(MAX_PHRED) as usize;
308        let ln_correct = self.adjusted_correct_table[qual_idx];
309        let ln_error_per_base = self.adjusted_error_per_alt[qual_idx];
310
311        // Build values array: error for all bases except ln_correct at matching position
312        let mut values = [ln_error_per_base; 4];
313        values[matching_idx] = ln_correct;
314        let values = f64x4::from(values);
315
316        // SIMD-vectorized Kahan summation for numeric stability
317        // y = value - compensation (recovers the lost low-order bits)
318        let y = values - self.compensations;
319        // t = sum + y (the new sum, but low-order bits of y may be lost)
320        let t = self.likelihoods + y;
321        // compensation = (t - sum) - y (algebraically zero, but captures lost bits)
322        self.compensations = (t - self.likelihoods) - y;
323        // Update the sum
324        self.likelihoods = t;
325
326        self.observations[matching_idx] += 1;
327    }
328
329    /// Fast path for unanimous consensus when only one base is observed.
330    ///
331    /// Returns `Some((base, quality))` if all observations are the same base,
332    /// `None` otherwise (requiring full probabilistic consensus).
333    ///
334    /// This optimization avoids the expensive log-sum-exp calculation when
335    /// all reads agree and the likelihood difference is large enough to
336    /// guarantee high quality output.
337    #[inline]
338    fn try_unanimous_fast_path(&self) -> Option<(u8, PhredScore)> {
339        // Count how many bases have observations
340        let mut observed_base_idx = None;
341        let mut num_bases_observed = 0;
342
343        for (i, &count) in self.observations.iter().enumerate() {
344            if count > 0 {
345                num_bases_observed += 1;
346                observed_base_idx = Some(i);
347                if num_bases_observed > 1 {
348                    // Multiple bases observed - need full consensus
349                    return None;
350                }
351            }
352        }
353
354        // Unanimous: only one base observed
355        let base_idx = observed_base_idx?;
356
357        // Check if the likelihood difference is large enough to skip full calculation.
358        // When ln(L_winner) - ln(L_loser) > threshold, the posterior is essentially 1.
359        // threshold = 23 corresponds to likelihood ratio > 10^10, giving Q > 90
360        #[expect(
361            clippy::items_after_statements,
362            reason = "const is logically scoped to this fast-path check"
363        )]
364        const FAST_PATH_THRESHOLD: f64 = 23.0;
365
366        let likelihoods = self.likelihoods.as_array();
367        let winner_ll = likelihoods[base_idx];
368        let loser_ll = likelihoods[(base_idx + 1) % 4]; // Any loser will do (all same)
369
370        if winner_ll - loser_ll > FAST_PATH_THRESHOLD {
371            // Very high confidence - return quality capped by pre-UMI error rate
372            // The pre-UMI error rate limits the maximum achievable quality
373            let max_qual = ln_prob_to_phred(self.ln_error_pre_umi);
374            return Some((DNA_BASES[base_idx], max_qual));
375        }
376
377        // Likelihood difference not large enough - fall through to full calculation
378        None
379    }
380
381    /// Calls the consensus base and quality
382    ///
383    /// Returns (`consensus_base`, `consensus_quality`)
384    /// Returns (N, 0) if no observations or multiple equally likely bases
385    ///
386    /// # Panics
387    ///
388    /// Panics if there are observations but no maximum likelihood base is found
389    /// (should not occur in practice since we check for ties).
390    #[must_use]
391    pub fn call(&self) -> (u8, PhredScore) {
392        if self.contributions() == 0 {
393            return (NO_CALL_BASE, MIN_PHRED);
394        }
395
396        // Fast path: check for unanimous consensus (only one base observed)
397        // This is common in high-quality data and avoids expensive log-sum-exp
398        if let Some((base, qual)) = self.try_unanimous_fast_path() {
399            return (base, qual);
400        }
401
402        // Extract likelihoods as array for iteration and function calls
403        let likelihoods = self.likelihoods.as_array();
404
405        // Compute the sum of likelihoods for normalization
406        let ln_sum = ln_sum_exp_array(likelihoods);
407
408        // Find the maximum likelihood and check if it's unique
409        let mut max_likelihood = f64::NEG_INFINITY;
410        let mut max_index = None;
411        let mut tie = false;
412
413        for (i, &ll) in likelihoods.iter().enumerate() {
414            match ll.partial_cmp(&max_likelihood) {
415                Some(Ordering::Greater) => {
416                    max_likelihood = ll;
417                    max_index = Some(i);
418                    tie = false;
419                }
420                Some(Ordering::Equal) => {
421                    tie = true;
422                }
423                Some(Ordering::Less) => {
424                    // Check for epsilon-level tie (within machine precision)
425                    if abs_diff_eq!(ll, max_likelihood, epsilon = f64::EPSILON) {
426                        tie = true;
427                    }
428                }
429                None => {}
430            }
431        }
432
433        // If there's a tie, return no-call
434        if tie || max_index.is_none() {
435            return (NO_CALL_BASE, MIN_PHRED);
436        }
437
438        let max_idx = max_index.unwrap();
439        let consensus_base = DNA_BASES[max_idx];
440
441        // Calculate posterior probability
442        let ln_posterior = ln_normalize(max_likelihood, ln_sum);
443
444        // Convert to error probability
445        let ln_consensus_error = ln_not(ln_posterior);
446
447        // Factor in pre-UMI error rate
448        // P(final_error) = P(pre_UMI_error AND correct_consensus)
449        //                  + P(no_pre_UMI_error AND consensus_error)
450        //                  + P(pre_UMI_error AND consensus_error that don't cancel)
451        // Approximation: just combine the two error rates
452        let ln_final_error = ln_error_prob_two_trials(self.ln_error_pre_umi, ln_consensus_error);
453
454        // Convert to Phred score and cap at maximum
455        let qual = ln_prob_to_phred(ln_final_error);
456
457        (consensus_base, qual)
458    }
459
460    /// Returns the total number of contributing observations
461    ///
462    /// This is the sum of observations across all four bases
463    #[must_use]
464    pub fn contributions(&self) -> u16 {
465        self.observations.iter().sum()
466    }
467
468    /// Returns the number of observations for a specific base
469    ///
470    /// # Arguments
471    /// * `base` - The base to query (A, C, G, or T)
472    ///
473    /// Returns 0 for invalid bases
474    #[must_use]
475    pub fn observations_for_base(&self, base: u8) -> u16 {
476        let idx = BASE_TO_INDEX[base as usize];
477        if idx == 255 { 0 } else { self.observations[idx as usize] }
478    }
479
480    /// Returns the observations for all bases as [A, C, G, T]
481    #[must_use]
482    pub fn all_observations(&self) -> [u16; DNA_BASE_COUNT] {
483        self.observations
484    }
485}
486
487#[cfg(test)]
488mod tests {
489    use super::*;
490
491    #[test]
492    fn test_single_base_perfect() {
493        let mut builder = ConsensusBaseBuilder::new(45, 40);
494
495        // Add 10 perfect quality A's
496        for _ in 0..10 {
497            builder.add(b'A', 40);
498        }
499
500        let (base, qual) = builder.call();
501        assert_eq!(base, b'A');
502        assert!(qual >= 40); // Should have high quality
503        assert_eq!(builder.contributions(), 10);
504    }
505
506    #[test]
507    fn test_mixed_bases() {
508        let mut builder = ConsensusBaseBuilder::new(45, 40);
509
510        // Add 8 A's and 2 C's
511        for _ in 0..8 {
512            builder.add(b'A', 30);
513        }
514        for _ in 0..2 {
515            builder.add(b'C', 30);
516        }
517
518        let (base, _qual) = builder.call();
519        assert_eq!(base, b'A'); // Majority should win
520        assert_eq!(builder.contributions(), 10);
521    }
522
523    #[test]
524    fn test_no_observations() {
525        let builder = ConsensusBaseBuilder::new(45, 40);
526
527        let (base, qual) = builder.call();
528        assert_eq!(base, b'N');
529        assert_eq!(qual, 2); // fgbio's PhredScore.MinValue
530        assert_eq!(builder.contributions(), 0);
531    }
532
533    #[test]
534    fn test_ignore_n_bases() {
535        let mut builder = ConsensusBaseBuilder::new(45, 40);
536
537        builder.add(b'A', 40);
538        builder.add(b'N', 40);
539        builder.add(b'A', 40);
540
541        let (base, _qual) = builder.call();
542        assert_eq!(base, b'A');
543        assert_eq!(builder.contributions(), 2); // N should not be counted
544    }
545
546    #[test]
547    fn test_case_insensitive() {
548        let mut builder = ConsensusBaseBuilder::new(45, 40);
549
550        builder.add(b'a', 40);
551        builder.add(b'A', 40);
552        builder.add(b'a', 40);
553
554        let (base, _qual) = builder.call();
555        assert_eq!(base, b'A');
556        assert_eq!(builder.contributions(), 3);
557    }
558
559    #[test]
560    fn test_reset() {
561        let mut builder = ConsensusBaseBuilder::new(45, 40);
562
563        builder.add(b'A', 40);
564        builder.add(b'A', 40);
565        assert_eq!(builder.contributions(), 2);
566
567        builder.reset();
568        assert_eq!(builder.contributions(), 0);
569
570        builder.add(b'C', 40);
571        let (base, _qual) = builder.call();
572        assert_eq!(base, b'C');
573    }
574
575    #[test]
576    fn test_observations_for_base() {
577        let mut builder = ConsensusBaseBuilder::new(45, 40);
578
579        builder.add(b'A', 40);
580        builder.add(b'A', 40);
581        builder.add(b'C', 40);
582        builder.add(b'G', 40);
583
584        assert_eq!(builder.observations_for_base(b'A'), 2);
585        assert_eq!(builder.observations_for_base(b'C'), 1);
586        assert_eq!(builder.observations_for_base(b'G'), 1);
587        assert_eq!(builder.observations_for_base(b'T'), 0);
588    }
589
590    #[test]
591    fn test_quality_variation() {
592        let mut builder = ConsensusBaseBuilder::new(45, 40);
593
594        // Add high quality A's
595        for _ in 0..5 {
596            builder.add(b'A', 40);
597        }
598
599        // Add one low quality C (should not outweigh the A's)
600        builder.add(b'C', 10);
601
602        let (base, _qual) = builder.call();
603        assert_eq!(base, b'A');
604    }
605
606    #[test]
607    fn test_tie_returns_no_call() {
608        let mut builder = ConsensusBaseBuilder::new(45, 40);
609
610        // This is hard to create a perfect tie, but we can test the no-call logic
611        // by having very few observations
612        builder.add(b'A', 10);
613        builder.add(b'C', 10);
614
615        // With low quality and equal counts, might get different results
616        // Just verify it doesn't crash and returns a valid base (A, C, or N)
617        let (base, _qual) = builder.call();
618        assert!(base == b'A' || base == b'C' || base == b'N');
619    }
620
621    // Port of fgbio test: "produce a no-call if two bases are of equal likelihood"
622    #[test]
623    fn test_equal_likelihood_produces_no_call() {
624        // Use MAX_PHRED for error rates to effectively disable error rate adjustment
625        let mut builder = ConsensusBaseBuilder::new(MAX_PHRED, MAX_PHRED);
626
627        // Empty builder should return no-call with quality 2
628        let (base, qual) = builder.call();
629        assert_eq!(base, b'N');
630        assert_eq!(qual, 2);
631
632        // Equal evidence should also return no-call
633        builder.add(b'A', 20);
634        builder.add(b'C', 20);
635        let (base, qual) = builder.call();
636        assert_eq!(base, b'N');
637        assert_eq!(qual, 2);
638    }
639
640    // Port of fgbio test: "calculate consensus base and quality given a massive pileup"
641    #[test]
642    fn test_massive_pileup() {
643        let mut builder = ConsensusBaseBuilder::new(50, 50);
644
645        // Add 1000 Q20 C's
646        for _ in 0..1000 {
647            builder.add(b'C', 20);
648        }
649
650        let (base, qual) = builder.call();
651        assert_eq!(base, b'C');
652        assert_eq!(qual, 50); // Quality capped by pre-UMI error rate
653        assert_eq!(builder.contributions(), 1000);
654        assert_eq!(builder.observations_for_base(b'A'), 0);
655        assert_eq!(builder.observations_for_base(b'C'), 1000);
656        assert_eq!(builder.observations_for_base(b'G'), 0);
657        assert_eq!(builder.observations_for_base(b'T'), 0);
658
659        // Add 10 T's - should still call C
660        for _ in 0..10 {
661            builder.add(b'T', 20);
662        }
663
664        let (base, qual) = builder.call();
665        assert_eq!(base, b'C');
666        assert_eq!(qual, 50);
667        assert_eq!(builder.contributions(), 1010);
668        assert_eq!(builder.observations_for_base(b'T'), 10);
669    }
670
671    // Port of fgbio test: "calculate consensus base and quality given conflicting evidence"
672    #[test]
673    fn test_conflicting_evidence() {
674        let mut builder = ConsensusBaseBuilder::new(50, 50);
675
676        builder.add(b'A', 30);
677        builder.add(b'C', 28);
678
679        let (base, qual) = builder.call();
680        assert_eq!(base, b'A');
681        assert!(qual <= 5, "Quality should be low due to conflicting evidence, got {qual}");
682    }
683
684    // Port of fgbio test: "support calling multiple pileups from the same builder"
685    #[test]
686    fn test_reset_and_reuse_fgbio() {
687        let mut builder = ConsensusBaseBuilder::new(50, 50);
688
689        builder.add(b'A', 20);
690        let (base, qual) = builder.call();
691        assert_eq!(base, b'A');
692        assert_eq!(qual, 20);
693        assert_eq!(builder.contributions(), 1);
694
695        builder.reset();
696
697        builder.add(b'C', 20);
698        let (base, qual) = builder.call();
699        assert_eq!(base, b'C');
700        assert_eq!(qual, 20);
701        assert_eq!(builder.contributions(), 1);
702    }
703
704    // Test that Kahan summation produces consistent results regardless of addition order
705    // This tests the numeric stability improvement from PR #1120
706    #[test]
707    fn test_kahan_summation_order_independence() {
708        // Create two builders with the same error rates
709        let mut builder1 = ConsensusBaseBuilder::new(45, 40);
710        let mut builder2 = ConsensusBaseBuilder::new(45, 40);
711
712        // Add bases in different orders with varying qualities
713        // Builder 1: A's first, then C's
714        for q in [10, 20, 30, 40, 50] {
715            builder1.add(b'A', q);
716        }
717        for q in [15, 25, 35, 45] {
718            builder1.add(b'C', q);
719        }
720
721        // Builder 2: Interleaved
722        builder2.add(b'A', 10);
723        builder2.add(b'C', 15);
724        builder2.add(b'A', 20);
725        builder2.add(b'C', 25);
726        builder2.add(b'A', 30);
727        builder2.add(b'C', 35);
728        builder2.add(b'A', 40);
729        builder2.add(b'C', 45);
730        builder2.add(b'A', 50);
731
732        // Both should produce the same result
733        let (base1, qual1) = builder1.call();
734        let (base2, qual2) = builder2.call();
735
736        assert_eq!(base1, base2, "Consensus base should be order-independent");
737        assert_eq!(qual1, qual2, "Consensus quality should be order-independent");
738        assert_eq!(base1, b'A', "Should call A (5 vs 4 observations)");
739    }
740
741    // Test that Kahan summation handles extreme quality differences without precision loss
742    #[test]
743    fn test_kahan_summation_extreme_quality_range() {
744        let mut builder = ConsensusBaseBuilder::new(45, 40);
745
746        // Add many low quality bases followed by a few high quality bases
747        // Q93 provides MUCH more evidence than Q2 (error rates: ~5e-10 vs ~0.63)
748        // So 10 Q93 observations easily outweigh 100 Q2 observations
749        for _ in 0..100 {
750            builder.add(b'A', 2); // Very low quality (error ~63%)
751        }
752        for _ in 0..10 {
753            builder.add(b'C', 93); // Maximum quality (error ~5e-10)
754        }
755
756        // High quality C's should win - this tests that extreme values are handled correctly
757        let (base, _qual) = builder.call();
758        assert_eq!(base, b'C', "High-quality observations provide much stronger evidence");
759    }
760
761    // Port of fgbio test: "scale base qualities using the post-umi error rate"
762    // Tests that ConsensusBaseBuilder properly adjusts qualities based on post-UMI error rate.
763    // With errorRatePostLabeling=10 (Q10), input qualities should be reduced:
764    // Q20 → Q9, Q15 → Q8, Q10 → Q7, Q5 → Q4
765    #[test]
766    fn test_scale_base_qualities_post_umi_error_rate() {
767        // errorRatePreLabeling=MAX_PHRED (no pre-UMI adjustment), errorRatePostLabeling=10
768        // This matches fgbio's test setup
769        let input_quals: [u8; 4] = [20, 15, 10, 5];
770        let expected_output: [u8; 4] = [9, 8, 7, 4];
771
772        for (input_q, expected_q) in input_quals.iter().zip(expected_output.iter()) {
773            let mut builder = ConsensusBaseBuilder::new(MAX_PHRED, 10);
774            builder.add(b'A', *input_q);
775            let (_base, qual) = builder.call();
776
777            // Allow for small differences due to implementation details
778            // but the general pattern should hold: qualities are reduced
779            assert!(
780                qual <= *input_q,
781                "Q{input_q} should be reduced by post-UMI error rate, got Q{qual}"
782            );
783            assert!(
784                (i16::from(qual) - i16::from(*expected_q)).abs() <= 1,
785                "Q{input_q} should become ~Q{expected_q}, got Q{qual}"
786            );
787        }
788    }
789}
fgumi_consensus/base_builder.rs

fgumi_consensus/
base_builder.rs