bbnorm-rs 0.1.3

Rust implementation of BBTools BBNorm-style read depth normalization

bbnorm-rs

bbnorm-rs is a Rust implementation of the practical BBNorm read-depth normalization workflow from BBTools. It focuses on local FASTA/FASTQ normalization, histogram output, paired/interleaved routing, bounded memory counting, and reproducible Java-parity behavior for covered modes.

This is an early working release, not a complete BBTools replacement. The Git repository includes a vendored BBTools snapshot for parity tests; crates.io packages intentionally exclude that snapshot to keep the package small.

Install

From crates.io, once published:

cargo install bbnorm-rs

From a checkout:

cargo install --path .

Basic Usage

bbnorm-rs in=reads_R1.fq.gz in2=reads_R2.fq.gz \
  out=normalized_R1.fq.gz out2=normalized_R2.fq.gz \
  target=40 max=80 min=5 k=31 threads=8

Common outputs:

bbnorm-rs in=reads.fq.gz out=keep.fq.gz outt=toss.fq.gz \
  hist=depth.tsv rhist=read_depth.tsv peaksout=peaks.tsv

Supported inputs and outputs include plain and gzip FASTA/FASTQ, single-end, paired two-file, explicit interleaved, auto-interleaved, BBTools-style # paired filename expansion, null output sinks, and common BBNorm-style key=value aliases.
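
As a minimal sketch of the # paired filename expansion mentioned above, assuming the BBTools convention that a # in the path is replaced by 1 and 2 to name the two mate files. The function name expand_pair is hypothetical and is not taken from the bbnorm-rs source.

```rust
/// Expand a BBTools-style `#` placeholder into a (read 1, read 2) path pair.
/// Returns None when the path has no `#`, i.e. it is not a paired template.
/// Assumption: only the first `#` is treated as the placeholder.
fn expand_pair(path: &str) -> Option<(String, String)> {
    if path.contains('#') {
        Some((path.replacen('#', "1", 1), path.replacen('#', "2", 1)))
    } else {
        None
    }
}

fn main() {
    match expand_pair("reads_R#.fq.gz") {
        Some((r1, r2)) => println!("in={r1} in2={r2}"),
        None => println!("not a paired template"),
    }
}
```

Under that assumption, in=reads_R#.fq.gz would be equivalent to in=reads_R1.fq.gz in2=reads_R2.fq.gz.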

Current Status

Verified locally:

  • cargo fmt --all -- --check
  • cargo clippy --all-targets --all-features -- -D warnings
  • cargo test --all

The current suite comprises 242 library tests, 8 basic integration tests, and 108 Java-parity tests that run against the vendored BBTools snapshot.

Implemented working areas include:

  • BBNorm-style CLI parsing for common normalization options and aliases.
  • Plain and gzip FASTA/FASTQ I/O.
  • Single-end, paired, and interleaved read routing.
  • Short canonical k-mers and long BBTools-shaped hashed k-mers.
  • Exact and bounded count-min counting paths.
  • Depth histograms, read-depth histograms, and peak output.
  • Deterministic normalization decisions for covered modes.
  • Multipass, count-up, and ECC behavior for tested subsets.
  • Guarded benchmark and parity harnesses in the source repository.
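
For readers unfamiliar with short canonical k-mers, the following is a minimal sketch of the general technique, not the crate's implementation. It assumes a 2-bit encoding (A=0, C=1, G=2, T=3), k of at most 31 so a k-mer fits in a u64, and the numeric minimum of the forward and reverse-complement encodings as the canonical form; the choice of minimum rather than maximum is an assumption. All names are hypothetical.

```rust
/// Encode one base as 2 bits (A=0, C=1, G=2, T=3); None for N or other symbols.
fn encode_base(b: u8) -> Option<u64> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Return the canonical encoding of every k-mer in `seq`, in order.
/// Windows that would span a non-ACGT base are skipped.
fn canonical_kmers(seq: &[u8], k: usize) -> Vec<u64> {
    assert!((1..=31).contains(&k), "this sketch only handles 1 <= k <= 31");
    let mask = (1u64 << (2 * k)) - 1;
    let shift = 2 * (k - 1);
    let (mut fwd, mut rev, mut len) = (0u64, 0u64, 0usize);
    let mut out = Vec::new();
    for &b in seq {
        match encode_base(b) {
            Some(code) => {
                // Append the base on the right of the forward k-mer.
                fwd = ((fwd << 2) | code) & mask;
                // Prepend its complement (3 - code) on the left of the reverse k-mer.
                rev = (rev >> 2) | ((3 - code) << shift);
                len += 1;
            }
            None => {
                // An ambiguous base breaks the run of valid bases.
                fwd = 0;
                rev = 0;
                len = 0;
            }
        }
        if len >= k {
            out.push(fwd.min(rev));
        }
    }
    out
}

fn main() {
    // ACG and its reverse complement CGT map to the same canonical value.
    println!("{:?}", canonical_kmers(b"ACG", 3));
    println!("{:?}", canonical_kmers(b"CGT", 3));
}
```

Because a sequence and its reverse complement produce the same value, counts from both strands of a read accumulate in one table entry.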

Known gaps remain:

  • Full BBTools sketch, prefilter, and cardinality/loglog collision parity is not complete.
  • ECC and overlap behavior is covered by compact and biological stress tests, but not every BBMerge/BBNorm edge case.
  • Large human-read benchmarks show that deterministic bounded counting has improved and that peak memory is at or below Java's, but input counting remains the main speed bottleneck in some comparable modes.

See docs/parity.md and docs/component_buildout.md for the detailed ledger. The acceptance matrix in docs/parity_matrix.md is the current front-door workflow for deciding whether a mode is exact parity, bounded approximate parity, accepted Rust-over-Java divergence, or still a gap.

Benchmark Snapshot

The acceptance matrix is the publishable benchmark source of truth. The latest local matrix run at tmp/parity_acceptance_publish_ready_20260508/acceptance_summary.tsv verified 9 bundled phiX exact-output modes and 6 local human bounded-sketch modes.

Exact bundled rows:

  • default
  • k=40
  • k=40 fixspikes=t
  • passes=2
  • keepall=t
  • ecc=t markuncorrectableerrors=t
  • qtrim=r trimq=10
  • minlen=100
  • passes=2 ecc=t markuncorrectableerrors=t

Local human bounded-sketch rows:

Row   Mode           Verdict        Java time  Rust time  Java RSS  Rust RSS  Hist drift (ppm)  Rhist drift (ppm)
50k   default        bounded_drift  1.54s      2.57s      3.35 GiB  3.30 GiB  4                 840
50k   prefilter      bounded_drift  2.05s      2.68s      3.35 GiB  3.10 GiB  4                 840
50k   k40_fixspikes  bounded_drift  1.64s      2.47s      3.34 GiB  3.12 GiB  2                 140
500k  default        bounded_drift  9.03s      30.12s     3.38 GiB  3.38 GiB  1227              1492
500k  prefilter      bounded_drift  10.53s     31.81s     3.26 GiB  3.14 GiB  49                998
500k  k40_fixspikes  bounded_drift  10.82s     28.60s     3.39 GiB  3.25 GiB  245               982

The conservative publishable claim is: exact covered fixture modes match the vendored Java oracle byte-for-byte, bounded human rows stay within the matrix drift gate, and large-slice Rust speed still needs work in the input-counting hot path. countup=t is tracked separately as an accepted Rust-over-Java divergence guard rather than normal Java parity.

For high-throughput bounded approximate runs where speed matters more than byte-stable collision order, deterministic=f enables direct atomic packed sketch updates and fuses input histogram collection into the normalization pass. On the local 500k paired-human packed 16-bit lane at tmp/fastlane_atomic_packed_fusedhist_500k_compare_20260515_115057, the 3-repeat median was 6.769s / 2.79 GiB RSS for Rust versus 7.814s / 3.39 GiB RSS for Java: 13.4% faster wall time and 17.8% lower peak RSS. Both runs used the same input, read limits, bits=16, null read outputs, and hist and rhist outputs.

For repeatable current baselines, use scripts/benchmark_trustworthy_baseline.py. It records git/tool/input metadata, command lines, raw run data, stage timings, Java/Rust histogram drift, and aggregate p10/median/p90 summaries. See docs/trustworthy_benchmarking.md for the Java-default and packed 16-bit benchmark lanes.
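
To make the p10/median/p90 summaries concrete, here is a minimal sketch of one way to aggregate repeat timings. It assumes the nearest-rank percentile definition; the method used by scripts/benchmark_trustworthy_baseline.py is not stated here, and the timings below are made-up illustration values, not measurements.

```rust
/// Nearest-rank percentile of an ascending-sorted, non-empty slice.
/// Assumption: nearest-rank; other definitions interpolate between samples.
fn percentile(sorted: &[f64], p: f64) -> f64 {
    assert!(!sorted.is_empty(), "need at least one sample");
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Sort a copy of the samples and return (p10, median, p90).
fn summarize(samples: &[f64]) -> (f64, f64, f64) {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    (
        percentile(&sorted, 10.0),
        percentile(&sorted, 50.0),
        percentile(&sorted, 90.0),
    )
}

fn main() {
    // Hypothetical wall times in seconds for three repeats.
    let wall_seconds = [7.1, 6.8, 6.9];
    let (p10, median, p90) = summarize(&wall_seconds);
    println!("p10={p10}s median={median}s p90={p90}s");
}
```

With only three repeats, nearest-rank p10 and p90 are simply the fastest and slowest runs, so the spread is worth reading alongside the median.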

The v0.1.3 performance patch adds the high-throughput bounded approximate fast lane above. The v0.1.2 performance patch improved deterministic packed bounded counting on the local 500k paired-human packed 16-bit lane. A final 3-repeat refresh at tmp/trustworthy_baseline_500k_bits16_final_20260508 measured the current deterministic Rust median at 19.786s wall time, 14.726s input counting, and 3.04 GiB RSS; the Java median for the same lane was 8.387s wall time and 3.41 GiB RSS.

Variant                            Before    After     Change
Rust deterministic wall time       24.104s   19.786s   17.9% faster
Rust deterministic input counting  19.038s   14.726s   22.7% faster
Rust deterministic max RSS         3.41 GiB  3.04 GiB  10.9% lower

Those measurements compare tmp/trustworthy_baseline_500k_bits16_20260508 against tmp/trustworthy_baseline_500k_bits16_final_20260508, with 3 repeats per variant, null read outputs, bits=16, reads=500000, and tablereads=500000.

Experimental GPU counting is documented in docs/gpu_counting_integration.md. The parity-safe GPU path must preserve deterministic chunk replay order; a naive global GPU reduction looks faster but is semantically wrong for conservative count-min updates, whose result depends on the order in which updates are applied. The current persistent CUDA helper is byte-identical to Rust CPU on the tested lanes but remains slower than the CPU path, so it is kept behind the explicit gpucounting=t gpupersistent=t flags.
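
The order sensitivity of conservative count-min updates can be seen in a toy sketch. This is an illustration of the general technique under stated assumptions, not the bbnorm-rs or BBTools table layout: the table has two rows of two counters, and each key is written directly as its pair of slot indices instead of being hashed. All names are hypothetical.

```rust
/// A key is modeled as its (row 0 slot, row 1 slot) pair, standing in for
/// two hash functions. Hypothetical, for illustration only.
type Key = (usize, usize);

const A: Key = (0, 0);
const B: Key = (0, 1); // collides with A in row 0
const C: Key = (1, 0); // collides with A in row 1

#[derive(Clone, Debug, Default, PartialEq, Eq)]
struct ToySketch {
    rows: [[u32; 2]; 2],
}

impl ToySketch {
    /// Count-min estimate: the smallest counter the key maps to.
    fn estimate(&self, key: Key) -> u32 {
        self.rows[0][key.0].min(self.rows[1][key.1])
    }

    /// Plain update: increment every counter the key maps to.
    fn add_plain(&mut self, key: Key) {
        self.rows[0][key.0] += 1;
        self.rows[1][key.1] += 1;
    }

    /// Conservative update: raise each counter only as far as estimate + 1.
    fn add_conservative(&mut self, key: Key) {
        let target = self.estimate(key) + 1;
        self.rows[0][key.0] = self.rows[0][key.0].max(target);
        self.rows[1][key.1] = self.rows[1][key.1].max(target);
    }
}

/// Apply the keys in the given order to an empty sketch.
fn replay(keys: &[Key], conservative: bool) -> ToySketch {
    let mut sketch = ToySketch::default();
    for &key in keys {
        if conservative {
            sketch.add_conservative(key);
        } else {
            sketch.add_plain(key);
        }
    }
    sketch
}

fn main() {
    // Same multiset of keys, two different orders.
    println!("conservative A,B,C: {:?}", replay(&[A, B, C], true).rows);
    println!("conservative B,C,A: {:?}", replay(&[B, C, A], true).rows);
    println!(
        "plain orders agree: {}",
        replay(&[A, B, C], false) == replay(&[B, C, A], false)
    );
}
```

Plain increments commute, so any reduction order yields the same table. The conservative variant reads the current minimum before writing, so the two orders above leave different counters behind, which is why a byte-identical result requires replaying updates in a fixed order.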

Repository Layout

  • src/: Rust library and CLI implementation.
  • tests/basic.rs: package-friendly integration tests.
  • tests/java_parity.rs: repository-only Java parity tests requiring vendor/BBTools-master.
  • docs/: implementation status and parity notes.
  • scripts/: local parity, benchmark, and stress harnesses.
  • vendor/: BBTools reference snapshot for repository testing only.

License

bbnorm-rs is licensed under the BSD 3-Clause License. The vendored BBTools reference snapshot in the source repository is distributed under its own license at vendor/BBTools-master/license.txt and is not included in crates.io packages.