# bbnorm-rs

`bbnorm-rs` is a Rust implementation of the practical BBNorm read-depth
normalization workflow from BBTools. It focuses on local FASTA/FASTQ
normalization, histogram output, paired/interleaved routing, bounded memory
counting, and reproducible Java-parity behavior for covered modes.

This is an early working release, not a complete BBTools replacement. The Git
repository includes a vendored BBTools snapshot for parity tests; crates.io
packages intentionally exclude that snapshot to keep the package small.

## Install

From crates.io:

```bash
cargo install bbnorm-rs
```

From a checkout:

```bash
cargo install --path .
```

## Basic Usage

```bash
bbnorm-rs in=reads_R1.fq.gz in2=reads_R2.fq.gz \
  out=normalized_R1.fq.gz out2=normalized_R2.fq.gz \
  target=40 max=80 min=5 k=31 threads=8
```
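As rough intuition for `target=40`, a depth-normalizing filter keeps a read with probability `target / depth` once its estimated depth exceeds the target. The sketch below is a simplified illustration, not the crate's implementation: `keep_probability` and `keep_read` are hypothetical names, the read's depth is taken as given, the `min` and `max` rules are ignored, and the SplitMix64-style mixer only stands in for whatever deterministic source the real tool uses.

```rust
/// Probability of keeping a read whose estimated depth is `depth`.
/// Reads at or below the target are always kept in this simplified model.
fn keep_probability(depth: u32, target: u32) -> f64 {
    if depth <= target {
        1.0
    } else {
        f64::from(target) / f64::from(depth)
    }
}

/// Deterministic keep/toss decision derived from the read index.
/// The same (index, depth, target) triple always gives the same answer.
fn keep_read(read_index: u64, depth: u32, target: u32) -> bool {
    // SplitMix64-style mixing of the read index (an assumption of this sketch).
    let mut z = read_index.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^= z >> 31;
    // Map the top 53 bits to a uniform value in [0, 1).
    let u = (z >> 11) as f64 / (1u64 << 53) as f64;
    u < keep_probability(depth, target)
}
```

Under these assumptions, a read at depth 80 with `target=40` is kept about half the time, and repeating the run reproduces the same decisions.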

Common outputs:

```bash
bbnorm-rs in=reads.fq.gz out=keep.fq.gz outt=toss.fq.gz \
  hist=depth.tsv rhist=read_depth.tsv peaksout=peaks.tsv
```

Supported inputs and outputs include plain and gzip FASTA/FASTQ, single-end,
paired two-file, explicit interleaved, auto-interleaved, BBTools-style `#`
paired filename expansion, `null` output sinks, and common BBNorm-style
`key=value` aliases.
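The `#` convention follows BBTools, where a single `#` in a filename stands for `1` and `2` in the two files of a pair. A minimal sketch of that expansion, assuming only the first `#` is substituted (`expand_pair_pattern` is a hypothetical helper, not part of the crate's API):

```rust
/// Expands a BBTools-style `#` pattern into the two filenames of a read pair.
/// Returns `None` when the path contains no `#`.
fn expand_pair_pattern(path: &str) -> Option<(String, String)> {
    if !path.contains('#') {
        return None;
    }
    // Substitute only the first `#` so the rest of the path is left untouched.
    Some((path.replacen('#', "1", 1), path.replacen('#', "2", 1)))
}
```

For example, `reads_R#.fq.gz` would expand to `reads_R1.fq.gz` and `reads_R2.fq.gz`, while a path without `#` is left for the single-file or interleaved logic to handle.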

## Current Status

Verified locally:

- `cargo fmt --all -- --check`
- `cargo clippy --all-targets --all-features -- -D warnings`
- `cargo test --all`

The current suite has 242 library tests, 8 basic integration tests, and 108
Java-parity tests that run against the vendored BBTools snapshot.

Implemented working areas include:

- BBNorm-style CLI parsing for common normalization options and aliases.
- Plain and gzip FASTA/FASTQ I/O.
- Single-end, paired, and interleaved read routing.
- Short canonical k-mers and long BBTools-shaped hashed k-mers.
- Exact and bounded count-min counting paths.
- Depth histograms, read-depth histograms, and peak output.
- Deterministic normalization decisions for covered modes.
- Multipass, count-up, and ECC behavior for tested subsets.
- Guarded benchmark and parity harnesses in the source repository.
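To illustrate the short canonical k-mer idea from the list above: with 2 bits per base, any k up to 31 fits in a `u64`, and a k-mer and its reverse complement are collapsed to one representative so both strands count together. The sketch below is an assumption-laden illustration, not the crate's code: `canonical_kmers` is a hypothetical name, and it takes the numerically smaller encoding as canonical, which is a convention that BBTools may choose differently.

```rust
/// 2-bit encoding: A=0, C=1, G=2, T=3. Any other byte (for example N) is rejected.
fn encode_base(b: u8) -> Option<u64> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Returns the canonical 2-bit encoding of every k-mer in `seq`, for 1 <= k <= 31.
/// Windows containing a non-ACGT base are skipped.
fn canonical_kmers(seq: &[u8], k: usize) -> Vec<u64> {
    assert!((1..=31).contains(&k), "this sketch packs k-mers into a u64");
    let mask = (1u64 << (2 * k)) - 1;
    let shift = 2 * (k - 1);
    let (mut fwd, mut rev, mut len) = (0u64, 0u64, 0usize);
    let mut out = Vec::new();
    for &b in seq {
        match encode_base(b) {
            Some(x) => {
                // Roll the forward k-mer left and the reverse complement right.
                fwd = ((fwd << 2) | x) & mask;
                rev = (rev >> 2) | ((3 - x) << shift);
                len += 1;
                if len >= k {
                    out.push(fwd.min(rev));
                }
            }
            None => {
                // An ambiguous base invalidates every window that spans it.
                fwd = 0;
                rev = 0;
                len = 0;
            }
        }
    }
    out
}
```

A sequence and its reverse complement produce the same set of canonical values, which is what lets a depth counter treat both strands as one.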

Known gaps remain:

- Full BBTools sketch, prefilter, and cardinality/loglog collision parity is
  not complete.
- ECC and overlap behavior is covered by compact and biological stress tests,
  but not every BBMerge/BBNorm edge case.
- Large human-read benchmarks show faster deterministic bounded counting than
  earlier releases and peak RSS at or below Java's, but input counting remains
  the main speed bottleneck in some comparable modes.

See [docs/parity.md](docs/parity.md) and
[docs/component_buildout.md](docs/component_buildout.md) for the detailed
ledger. The acceptance matrix in
[docs/parity_matrix.md](docs/parity_matrix.md) is the primary reference for
deciding whether a mode is exact parity, bounded approximate parity, an
accepted Rust-over-Java divergence, or still a gap.

## Benchmark Snapshot

The acceptance matrix is the publishable benchmark source of truth. The latest
local matrix run at
`tmp/parity_acceptance_publish_ready_20260508/acceptance_summary.tsv` verified
9 bundled phiX exact-output modes and 6 local human bounded-sketch modes.

Exact bundled rows:

- `default`, `k=40`, `k=40 fixspikes=t`
- `passes=2`
- `keepall=t`
- `ecc=t markuncorrectableerrors=t`
- `qtrim=r trimq=10`
- `minlen=100`
- `passes=2 ecc=t markuncorrectableerrors=t`

Local human bounded-sketch rows:

| Slice | Mode | Verdict | Java time | Rust time | Java RSS | Rust RSS | Hist drift (ppm) | Rhist drift (ppm) |
| --- | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: |
| 50k | default | bounded_drift | 1.54s | 2.57s | 3.35 GiB | 3.30 GiB | 4 | 840 |
| 50k | prefilter | bounded_drift | 2.05s | 2.68s | 3.35 GiB | 3.10 GiB | 4 | 840 |
| 50k | k40_fixspikes | bounded_drift | 1.64s | 2.47s | 3.34 GiB | 3.12 GiB | 2 | 140 |
| 500k | default | bounded_drift | 9.03s | 30.12s | 3.38 GiB | 3.38 GiB | 1227 | 1492 |
| 500k | prefilter | bounded_drift | 10.53s | 31.81s | 3.26 GiB | 3.14 GiB | 49 | 998 |
| 500k | k40_fixspikes | bounded_drift | 10.82s | 28.60s | 3.39 GiB | 3.25 GiB | 245 | 982 |

The conservative publishable claim is: exact covered fixture modes match the
vendored Java oracle byte-for-byte, bounded human rows stay within the matrix
drift gate, and large-slice Rust speed still needs work in the input-counting
hot path. `countup=t` is tracked separately as an accepted Rust-over-Java
divergence guard rather than normal Java parity.
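The drift columns are in parts per million, but the exact formula used by the matrix is not stated here. As an illustration only, the sketch below assumes drift is the summed absolute per-bin difference between the Java and Rust histograms, divided by the Java total and scaled to ppm. `drift_ppm`, `within_gate`, and the gate value are hypothetical; check `docs/parity_matrix.md` for the real definition.

```rust
/// Hypothetical drift metric: total absolute per-bin difference between two
/// histograms, relative to the Java total, in parts per million.
/// Returns `None` when the Java histogram is empty or sums to zero.
fn drift_ppm(java: &[u64], rust: &[u64]) -> Option<f64> {
    let total: u64 = java.iter().sum();
    if total == 0 {
        return None;
    }
    let bins = java.len().max(rust.len());
    let mut diff: u64 = 0;
    for i in 0..bins {
        // Bins missing from the shorter histogram count as zero.
        let j = java.get(i).copied().unwrap_or(0);
        let r = rust.get(i).copied().unwrap_or(0);
        diff += j.abs_diff(r);
    }
    Some(diff as f64 * 1_000_000.0 / total as f64)
}

/// A row passes when its drift does not exceed the (hypothetical) gate.
fn within_gate(ppm: f64, gate_ppm: f64) -> bool {
    ppm <= gate_ppm
}
```

With made-up bins of `[500_000, 300_000, 200_000]` versus `[500_002, 299_998, 200_000]`, four reads out of a million land in a different bin, giving a drift of 4 ppm under this definition.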

For high-throughput bounded approximate runs where byte-stable collision order
is less important than speed, `deterministic=f` enables direct atomic packed
sketch updates and fuses input histogram collection into the normalization pass.
On the local 500k paired-human packed 16-bit lane at
`tmp/fastlane_atomic_packed_fusedhist_500k_compare_20260515_115057`, the
3-repeat median was 6.769s / 2.79 GiB RSS for Rust versus 7.814s / 3.39 GiB
RSS for Java. That is 13.4% faster wall time and 17.8% lower peak RSS, with
both tools run on the same input and read limits, `bits=16`, null read outputs,
and `hist` and `rhist` enabled.
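The percentages above are relative reductions against the Java figure. A small worked example of that arithmetic, using the wall-time medians quoted in the paragraph (`percent_reduction` is a hypothetical helper name):

```rust
/// Relative reduction of `candidate` against `baseline`, in percent.
/// Positive values mean the candidate is smaller (faster or leaner).
fn percent_reduction(baseline: f64, candidate: f64) -> f64 {
    (baseline - candidate) / baseline * 100.0
}
```

`percent_reduction(7.814, 6.769)` is about 13.37, which rounds to the 13.4% quoted for wall time. The RSS values are printed to only two decimals in GiB, so recomputing from them may differ in the last digit from a figure derived from raw measurements.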

For repeatable current baselines, use
[`scripts/benchmark_trustworthy_baseline.py`](scripts/benchmark_trustworthy_baseline.py).
It records git/tool/input metadata, command lines, raw run data, stage timings,
Java/Rust histogram drift, and aggregate p10/median/p90 summaries. See
[`docs/trustworthy_benchmarking.md`](docs/trustworthy_benchmarking.md) for the
Java-default and packed 16-bit benchmark lanes.
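For readers aggregating their own repeats, the sketch below shows one common way to compute a median and nearest-rank percentiles. It is not taken from the script: `median` and `percentile_nearest_rank` are hypothetical names, and the script may use a different percentile method (for example linear interpolation).

```rust
/// Median of the samples, averaging the two middle values for even counts.
fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    Some(if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    })
}

/// Nearest-rank percentile for `p` in 0..=100.
/// With only a few repeats, p10 and p90 collapse to the minimum and maximum.
fn percentile_nearest_rank(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Rank is 1-based: the smallest sample has rank 1.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}
```

With three repeats, as in the lanes described here, the nearest-rank p10 and p90 are simply the fastest and slowest runs, so the median is the most informative of the three summaries.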

The `v0.1.3` performance patch adds the high-throughput bounded approximate
fast lane above. The `v0.1.2` performance patch improved deterministic packed
bounded counting on the local 500k paired-human packed 16-bit lane. A final
3-repeat refresh at
`tmp/trustworthy_baseline_500k_bits16_final_20260508` measured the current
deterministic Rust median at 19.786s wall time, 14.726s input counting, and
3.04 GiB RSS; the Java median for the same lane was 8.387s wall time and
3.41 GiB RSS.

| Variant | Before | After | Change |
| --- | ---: | ---: | ---: |
| Rust deterministic wall time | 24.104s | 19.786s | 17.9% faster |
| Rust deterministic input counting | 19.038s | 14.726s | 22.7% faster |
| Rust deterministic max RSS | 3.41 GiB | 3.04 GiB | 10.9% lower |

Those measurements compare `tmp/trustworthy_baseline_500k_bits16_20260508`
against `tmp/trustworthy_baseline_500k_bits16_final_20260508`, with 3 repeats
per variant, null read outputs, `bits=16`, `reads=500000`, and
`tablereads=500000`.

Experimental GPU counting is documented in
[`docs/gpu_counting_integration.md`](docs/gpu_counting_integration.md). The
parity-safe GPU path must preserve deterministic chunk replay order; a naive
global GPU reduction looks faster but is semantically wrong for conservative
count-min updates. The current persistent CUDA helper is byte-identical to Rust
CPU on the tested lanes but remains slower than the CPU path, so it is kept
behind explicit `gpucounting=t gpupersistent=t` flags.
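The order sensitivity comes from the conservative update rule itself: each insert first reads the current minimum across a key's cells and then raises only the cells sitting at that minimum, so per-cell increments cannot be summed independently and merged later. The sketch below is a minimal CPU illustration of that rule with 16-bit cells. `ConservativeSketch`, its row-seeded mixer, and its sizes are assumptions of this example, not the crate's data structure or hash.

```rust
/// Minimal count-min sketch with conservative updates and saturating u16 cells.
struct ConservativeSketch {
    rows: Vec<Vec<u16>>,
    width: usize,
}

impl ConservativeSketch {
    fn new(depth: usize, width: usize) -> Self {
        assert!(depth > 0 && width > 0);
        Self { rows: vec![vec![0u16; width]; depth], width }
    }

    /// Row-seeded SplitMix64-style mixer (hypothetical hash for this sketch).
    fn index(&self, row: usize, key: u64) -> usize {
        let mut z = key ^ (row as u64 + 1).wrapping_mul(0x9E37_79B9_7F4A_7C15);
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        ((z ^ (z >> 31)) % self.width as u64) as usize
    }

    /// Estimated count: the minimum over the key's cell in each row.
    fn estimate(&self, key: u64) -> u16 {
        (0..self.rows.len())
            .map(|r| self.rows[r][self.index(r, key)])
            .min()
            .unwrap_or(0)
    }

    /// Conservative update: only cells equal to the current minimum are raised.
    /// The write depends on a read of the current state, hence the order sensitivity.
    fn add(&mut self, key: u64) {
        let current = self.estimate(key);
        if current == u16::MAX {
            return; // saturated
        }
        for r in 0..self.rows.len() {
            let i = self.index(r, key);
            if self.rows[r][i] == current {
                self.rows[r][i] = current + 1;
            }
        }
    }
}
```

A key inserted five times into an empty sketch estimates exactly 5, and later inserts of other keys can only push that estimate up, never below the true count. A reduction that adds per-cell totals without the read-then-write step computes a plain count-min sketch instead, which is why it cannot be byte-identical to the sequential path.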

## Repository Layout

- `src/`: Rust library and CLI implementation.
- `tests/basic.rs`: package-friendly integration tests.
- `tests/java_parity.rs`: repository-only Java parity tests requiring
  `vendor/BBTools-master`.
- `docs/`: implementation status and parity notes.
- `scripts/`: local parity, benchmark, and stress harnesses.
- `vendor/`: BBTools reference snapshot for repository testing only.

## License

`bbnorm-rs` is licensed under the BSD 3-Clause License. The vendored BBTools
reference snapshot in the source repository is distributed under its own license
at `vendor/BBTools-master/license.txt` and is not included in crates.io
packages.