![genemancer banner](https://docs.rs/crate/genemancer/0.2.0/source/images/genemancer_logo2.png)

# Genemancer

![Rust](https://img.shields.io/badge/Rust-edition%202024-orange?logo=rust)
![Version](https://img.shields.io/badge/version-0.2.5-blue)
![Noodles](https://img.shields.io/badge/noodles-powered-5c7cfa)
![WGPU](https://img.shields.io/badge/GPU-wgpu-2b8a3e)
![Status](https://img.shields.io/badge/commands-9%20available-brightgreen)
![Lifecycle](https://img.shields.io/badge/lifecycle-experimental-orange)

Genemancer is a Rust CLI toolkit for genomics file processing, built primarily on the `noodles` ecosystem, with optional GPU acceleration (`wgpu` or CUDA) for target-based variant aggregation.

## Toolkit

Current subcommands:

- `merge-bam` (implemented): merge multiple coordinate-sorted, indexed BAM files into one BAM, with optional BED filtering (`all|strict|trim`), read-group filtering, output index writing, and configurable compression level.
- `gff-to-gtf` (implemented): convert GFF3 annotations to GTF (stdin/stdout supported).
- `gtf-to-introns` (implemented): extract per-transcript intron intervals from GTF annotations (plain or `.gtf.gz`) and write GFF3 output. The default output name follows the reference script's convention (for example, `hg38.ncbiRefSeq.gtf.gz` becomes `hg38.ncbiRefSeq.introns.gff`). Introns are currently derived as the gaps between each transcript's exons; full parity with Bioconductor `intronicParts()` is still pending for complex overlapping transcript models.
- `call-targets` (implemented): call simple SNVs from BAM inputs over BED target intervals and write bgzipped VCF output (`.vcf.gz`) with index (`csi` default, optional `tbi`).
- `call-targets-gpu` (implemented): same pipeline as `call-targets`, but attempts GPU initialization and falls back to CPU unless `--require-gpu` is set.
- `split-bam` (implemented): split one or more coordinate-sorted BAM files into per-region BAMs from a BED file, with optional unassigned-read output and optional output indexing.
- `pod5` (implemented): namespace for POD5 operations exposed as `genemancer pod5 <operation>` (`validate` and `subsample` are implemented; `inspect` is scaffolded).
- `vcf` (in progress): namespace for VCF comparison workflows; `genemancer vcf diff` currently validates multisample-vs-multi-file input semantics and set definitions, but record loading and set-difference computation are still scaffolded.
- `cnloh` (in progress): SubChrom-inspired CNV/cnLOH detection pipeline (`cnloh detect`) that combines marker-filtered variant evidence with a BAM coverage summary and produces a single chromosome-colored `CNV.png` plot.

Global options:

- `-v/--verbose` (repeatable) for log verbosity.
- `-t/--threads` to control worker threads.
- `--log-file <FILE>` to mirror stderr logs to a file.

## Installation

Install from crates.io:

```bash
cargo install genemancer
```

Or install from the local repository checkout:

```bash
cargo install --path .
```

## Build And Run From Source

1. Install a Rust toolchain with edition 2024 support.
2. Build:
   ```bash
   cargo build
   ```
3. Show CLI help:
   ```bash
   cargo run -- --help
   ```

You can inspect any command with:

```bash
cargo run -- <subcommand> --help
```

## Installed Binary Usage

After installing with `cargo install genemancer` or `cargo install --path .`, run:

```bash
genemancer --help
genemancer <subcommand> --help
```

If `$HOME/.cargo/bin` is not on your `PATH`, use:

```bash
~/.cargo/bin/genemancer --help
```

## Usage Examples

The examples below assume you provide your own inputs and use a locally installed binary. In this repository, `*.bam` and `/test_data` are gitignored.
If you are running from source instead, replace `genemancer` with `cargo run --` (or `cargo run --features cuda --` for CUDA-enabled builds).

Merge two BAMs into one BAM with index output:

```bash
genemancer merge-bam \
  -i /path/to/input1.bam \
  -i /path/to/input2.bam \
  -o test_data/merged.bam \
  --index
```
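
Merging coordinate-sorted inputs while keeping the output sorted is commonly done with a k-way merge over a min-heap. The sketch below illustrates that general technique on plain `(reference_index, position)` keys; it is not taken from the genemancer source, and the function name and key type are hypothetical. Real BAM merging also has to handle unmapped reads and headers, which are omitted here.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical helper: merge several coordinate-sorted streams of
/// (reference_index, position) keys into one sorted stream.
/// Each inner Vec stands in for the records of one sorted BAM file.
pub fn merge_sorted_keys(inputs: Vec<Vec<(u32, u64)>>) -> Vec<(u32, u64)> {
    let mut iters: Vec<_> = inputs.into_iter().map(|v| v.into_iter()).collect();
    // `Reverse` turns the max-heap into a min-heap. The stream index is part
    // of the heap entry, so ties are broken by input order (deterministic).
    let mut heap = BinaryHeap::new();
    for (i, it) in iters.iter_mut().enumerate() {
        if let Some(key) = it.next() {
            heap.push(Reverse((key, i)));
        }
    }
    let mut merged = Vec::new();
    while let Some(Reverse((key, i))) = heap.pop() {
        merged.push(key);
        // Refill the heap from the stream that just yielded a record.
        if let Some(next) = iters[i].next() {
            heap.push(Reverse((next, i)));
        }
    }
    merged
}
```

The heap holds at most one pending record per input, so memory use grows with the number of inputs rather than with their size.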

Convert GFF3 to GTF:

```bash
genemancer gff-to-gtf \
  -i input.gff3 \
  -o output.gtf
```

Extract introns from a GTF or gzipped GTF:

```bash
genemancer gtf-to-introns /path/to/hg38.ncbiRefSeq.gtf.gz
# writes /path/to/hg38.ncbiRefSeq.introns.gff by default
```
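
To make "exon-gap introns" concrete, here is a minimal sketch of the idea for a single transcript, assuming 1-based inclusive coordinates as in GTF. The `Interval` type and `exon_gap_introns` function are hypothetical names for illustration and are not the crate's actual API.

```rust
/// Inclusive, 1-based interval, as in GTF/GFF3 coordinates.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Interval {
    pub start: u64,
    pub end: u64,
}

/// Hypothetical helper: return the gaps between one transcript's exons.
/// Exons may arrive unsorted; overlapping or adjacent exons yield no intron.
pub fn exon_gap_introns(exons: &[Interval]) -> Vec<Interval> {
    let mut sorted = exons.to_vec();
    sorted.sort(); // orders by start, then end
    let mut introns = Vec::new();
    let mut covered_end: Option<u64> = None;
    for exon in sorted {
        if let Some(prev_end) = covered_end {
            // A gap exists only if at least one base separates the exons.
            if exon.start > prev_end + 1 {
                introns.push(Interval { start: prev_end + 1, end: exon.start - 1 });
            }
        }
        // Track the furthest exon end seen so far to handle nested exons.
        covered_end = Some(covered_end.map_or(exon.end, |prev| prev.max(exon.end)));
    }
    introns
}
```

With exons at 100-200, 300-400, and 500-600, this returns introns 201-299 and 401-499. Because it works per transcript, it does not attempt the cross-transcript disjoint partitioning that `intronicParts()` performs.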

Call SNVs on target regions (CPU/streaming path):

```bash
genemancer call-targets \
  -i /path/to/bams_or_directory \
  -r /path/to/reference.fa.gz \
  -T /path/to/targets.bed \
  --rg-map references/rg_map.txt \
  -o test_data/out.vcf.gz
```

Run the GPU-enabled path (falls back to CPU by default):

```bash
genemancer call-targets-gpu \
  -i /path/to/bams_or_directory \
  -r /path/to/reference.fa.gz \
  -T /path/to/targets.bed \
  --rg-map references/rg_map.txt \
  --gpu-backend auto \
  -o test_data/out.vcf.gz
```
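
The fallback rule described above (use the GPU if it initializes, otherwise fall back to CPU unless `--require-gpu` is set) can be sketched as a small decision function. This is an illustrative sketch only: `Backend` and `choose_backend` are hypothetical names, and the error strings are invented for the example.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Backend {
    Gpu,
    Cpu,
}

/// Hypothetical helper: decide which backend to run on.
/// `gpu_init` is the outcome of attempting GPU initialization;
/// `require_gpu` mirrors the `--require-gpu` flag.
pub fn choose_backend(gpu_init: Result<(), String>, require_gpu: bool) -> Result<Backend, String> {
    match gpu_init {
        Ok(()) => Ok(Backend::Gpu),
        // With `--require-gpu`, a failed initialization is a hard error.
        Err(reason) if require_gpu => Err(format!("GPU required but unavailable: {}", reason)),
        // Otherwise, log the reason and continue on the CPU path.
        Err(reason) => {
            eprintln!("GPU initialization failed ({}); falling back to CPU", reason);
            Ok(Backend::Cpu)
        }
    }
}
```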

Run cnLOH/CNV detect with a single colored CNV plot:

```bash
genemancer cnloh detect \
  --sample sample_01 \
  --bam /path/to/sample_01_lane1.bam /path/to/sample_01_lane2.bam \
  --vcf /path/to/sample_01.vcf.gz \
  --vcf-sample sample_01 \
  --data-type WGS \
  --reference /path/to/reference.fa \
  --panel-bin WGS \
  --marker-dir /path/to/SNPmarker \
  --log-output test_data/cnloh/sample_01.cnloh.log \
  --plots true \
  --output test_data/cnloh
```

Get the SubChrom-compatible SNP marker databases (hg38/hg19):

```bash
curl -o SNPmarker_hg38.zip "https://zenodo.org/records/10155688/files/SNPmarker_hg38.zip?download=1" && \
  unzip SNPmarker_hg38.zip && rm SNPmarker_hg38.zip

curl -o SNPmarker_hg19.zip "https://zenodo.org/records/10155688/files/SNPmarker_hg19.zip?download=1" && \
  unzip SNPmarker_hg19.zip && rm SNPmarker_hg19.zip
```

Dockerfile form:

```dockerfile
RUN curl -o SNPmarker_hg38.zip "https://zenodo.org/records/10155688/files/SNPmarker_hg38.zip?download=1" && \
    unzip SNPmarker_hg38.zip && rm SNPmarker_hg38.zip
RUN curl -o SNPmarker_hg19.zip "https://zenodo.org/records/10155688/files/SNPmarker_hg19.zip?download=1" && \
    unzip SNPmarker_hg19.zip && rm SNPmarker_hg19.zip
```

`cnloh detect` defaults to canonical chromosomes (`chr1-22`, `chrX`, `chrY`) for output/plot rows.
Use `--include-noncanonical` to keep all contigs.
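
As an illustration of what a canonical-chromosome filter of this kind looks like, the sketch below accepts exactly `chr1` to `chr22`, `chrX`, and `chrY`. The `is_canonical` function is a hypothetical name, and treating unprefixed names such as `1` as non-canonical is an assumption; the actual implementation may differ.

```rust
/// Hypothetical predicate: true for chr1..chr22, chrX, and chrY only.
/// Assumes UCSC-style "chr"-prefixed contig names.
pub fn is_canonical(contig: &str) -> bool {
    match contig.strip_prefix("chr") {
        Some("X") | Some("Y") => true,
        Some(rest) => {
            // Require plain digits without a leading zero, so forms such as
            // "chr01" or "chr+5" are rejected before parsing.
            rest.bytes().all(|b| b.is_ascii_digit())
                && !rest.starts_with('0')
                && matches!(rest.parse::<u8>(), Ok(1..=22))
        }
        None => false,
    }
}
```

Under this sketch, `chrM` and alternate contigs such as `chr1_KI270706v1_random` are filtered out.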

Build/install with CUDA support and force CUDA backend:

```bash
cargo install --path . --features cuda --force
genemancer call-targets-gpu \
  -i /path/to/bams_or_directory \
  -r /path/to/reference.fa.gz \
  -T /path/to/targets.bed \
  --rg-map references/rg_map.txt \
  --gpu-backend cuda \
  --cuda-device 0 \
  --require-gpu \
  -o test_data/out.vcf.gz
```

Tune GPU behavior explicitly (optional overrides on top of auto/hybrid tuning):

```bash
genemancer call-targets-gpu \
  -i /path/to/bams_or_directory \
  -r /path/to/reference.fa.gz \
  -T /path/to/targets.bed \
  --rg-map references/rg_map.txt \
  --gpu-backend auto \
  --tuning-mode hybrid \
  --tuning-profile throughput \
  --tuning-scale-percent 120 \
  --wgpu-matrix-utilization-percent 96 \
  --wgpu-upload-utilization-percent 98 \
  --max-obs-upload 64000000 \
  --stream-matrix-budget-mib 1024 \
  --defer-cuda-aggregation \
  -o test_data/out.vcf.gz
```

Split multiple BAMs by BED regions into an output folder:

```bash
genemancer split-bam \
  -i /path/to/input1.bam \
  -i /path/to/input2.bam \
  --bed /path/to/targets.bed \
  --out-dir test_data/splits \
  --output-prefix panel \
  --write-indices \
  --unassigned test_data/splits/unassigned.bam
```

Run POD5 operations:

```bash
# `inspect` is scaffolded only; `validate` and `subsample` are implemented
genemancer pod5 inspect -i /path/to/reads.pod5
genemancer pod5 validate -i /path/to/reads.pod5
genemancer pod5 subsample \
  --input /path/to/run_a.pod5 \
  /path/to/run_b.pod5 \
  --percent 10 \
  --output /path/to/subsampled_outputs
```

`pod5 subsample` accepts both repeated and multi-value `--input`, so shell glob expansion works:
`--input *.pod5`.

If the POD5 shared library is not auto-detected in your environment, set:

```bash
export GENEMANCER_POD5_LIB=/path/to/lib_pod5/pod5_format_pybind*.so
```

## Repository Data

- `references`: helper scripts and a tracked sample RG map (`references/rg_map.txt`).
- `tests/data`: tracked `.bai` files only.
- Local working datasets are expected under `test_data/` (ignored by git).

## Ignored Paths

- `*.bam`: BAM files are not tracked.
- `/test_data`: local working datasets.

## Notes

- `call-targets` may prepare a sorted/indexed BGZF FASTA companion (`*.sorted.fa.gz` plus indexes) when the provided reference is not already in an indexed form suitable for random access.

## cnloh Current State

- Variant evidence input precedence is `--vcf` > `--snp` > BAM marker-site pileup (with a warning when both `--vcf` and `--snp` are provided).
- Multiple `--bam` inputs are aggregated as one sample in v1.
- BAM scanning is multithreaded, and the merged outputs are deterministic.
- By default, read filtering skips duplicate, secondary, and supplementary alignments (opt-in include flags are available).
- Plot generation emits one combined chromosome-colored CNV/cnLOH figure (`*.CNV.png`) with coverage/CN/cnLOH panels.
- Marker filtering is strict in v1; when marker filtering yields zero variants, `cnloh detect` exits with an error.
- Canonical-chromosome filtering is enabled by default; use `--include-noncanonical` to disable it.
- `--variant-mode broad` and `--sample-mode rg` are scaffolded for later versions but not implemented in v1.
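
The input precedence in the first bullet above (`--vcf`, then `--snp`, then BAM marker-site pileup) can be sketched as a single match. The `EvidenceSource` enum, the `select_evidence` function, and the warning text are hypothetical and only illustrate the rule; they are not the crate's API.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum EvidenceSource<'a> {
    Vcf(&'a str),
    Snp(&'a str),
    BamPileup,
}

/// Hypothetical helper: pick the variant evidence source from the optional
/// `--vcf` and `--snp` paths. Returns the source plus an optional warning.
pub fn select_evidence<'a>(
    vcf: Option<&'a str>,
    snp: Option<&'a str>,
) -> (EvidenceSource<'a>, Option<String>) {
    match (vcf, snp) {
        // `--vcf` wins when both are given; the caller should log the warning.
        (Some(v), Some(_)) => (
            EvidenceSource::Vcf(v),
            Some("both --vcf and --snp were provided; using --vcf".to_string()),
        ),
        (Some(v), None) => (EvidenceSource::Vcf(v), None),
        (None, Some(s)) => (EvidenceSource::Snp(s), None),
        // With neither input, fall back to pileup at marker sites in the BAMs.
        (None, None) => (EvidenceSource::BamPileup, None),
    }
}
```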

## TODO

| Area | Task | Status | Notes |
| --- | --- | --- | --- |
| `split-bam` | Add end-to-end fixture coverage for overlap edge cases | TODO | Validate multi-overlap and boundary behavior |
| `call-targets` | Add end-to-end integration tests on small fixture set | TODO | Validate VCF content + index generation |
| `call-targets` | Reject invalid BED intervals (`end <= start`) with a hard error | TODO | Current loader silently skips these rows |
| `call-targets-gpu` | Expand GPU backend validation matrix | TODO | Cover Vulkan/Metal/DX12 fallback behavior |
| `call-targets-gpu` | Honor `--threads` for scan worker fanout | TODO | Current streaming path uses one worker per input BAM |
| `merge-bam` | Add CRAM input/output support | TODO | Current implementation is BAM-focused |
| `merge-bam` | Align `--index-path` docs with implementation or add CSI writing | TODO | CLI/docs say BAI-or-CSI but writer is BAI-only |
| `merge-bam` | Reject zero-length BED intervals (`end == start`) | TODO | Current validation only rejects `end < start` |
| `gtf-to-introns` | Align Rust intron extraction semantics with Bioconductor `intronicParts()` | TODO | Current implementation emits transcript exon-gap introns; needs fixture-level parity checks for overlapping isoforms/shared exons |
| `cnloh` | Chunk 1: Freeze SubChrom parity spec + fixture baselines | DONE | See `docs/cnloh_parity_spec.md`, `docs/cnloh_baseline_fixtures.md`, and `tests/cnloh_parity_baseline.rs` |
| `cnloh` | Chunk 2: Add SubChrom-like marker/VAF preprocessing (`minCOV`, `minMAC`) | DONE | Emits `vaf_preprocessed.tsv` + `marker_chrom_stats.tsv` and summary keys |
| `cnloh` | Chunk 3: Implement VAF/MAF + ROH segmentation parity passes | DONE | Emits `maf.tsv`, `vaf_segments.tsv`, `roh_segments.tsv`, and `vaf_roh_segments.tsv` |
| `cnloh` | Chunk 4: Align coverage segmentation merge rules with VAF/ROH segments | DONE | Emits `coverage_vaf_segments.tsv` and uses unified coverage+VAF marker gating in event filtering |
| `cnloh` | Chunk 5: Align event classification + TF estimation with SubChrom intent | TODO | Include allele-specific event summary fields and tolerance-based tests |
| `cnloh` | Chunk 6: Finalize CNV visualization parity and end-to-end validation | TODO | Snapshot plot metadata and add parity regression tests |
| `cnloh` | Refactor monolithic `cnloh` implementation into smaller modules | TODO | Split `src/cnloh.rs` into focused units (I/O, preprocessing, segmentation, events, plotting) with clear interfaces |
| `cnloh` | Improve inline comments and developer-facing documentation | TODO | Add targeted code comments, module docs, and output-file docs for maintainability |
| `cnloh` | Implement `--variant-mode broad` | TODO | CLI mode is scaffolded but currently hard-fails |
| `cnloh` | Implement RG-aware `--sample-mode rg` workflow | TODO | v1 aggregates all BAMs as one sample |
| `cnloh` | Add fixture-level integration tests for strict marker-overlap behavior | TODO | Validate hard-fail path when marker overlap is zero |
| `cnloh` | Add event-level segmentation/calling outputs (cnLOH/CN event table) | TODO | Current output is coverage/variant summaries + plot |
| `cnloh` | Expand marker database fixtures beyond minimal toy set | TODO | Improves realistic plot coverage in tracked tests |
| Docs | Add example outputs and expected file artifacts per command | TODO | Make quick verification easier for users |
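
Two rows in the table above concern BED interval validation (`call-targets` and `merge-bam`). Since BED intervals are 0-based and half-open, any row with `end <= start` covers no bases. A strict loader along the lines those rows describe might look like the sketch below; `parse_bed_interval` is a hypothetical function, not the current loader.

```rust
/// Hypothetical strict parser for the first three BED columns.
/// Returns (chrom, start, end) or an error message; rows with
/// `end <= start` are rejected instead of being skipped silently.
pub fn parse_bed_interval(line: &str) -> Result<(String, u64, u64), String> {
    let fields: Vec<&str> = line.trim_end().split('\t').collect();
    if fields.len() < 3 {
        return Err(format!(
            "expected at least 3 tab-separated fields, got {}",
            fields.len()
        ));
    }
    let start = fields[1]
        .parse::<u64>()
        .map_err(|e| format!("invalid start '{}': {}", fields[1], e))?;
    let end = fields[2]
        .parse::<u64>()
        .map_err(|e| format!("invalid end '{}': {}", fields[2], e))?;
    // Half-open [start, end): an empty or inverted interval is an error.
    if end <= start {
        return Err(format!(
            "invalid interval {}:{}-{} (end <= start)",
            fields[0], start, end
        ));
    }
    Ok((fields[0].to_string(), start, end))
}
```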