
# Genemancer

Genemancer is a Rust CLI toolkit for genomics file processing, built primarily on the `noodles` ecosystem, with optional GPU acceleration (`wgpu` or CUDA) for target-based variant aggregation.
## Toolkit
Current subcommands:
- `merge-bam` (implemented): merge multiple coordinate-sorted, indexed BAM files into one BAM, with optional BED filtering (`all|strict|trim`), read-group filtering, output index writing, and configurable compression level.
- `gff-to-gtf` (implemented): convert GFF3 annotations to GTF (stdin/stdout supported).
- `gtf-to-introns` (implemented): extract transcript intron intervals from GTF annotations and write GFF3 output, with `.gtf.gz` input support and reference-script-style default output naming. Current behavior derives transcript exon-gap introns; full Bioconductor `intronicParts()` parity is still pending for complex overlapping transcript models.
- `call-targets` (implemented): call simple SNVs from BAM inputs over BED target intervals and write bgzipped VCF output (`.vcf.gz`) with index (`csi` default, optional `tbi`).
- `call-targets-gpu` (implemented): same pipeline as `call-targets`, but attempts GPU initialization and falls back to CPU unless `--require-gpu` is set.
- `split-bam` (implemented): split one or more coordinate-sorted BAM files into per-region BAMs from a BED file, with optional unassigned-read output and optional output indexing.
- `pod5` (implemented): namespace for POD5 operations exposed as `genemancer pod5 <operation>` (`validate` and `subsample` are implemented; `inspect` is scaffolded).
- `vcf` (in progress): namespace for VCF comparison workflows; `genemancer vcf diff` currently validates multisample-vs-multi-file input semantics and set definitions, but record loading and set-difference computation are still scaffolded.
- `cnloh` (in progress): SubChrom-inspired CNV/cnLOH detection pipeline (`genemancer cnloh detect`) with marker-filtered variant evidence, a BAM coverage summary, and a single chromosome-colored `CNV.png` plot.
Global options:
- `-v/--verbose` (repeatable) for log verbosity.
- `-t/--threads` to control worker threads.
- `--log-file <FILE>` to mirror stderr logs to a file.
## Installation
Install from crates.io:
```bash
cargo install genemancer
```
Or install from the local repository checkout:
```bash
cargo install --path .
```
## Build And Run From Source
1. Install a Rust toolchain with edition 2024 support (Rust 1.85 or newer).
2. Build:
```bash
cargo build
```
3. Show CLI help:
```bash
cargo run -- --help
```
You can inspect any command with:
```bash
cargo run -- <subcommand> --help
```
## Installed Binary Usage
After installing with `cargo install genemancer` or `cargo install --path .`, run:
```bash
genemancer --help
genemancer <subcommand> --help
```
If `$HOME/.cargo/bin` is not on your `PATH`, use:
```bash
~/.cargo/bin/genemancer --help
```
## Usage Examples
Examples below assume you provide your own inputs and use a locally installed binary. In this repository, `*.bam` and `/test_data` are gitignored.
If you are running from source instead, prefix with `cargo run --` (or `cargo run --features cuda --` for CUDA-enabled builds).
Merge two BAMs into one BAM with index output:
```bash
genemancer merge-bam \
-i /path/to/input1.bam \
-i /path/to/input2.bam \
-o test_data/merged.bam \
--index
```
Convert GFF3 to GTF:
```bash
genemancer gff-to-gtf \
-i input.gff3 \
-o output.gtf
```
Extract introns from a GTF or gzipped GTF:
```bash
genemancer gtf-to-introns /path/to/hg38.ncbiRefSeq.gtf.gz
# writes /path/to/hg38.ncbiRefSeq.introns.gff by default
```
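The "exon-gap" behavior described for `gtf-to-introns` can be illustrated with a minimal sketch. This is not Genemancer's actual implementation: the `Interval` type and `exon_gap_introns` function are hypothetical names, and the sketch assumes 0-based half-open coordinates for simplicity (GTF itself is 1-based and inclusive).

```rust
/// Hypothetical half-open, 0-based interval [start, end) on one transcript.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Interval {
    pub start: u64,
    pub end: u64,
}

/// Derive introns as the gaps between consecutive exons of a single transcript.
/// Exons are sorted by start first; overlapping or abutting exons yield no gap.
pub fn exon_gap_introns(exons: &[Interval]) -> Vec<Interval> {
    let mut sorted: Vec<Interval> = exons.to_vec();
    sorted.sort_by_key(|e| (e.start, e.end));

    let mut introns = Vec::new();
    let mut prev_end: Option<u64> = None;
    for exon in sorted {
        if let Some(end) = prev_end {
            if exon.start > end {
                // The gap between the previous exon end and this exon start is an intron.
                introns.push(Interval { start: end, end: exon.start });
            }
        }
        // Track the furthest exon end seen so far so overlapping exons are tolerated.
        prev_end = Some(prev_end.map_or(exon.end, |e| e.max(exon.end)));
    }
    introns
}
```

Because each transcript is processed on its own, overlapping isoforms can produce overlapping introns, which is where this approach differs from a disjoint-parts model such as `intronicParts()`.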
Call SNVs on target regions (CPU/streaming path):
```bash
genemancer call-targets \
-i /path/to/bams_or_directory \
-r /path/to/reference.fa.gz \
-T /path/to/targets.bed \
--rg-map references/rg_map.txt \
-o test_data/out.vcf.gz
```
Run the GPU-enabled path (falls back to CPU by default):
```bash
genemancer call-targets-gpu \
-i /path/to/bams_or_directory \
-r /path/to/reference.fa.gz \
-T /path/to/targets.bed \
--rg-map references/rg_map.txt \
--gpu-backend auto \
-o test_data/out.vcf.gz
```
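The fallback rule (use the GPU when initialization succeeds, otherwise fall back to CPU unless `--require-gpu` is set) can be sketched as follows. The `Backend` enum and `select_backend` function are hypothetical and only model the decision; they are not taken from the Genemancer source.

```rust
/// Hypothetical backend choice after attempting GPU initialization.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Cpu,
    Gpu,
}

/// Decide which backend to run on.
/// `gpu_init` is the outcome of GPU initialization; `require_gpu` mirrors `--require-gpu`.
pub fn select_backend(gpu_init: Result<(), String>, require_gpu: bool) -> Result<Backend, String> {
    match gpu_init {
        // GPU came up: use it.
        Ok(()) => Ok(Backend::Gpu),
        // GPU failed and the user demanded it: surface a hard error.
        Err(reason) if require_gpu => Err(format!("GPU required but unavailable: {reason}")),
        // GPU failed but is optional: fall back to the CPU path.
        Err(_) => Ok(Backend::Cpu),
    }
}
```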
Run cnLOH/CNV detect with a single colored CNV plot:
```bash
genemancer cnloh detect \
--sample sample_01 \
--bam /path/to/sample_01_lane1.bam /path/to/sample_01_lane2.bam \
--vcf /path/to/sample_01.vcf.gz \
--vcf-sample sample_01 \
--data-type WGS \
--reference /path/to/reference.fa \
--panel-bin WGS \
--marker-dir /path/to/SNPmarker \
--log-output test_data/cnloh/sample_01.cnloh.log \
--plots true \
--output test_data/cnloh
```
Get the SubChrom-compatible SNP marker databases (hg38/hg19):
```bash
curl -o SNPmarker_hg38.zip "https://zenodo.org/records/10155688/files/SNPmarker_hg38.zip?download=1" && \
unzip SNPmarker_hg38.zip && rm SNPmarker_hg38.zip
curl -o SNPmarker_hg19.zip "https://zenodo.org/records/10155688/files/SNPmarker_hg19.zip?download=1" && \
unzip SNPmarker_hg19.zip && rm SNPmarker_hg19.zip
```
Dockerfile form:
```dockerfile
RUN curl -o SNPmarker_hg38.zip "https://zenodo.org/records/10155688/files/SNPmarker_hg38.zip?download=1" && \
unzip SNPmarker_hg38.zip && rm SNPmarker_hg38.zip
RUN curl -o SNPmarker_hg19.zip "https://zenodo.org/records/10155688/files/SNPmarker_hg19.zip?download=1" && \
unzip SNPmarker_hg19.zip && rm SNPmarker_hg19.zip
```
`cnloh detect` defaults to canonical chromosomes (`chr1-22`, `chrX`, `chrY`) for output/plot rows.
Use `--include-noncanonical` to keep all contigs.
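As a rough illustration of the canonical-chromosome default, the check below accepts only `chr1` through `chr22`, `chrX`, and `chrY`. The function name `is_canonical` is hypothetical, and the sketch assumes `chr`-prefixed contig names; how Genemancer treats unprefixed names such as `1` or `X` is not stated here.

```rust
/// Returns true only for chr1..chr22, chrX, and chrY (assumes `chr`-prefixed names).
pub fn is_canonical(contig: &str) -> bool {
    let Some(name) = contig.strip_prefix("chr") else {
        return false;
    };
    match name {
        "X" | "Y" => true,
        // Compare against the exact decimal spellings so "01" or "+5" are rejected.
        _ => (1..=22u8).any(|n| n.to_string() == name),
    }
}
```

Under this sketch, contigs such as `chrM` or `chr1_KI270706v1_random` are dropped unless filtering is disabled.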
Build and install with CUDA support, then force the CUDA backend:
```bash
cargo install --path . --features cuda --force
genemancer call-targets-gpu \
-i /path/to/bams_or_directory \
-r /path/to/reference.fa.gz \
-T /path/to/targets.bed \
--rg-map references/rg_map.txt \
--gpu-backend cuda \
--cuda-device 0 \
--require-gpu \
-o test_data/out.vcf.gz
```
Tune GPU behavior explicitly (optional overrides on top of auto/hybrid tuning):
```bash
genemancer call-targets-gpu \
-i /path/to/bams_or_directory \
-r /path/to/reference.fa.gz \
-T /path/to/targets.bed \
--rg-map references/rg_map.txt \
--gpu-backend auto \
--tuning-mode hybrid \
--tuning-profile throughput \
--tuning-scale-percent 120 \
--wgpu-matrix-utilization-percent 96 \
--wgpu-upload-utilization-percent 98 \
--max-obs-upload 64000000 \
--stream-matrix-budget-mib 1024 \
--defer-cuda-aggregation \
-o test_data/out.vcf.gz
```
Split multiple BAMs by BED regions into an output folder:
```bash
genemancer split-bam \
-i /path/to/input1.bam \
-i /path/to/input2.bam \
--bed /path/to/targets.bed \
--out-dir test_data/splits \
--output-prefix panel \
--write-indices \
--unassigned test_data/splits/unassigned.bam
```
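Conceptually, splitting assigns each read to the BED regions it overlaps and routes reads with no overlap to the unassigned output. The sketch below is an assumption about that logic, not the tool's code: `Region` and `overlapping_regions` are hypothetical names, all coordinates are 0-based half-open (as in BED), and a single contig is assumed.

```rust
/// Hypothetical BED region on one contig, 0-based half-open [start, end).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Region {
    pub name: String,
    pub start: u64,
    pub end: u64,
}

/// Return the names of all regions that overlap the read span [read_start, read_end).
/// An empty result means the read would go to the unassigned output.
pub fn overlapping_regions<'a>(
    regions: &'a [Region],
    read_start: u64,
    read_end: u64,
) -> Vec<&'a str> {
    regions
        .iter()
        // Two half-open intervals overlap when each starts before the other ends.
        .filter(|r| read_start < r.end && r.start < read_end)
        .map(|r| r.name.as_str())
        .collect()
}
```

Whether a read that overlaps several regions is written to each per-region BAM or only to one is a policy decision this sketch leaves open.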
Run POD5 operations:
```bash
# `inspect` is currently scaffolded; `validate` and `subsample` are implemented
genemancer pod5 inspect -i /path/to/reads.pod5
genemancer pod5 validate -i /path/to/reads.pod5
genemancer pod5 subsample \
--input /path/to/run_a.pod5 \
/path/to/run_b.pod5 \
--percent 10 \
--output /path/to/subsampled_outputs
```
`pod5 subsample` accepts both repeated and multi-value `--input`, so shell glob expansion works:
`--input *.pod5`.
If the POD5 shared library is not auto-detected in your environment, set:
```bash
export GENEMANCER_POD5_LIB=/path/to/lib_pod5/pod5_format_pybind*.so
```
## Repository Data
- `references`: helper scripts and a tracked sample RG map (`references/rg_map.txt`).
- `tests/data`: tracked `.bai` files only.
- Local working datasets are expected under `test_data/` (ignored by git).
## Ignored Paths
- `*.bam` and `/test_data` are gitignored.
## Notes
- `call-targets` may prepare a sorted/indexed BGZF FASTA companion (`*.sorted.fa.gz` plus indexes) when the provided reference is not already in an indexed form suitable for random access.
## cnloh Current State
- Variant evidence input precedence is `--vcf` > `--snp` > BAM marker-site pileup (with a warning when both `--vcf` and `--snp` are provided).
- Multiple `--bam` inputs are aggregated as one sample in v1.
- BAM scanning is multithreaded, and the merged outputs are deterministic.
- By default, read filtering skips duplicate, secondary, and supplementary alignments (opt-in include flags are available).
- Plot generation emits one combined chromosome-colored CNV/cnLOH figure (`*.CNV.png`) with coverage/CN/cnLOH panels.
- Marker filtering is strict in v1; when marker filtering yields zero variants, `cnloh detect` exits with an error.
- Canonical-chromosome filtering is enabled by default; use `--include-noncanonical` to disable it.
- `--variant-mode broad` and `--sample-mode rg` are scaffolded for later versions but not implemented in v1.
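The evidence precedence described above (`--vcf` over `--snp` over BAM marker-site pileup, with a warning when both are given) can be sketched as a small selection function. `EvidenceSource` and `select_evidence` are hypothetical names used only for illustration, and the warning text is invented.

```rust
/// Hypothetical source of variant evidence for `cnloh detect`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EvidenceSource {
    Vcf,
    Snp,
    BamPileup,
}

/// Pick the evidence source by precedence and return an optional warning.
/// `has_vcf` and `has_snp` indicate whether `--vcf` and `--snp` were supplied.
pub fn select_evidence(has_vcf: bool, has_snp: bool) -> (EvidenceSource, Option<&'static str>) {
    match (has_vcf, has_snp) {
        // Both provided: `--vcf` wins, and the caller should log the warning.
        (true, true) => (
            EvidenceSource::Vcf,
            Some("both --vcf and --snp provided; using --vcf"),
        ),
        (true, false) => (EvidenceSource::Vcf, None),
        (false, true) => (EvidenceSource::Snp, None),
        // Neither provided: fall back to pileup at marker sites in the BAM inputs.
        (false, false) => (EvidenceSource::BamPileup, None),
    }
}
```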
## TODO
| Area | Task | Status | Notes |
| --- | --- | --- | --- |
| `split-bam` | Add end-to-end fixture coverage for overlap edge cases | TODO | Validate multi-overlap and boundary behavior |
| `call-targets` | Add end-to-end integration tests on small fixture set | TODO | Validate VCF content + index generation |
| `call-targets` | Reject invalid BED intervals (`end <= start`) with a hard error | TODO | Current loader silently skips these rows |
| `call-targets-gpu` | Expand GPU backend validation matrix | TODO | Cover Vulkan/Metal/DX12 fallback behavior |
| `call-targets-gpu` | Honor `--threads` for scan worker fanout | TODO | Current streaming path uses one worker per input BAM |
| `merge-bam` | Add CRAM input/output support | TODO | Current implementation is BAM-focused |
| `merge-bam` | Align `--index-path` docs with implementation or add CSI writing | TODO | CLI/docs say BAI-or-CSI but writer is BAI-only |
| `merge-bam` | Reject zero-length BED intervals (`end == start`) | TODO | Current validation only rejects `end < start` |
| `gtf-to-introns` | Align Rust intron extraction semantics with Bioconductor `intronicParts()` | TODO | Current implementation emits transcript exon-gap introns; needs fixture-level parity checks for overlapping isoforms/shared exons |
| `cnloh` | Chunk 1: Freeze SubChrom parity spec + fixture baselines | DONE | See `docs/cnloh_parity_spec.md`, `docs/cnloh_baseline_fixtures.md`, and `tests/cnloh_parity_baseline.rs` |
| `cnloh` | Chunk 2: Add SubChrom-like marker/VAF preprocessing (`minCOV`, `minMAC`) | DONE | Emits `vaf_preprocessed.tsv` + `marker_chrom_stats.tsv` and summary keys |
| `cnloh` | Chunk 3: Implement VAF/MAF + ROH segmentation parity passes | DONE | Emits `maf.tsv`, `vaf_segments.tsv`, `roh_segments.tsv`, and `vaf_roh_segments.tsv` |
| `cnloh` | Chunk 4: Align coverage segmentation merge rules with VAF/ROH segments | DONE | Emits `coverage_vaf_segments.tsv` and uses unified coverage+VAF marker gating in event filtering |
| `cnloh` | Chunk 5: Align event classification + TF estimation with SubChrom intent | TODO | Include allele-specific event summary fields and tolerance-based tests |
| `cnloh` | Chunk 6: Finalize CNV visualization parity and end-to-end validation | TODO | Snapshot plot metadata and add parity regression tests |
| `cnloh` | Refactor monolithic `cnloh` implementation into smaller modules | TODO | Split `src/cnloh.rs` into focused units (I/O, preprocessing, segmentation, events, plotting) with clear interfaces |
| `cnloh` | Improve inline comments and developer-facing documentation | TODO | Add targeted code comments, module docs, and output-file docs for maintainability |
| `cnloh` | Implement `--variant-mode broad` | TODO | CLI mode is scaffolded but currently hard-fails |
| `cnloh` | Implement RG-aware `--sample-mode rg` workflow | TODO | v1 aggregates all BAMs as one sample |
| `cnloh` | Add fixture-level integration tests for strict marker-overlap behavior | TODO | Validate hard-fail path when marker overlap is zero |
| `cnloh` | Add event-level segmentation/calling outputs (cnLOH/CN event table) | TODO | Current output is coverage/variant summaries + plot |
| `cnloh` | Expand marker database fixtures beyond minimal toy set | TODO | Improves realistic plot coverage in tracked tests |
| Docs | Add example outputs and expected file artifacts per command | TODO | Make quick verification easier for users |
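The BED validation items above (rejecting `end <= start` with a hard error instead of skipping the row) could look roughly like this. It is a sketch of the intended behavior, not the current loader: `parse_bed_interval` and `parse_coord` are hypothetical helpers, and only the first three tab-separated BED columns are read.

```rust
/// Parse the first three columns of a BED line and reject empty or inverted intervals.
/// Returns (chrom, start, end) in 0-based half-open coordinates.
pub fn parse_bed_interval(line: &str) -> Result<(String, u64, u64), String> {
    let mut fields = line.split('\t');
    let chrom = fields
        .next()
        .filter(|s| !s.is_empty())
        .ok_or_else(|| "missing chrom field".to_string())?;
    let start = parse_coord(fields.next(), "start")?;
    let end = parse_coord(fields.next(), "end")?;
    // Hard error for zero-length (end == start) and inverted (end < start) intervals.
    if end <= start {
        return Err(format!(
            "invalid BED interval {chrom}:{start}-{end}: end must be greater than start"
        ));
    }
    Ok((chrom.to_string(), start, end))
}

/// Parse one coordinate column, naming the column in any error message.
fn parse_coord(field: Option<&str>, name: &str) -> Result<u64, String> {
    let text = field.ok_or_else(|| format!("missing {name} field"))?;
    text.trim()
        .parse::<u64>()
        .map_err(|e| format!("invalid {name} '{text}': {e}"))
}
```

Returning an error rather than skipping the row makes malformed target files visible at load time.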