genemancer-0.2.5 is not a library.


Genemancer

Genemancer is a Rust CLI toolkit for genomics file processing, built primarily on the noodles ecosystem, with optional GPU acceleration (wgpu or CUDA) for target-based variant aggregation.

Toolkit

Current subcommands:

merge-bam (implemented): merge multiple coordinate-sorted, indexed BAM files into one BAM, with optional BED filtering (all|strict|trim), read-group filtering, output index writing, and configurable compression level.
gff-to-gtf (implemented): convert GFF3 annotations to GTF (stdin/stdout supported).
gtf-to-introns (implemented): extract transcript intron intervals from GTF annotations and write GFF3 output, with .gtf.gz input support and reference-script-style default output naming. Current behavior derives transcript exon-gap introns; full Bioconductor intronicParts() parity is still pending for complex overlapping transcript models.
call-targets (implemented): call simple SNVs from BAM inputs over BED target intervals and write bgzipped VCF output (.vcf.gz) with index (csi default, optional tbi).
call-targets-gpu (implemented): same pipeline as call-targets, but attempts GPU initialization and falls back to CPU unless --require-gpu is set.
split-bam (implemented): split one or more coordinate-sorted BAM files into per-region BAMs from a BED file, with optional unassigned-read output and optional output indexing.
pod5 (implemented): namespace for POD5 operations exposed as genemancer pod5 <operation> (validate and subsample are implemented; inspect is scaffolded).
vcf (in progress): namespace for VCF comparison workflows; genemancer vcf diff currently validates multisample-vs-multi-file input semantics and set definitions, but record loading and set-difference computation are still scaffolded.
cnloh (in progress): SubChrom-inspired CNV/cnLOH detect pipeline with marker-filtered variant evidence, BAM coverage summary, and a single chromosome-colored CNV.png plot.
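
The "transcript exon-gap introns" behavior described for gtf-to-introns can be sketched as follows. This is a minimal illustration under stated assumptions, not the crate's actual code: the `Interval` type and the `exon_gap_introns` function are hypothetical names, and the input is assumed to be the exons of a single transcript in 1-based, closed GTF coordinates.

```rust
/// A closed, 1-based interval, matching GTF/GFF3 coordinate conventions.
/// (Hypothetical type for illustration; not taken from genemancer.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Interval {
    start: u64,
    end: u64,
}

/// Derive introns as the gaps between consecutive exons of ONE transcript.
/// Exons are sorted first; overlapping or adjacent exons produce no gap.
fn exon_gap_introns(exons: &[Interval]) -> Vec<Interval> {
    let mut sorted = exons.to_vec();
    sorted.sort_by_key(|e| (e.start, e.end));

    let mut introns = Vec::new();
    let mut prev_end: Option<u64> = None;
    for exon in sorted {
        if let Some(end) = prev_end {
            // A gap exists only if at least one base lies between the exons.
            if exon.start > end + 1 {
                introns.push(Interval { start: end + 1, end: exon.start - 1 });
            }
        }
        // Track the furthest exon end seen so far (handles nested exons).
        prev_end = Some(prev_end.map_or(exon.end, |e| e.max(exon.end)));
    }
    introns
}

fn main() {
    let exons = [
        Interval { start: 100, end: 200 },
        Interval { start: 301, end: 400 },
        Interval { start: 401, end: 500 }, // adjacent to the previous exon: no intron
    ];
    for intron in exon_gap_introns(&exons) {
        println!("intron\t{}\t{}", intron.start, intron.end);
    }
}
```

Because each transcript is processed on its own, introns from overlapping isoforms are not merged or split against each other, which is where results may differ from Bioconductor intronicParts().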

Global options:

-v/--verbose (repeatable) for log verbosity.
-t/--threads to control worker threads.
--log-file <FILE> to mirror stderr logs to a file.

Installation

Install from crates.io:

cargo install genemancer

Or install from the local repository checkout:

cargo install --path .

Build And Run From Source

Install a Rust toolchain with edition 2024 support (Rust 1.85 or newer).
Build:
```
cargo build
```
Show CLI help:
```
cargo run -- --help
```

You can inspect any command with:

cargo run -- <subcommand> --help

Installed Binary Usage

After installing from crates.io or from a local checkout (see Installation), run:

genemancer --help
genemancer <subcommand> --help

If $HOME/.cargo/bin is not on your PATH, use:

~/.cargo/bin/genemancer --help

Usage Examples

Examples below assume you provide your own inputs and use a locally installed binary. In this repository, *.bam and /test_data are gitignored. If you are running from source instead, prefix with cargo run -- (or cargo run --features cuda -- for CUDA-enabled builds).

Merge two BAMs into one BAM with index output:

genemancer merge-bam \
  -i /path/to/input1.bam \
  -i /path/to/input2.bam \
  -o test_data/merged.bam \
  --index

Convert GFF3 to GTF:

genemancer gff-to-gtf \
  -i input.gff3 \
  -o output.gtf

Extract introns from a GTF or gzipped GTF:

genemancer gtf-to-introns /path/to/hg38.ncbiRefSeq.gtf.gz
# writes /path/to/hg38.ncbiRefSeq.introns.gff by default
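
The default output name in the example above suggests a simple rule: strip a trailing .gtf.gz or .gtf and append .introns.gff. The sketch below encodes that reading. The rule is inferred from this single example, and `default_introns_output` is a hypothetical helper, so treat it as an assumption rather than the tool's documented behavior.

```rust
use std::path::{Path, PathBuf};

/// Guess the default output path for an input GTF.
/// Assumed rule (inferred from the README example, not confirmed):
/// drop a trailing ".gtf.gz" or ".gtf", then append ".introns.gff".
fn default_introns_output(input: &Path) -> PathBuf {
    let name = input
        .file_name()
        .map(|n| n.to_string_lossy().into_owned())
        .unwrap_or_default();
    let stem = name
        .strip_suffix(".gtf.gz")
        .or_else(|| name.strip_suffix(".gtf"))
        .unwrap_or(name.as_str());
    // Keep the output next to the input by replacing only the file name.
    input.with_file_name(format!("{stem}.introns.gff"))
}

fn main() {
    let out = default_introns_output(Path::new("/path/to/hg38.ncbiRefSeq.gtf.gz"));
    println!("{}", out.display());
}
```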

Call SNVs on target regions (CPU/streaming path):

genemancer call-targets \
  -i /path/to/bams_or_directory \
  -r /path/to/reference.fa.gz \
  -T /path/to/targets.bed \
  --rg-map references/rg_map.txt \
  -o test_data/out.vcf.gz

Run the GPU-enabled path (falls back to CPU by default):

genemancer call-targets-gpu \
  -i /path/to/bams_or_directory \
  -r /path/to/reference.fa.gz \
  -T /path/to/targets.bed \
  --rg-map references/rg_map.txt \
  --gpu-backend auto \
  -o test_data/out.vcf.gz
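
The fallback rule (use the GPU if it initializes, otherwise fall back to CPU unless --require-gpu is set) can be pictured as a small decision function. This is a sketch only: `Backend` and `select_backend` are hypothetical names, and the GPU probe is reduced to a `Result` value so the logic stays self-contained.

```rust
/// Execution backend chosen for the run (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq)]
enum Backend {
    Gpu,
    Cpu,
}

/// Decide which backend to use given the outcome of GPU initialization.
/// `gpu_init` stands in for whatever probe the real tool performs.
fn select_backend(gpu_init: Result<(), String>, require_gpu: bool) -> Result<Backend, String> {
    match gpu_init {
        Ok(()) => Ok(Backend::Gpu),
        // With --require-gpu, a failed initialization is a hard error.
        Err(reason) if require_gpu => Err(format!("GPU required but unavailable: {reason}")),
        // Otherwise, warn and continue on the CPU path.
        Err(reason) => {
            eprintln!("warning: GPU initialization failed ({reason}); falling back to CPU");
            Ok(Backend::Cpu)
        }
    }
}

fn main() {
    let no_adapter = || Err::<(), String>("no adapter found".to_string());
    println!("{:?}", select_backend(Ok(()), false));
    println!("{:?}", select_backend(no_adapter(), false));
    println!("{:?}", select_backend(no_adapter(), true));
}
```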

Run cnLOH/CNV detect with a single colored CNV plot:

genemancer cnloh detect \
  --sample sample_01 \
  --bam /path/to/sample_01_lane1.bam /path/to/sample_01_lane2.bam \
  --vcf /path/to/sample_01.vcf.gz \
  --vcf-sample sample_01 \
  --data-type WGS \
  --reference /path/to/reference.fa \
  --panel-bin WGS \
  --marker-dir /path/to/SNPmarker \
  --log-output test_data/cnloh/sample_01.cnloh.log \
  --plots true \
  --output test_data/cnloh

Get the SubChrom-compatible SNP marker databases (hg38/hg19):

curl -o SNPmarker_hg38.zip "https://zenodo.org/records/10155688/files/SNPmarker_hg38.zip?download=1" && \
  unzip SNPmarker_hg38.zip && rm SNPmarker_hg38.zip

curl -o SNPmarker_hg19.zip "https://zenodo.org/records/10155688/files/SNPmarker_hg19.zip?download=1" && \
  unzip SNPmarker_hg19.zip && rm SNPmarker_hg19.zip

Dockerfile form:

RUN curl -o SNPmarker_hg38.zip "https://zenodo.org/records/10155688/files/SNPmarker_hg38.zip?download=1" && \
    unzip SNPmarker_hg38.zip && rm SNPmarker_hg38.zip
RUN curl -o SNPmarker_hg19.zip "https://zenodo.org/records/10155688/files/SNPmarker_hg19.zip?download=1" && \
    unzip SNPmarker_hg19.zip && rm SNPmarker_hg19.zip

cnloh detect defaults to canonical chromosomes (chr1-22, chrX, chrY) for output/plot rows. Use --include-noncanonical to keep all contigs.
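
As a rough illustration of the canonical-chromosome default, the check below accepts exactly chr1 through chr22, chrX, and chrY. The function name `is_canonical_contig` is hypothetical, and how the real tool treats names without the chr prefix (for example "1" or "X") is not stated here, so this sketch rejects them.

```rust
/// Return true for chr1..chr22, chrX, and chrY only.
/// Assumption: un-prefixed names such as "1" or "X" are NOT canonical here.
fn is_canonical_contig(name: &str) -> bool {
    let Some(rest) = name.strip_prefix("chr") else {
        return false;
    };
    if rest == "X" || rest == "Y" {
        return true;
    }
    // Autosomes: plain decimal digits, no leading zero, value in 1..=22.
    let digits_only = !rest.is_empty() && rest.bytes().all(|b| b.is_ascii_digit());
    digits_only
        && !rest.starts_with('0')
        && rest.parse::<u8>().map_or(false, |n| (1..=22).contains(&n))
}

fn main() {
    let contigs = ["chr1", "chr22", "chrX", "chrM", "chr1_KI270706v1_random", "chrUn_KI270302v1"];
    for contig in contigs.iter().filter(|c| is_canonical_contig(c)) {
        println!("{contig}");
    }
}
```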

Build/install with CUDA support and force CUDA backend:

cargo install --path . --features cuda --force
genemancer call-targets-gpu \
  -i /path/to/bams_or_directory \
  -r /path/to/reference.fa.gz \
  -T /path/to/targets.bed \
  --rg-map references/rg_map.txt \
  --gpu-backend cuda \
  --cuda-device 0 \
  --require-gpu \
  -o test_data/out.vcf.gz

Tune GPU behavior explicitly (optional overrides on top of auto/hybrid tuning):

genemancer call-targets-gpu \
  -i /path/to/bams_or_directory \
  -r /path/to/reference.fa.gz \
  -T /path/to/targets.bed \
  --rg-map references/rg_map.txt \
  --gpu-backend auto \
  --tuning-mode hybrid \
  --tuning-profile throughput \
  --tuning-scale-percent 120 \
  --wgpu-matrix-utilization-percent 96 \
  --wgpu-upload-utilization-percent 98 \
  --max-obs-upload 64000000 \
  --stream-matrix-budget-mib 1024 \
  --defer-cuda-aggregation \
  -o test_data/out.vcf.gz

Split multiple BAMs by BED regions into an output folder:

genemancer split-bam \
  -i /path/to/input1.bam \
  -i /path/to/input2.bam \
  --bed /path/to/targets.bed \
  --out-dir test_data/splits \
  --output-prefix panel \
  --write-indices \
  --unassigned test_data/splits/unassigned.bam

Run POD5 operations:

# inspect is currently scaffolded; validate and subsample are implemented
genemancer pod5 inspect -i /path/to/reads.pod5
genemancer pod5 validate -i /path/to/reads.pod5
genemancer pod5 subsample \
  --input /path/to/run_a.pod5 \
  /path/to/run_b.pod5 \
  --percent 10 \
  --output /path/to/subsampled_outputs

pod5 subsample accepts both repeated and multi-value --input, so shell glob expansion works: --input *.pod5.

If the POD5 shared library is not auto-detected in your environment, set:

export GENEMANCER_POD5_LIB=/path/to/lib_pod5/pod5_format_pybind*.so

Repository Data

references: helper scripts and a tracked sample RG map (references/rg_map.txt).
tests/data: tracked .bai files only.
Local working datasets are expected under test_data/ (ignored by git).

Ignored Paths

Notes

call-targets may prepare a sorted/indexed BGZF FASTA companion (*.sorted.fa.gz plus indexes) when the provided reference is not already in an indexed form suitable for random access.

cnloh Current State

Variant evidence input precedence is --vcf > --snp > BAM marker-site pileup (with a warning when both --vcf and --snp are provided).
Multiple --bam inputs are aggregated as one sample in v1.
BAM scanning is multithreaded, and the merged outputs are deterministic.
By default, read filtering skips duplicate, secondary, and supplementary alignments (opt-in include flags are available).
Plot generation emits one combined chromosome-colored CNV/cnLOH figure (*.CNV.png) with coverage/CN/cnLOH panels.
Marker filtering is strict in v1; when marker filtering yields zero variants, cnloh detect exits with an error.
Canonical-chromosome filtering is enabled by default; use --include-noncanonical to disable it.
--variant-mode broad and --sample-mode rg are scaffolded for later versions but not implemented in v1.
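
The input precedence listed above (--vcf, then --snp, then BAM marker-site pileup) can be expressed as a single match. This is a sketch under assumptions: `EvidenceSource` and `choose_evidence` are hypothetical names, and the warning text is illustrative rather than the tool's actual message.

```rust
/// Where variant evidence comes from (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq)]
enum EvidenceSource<'a> {
    Vcf(&'a str),
    Snp(&'a str),
    BamPileup,
}

/// Apply the precedence --vcf > --snp > BAM marker-site pileup.
/// Returns the chosen source plus an optional warning for the caller to log.
fn choose_evidence<'a>(
    vcf: Option<&'a str>,
    snp: Option<&'a str>,
) -> (EvidenceSource<'a>, Option<String>) {
    match (vcf, snp) {
        (Some(v), Some(_)) => (
            EvidenceSource::Vcf(v),
            Some("both --vcf and --snp were provided; using --vcf".to_string()),
        ),
        (Some(v), None) => (EvidenceSource::Vcf(v), None),
        (None, Some(s)) => (EvidenceSource::Snp(s), None),
        (None, None) => (EvidenceSource::BamPileup, None),
    }
}

fn main() {
    let (source, warning) = choose_evidence(Some("sample_01.vcf.gz"), Some("sample_01.snp"));
    if let Some(message) = warning {
        eprintln!("warning: {message}");
    }
    println!("{source:?}");
}
```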

TODO

| Area | Task | Status | Notes |
| --- | --- | --- | --- |
| `split-bam` | Add end-to-end fixture coverage for overlap edge cases | TODO | Validate multi-overlap and boundary behavior |
| `call-targets` | Add end-to-end integration tests on small fixture set | TODO | Validate VCF content + index generation |
| `call-targets` | Reject invalid BED intervals (`end <= start`) with a hard error | TODO | Current loader silently skips these rows |
| `call-targets-gpu` | Expand GPU backend validation matrix | TODO | Cover Vulkan/Metal/DX12 fallback behavior |
| `call-targets-gpu` | Honor `--threads` for scan worker fanout | TODO | Current streaming path uses one worker per input BAM |
| `merge-bam` | Add CRAM input/output support | TODO | Current implementation is BAM-focused |
| `merge-bam` | Align `--index-path` docs with implementation or add CSI writing | TODO | CLI/docs say BAI-or-CSI but writer is BAI-only |
| `merge-bam` | Reject zero-length BED intervals (`end == start`) | TODO | Current validation only rejects `end < start` |
| `gtf-to-introns` | Align Rust intron extraction semantics with Bioconductor `intronicParts()` | TODO | Current implementation emits transcript exon-gap introns; needs fixture-level parity checks for overlapping isoforms/shared exons |
| `cnloh` | Chunk 1: Freeze SubChrom parity spec + fixture baselines | DONE | See `docs/cnloh_parity_spec.md`, `docs/cnloh_baseline_fixtures.md`, and `tests/cnloh_parity_baseline.rs` |
| `cnloh` | Chunk 2: Add SubChrom-like marker/VAF preprocessing (`minCOV`, `minMAC`) | DONE | Emits `vaf_preprocessed.tsv` + `marker_chrom_stats.tsv` and summary keys |
| `cnloh` | Chunk 3: Implement VAF/MAF + ROH segmentation parity passes | DONE | Emits `maf.tsv`, `vaf_segments.tsv`, `roh_segments.tsv`, and `vaf_roh_segments.tsv` |
| `cnloh` | Chunk 4: Align coverage segmentation merge rules with VAF/ROH segments | DONE | Emits `coverage_vaf_segments.tsv` and uses unified coverage+VAF marker gating in event filtering |
| `cnloh` | Chunk 5: Align event classification + TF estimation with SubChrom intent | TODO | Include allele-specific event summary fields and tolerance-based tests |
| `cnloh` | Chunk 6: Finalize CNV visualization parity and end-to-end validation | TODO | Snapshot plot metadata and add parity regression tests |
| `cnloh` | Refactor monolithic `cnloh` implementation into smaller modules | TODO | Split `src/cnloh.rs` into focused units (I/O, preprocessing, segmentation, events, plotting) with clear interfaces |
| `cnloh` | Improve inline comments and developer-facing documentation | TODO | Add targeted code comments, module docs, and output-file docs for maintainability |
| `cnloh` | Implement `--variant-mode broad` | TODO | CLI mode is scaffolded but currently hard-fails |
| `cnloh` | Implement RG-aware `--sample-mode rg` workflow | TODO | v1 aggregates all BAMs as one sample |
| `cnloh` | Add fixture-level integration tests for strict marker-overlap behavior | TODO | Validate hard-fail path when marker overlap is zero |
| `cnloh` | Add event-level segmentation/calling outputs (cnLOH/CN event table) | TODO | Current output is coverage/variant summaries + plot |
| `cnloh` | Expand marker database fixtures beyond minimal toy set | TODO | Improves realistic plot coverage in tracked tests |
| Docs | Add example outputs and expected file artifacts per command | TODO | Make quick verification easier for users |
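
Two of the TODO items above concern BED interval validation (`end <= start` for call-targets, `end == start` for merge-bam). One way the stricter check could look is sketched below. BED coordinates are 0-based and half-open, so `end == start` describes a zero-length interval. `BedInterval` and `parse_bed_line` are hypothetical names, and only the first three BED columns are read; this is not the crate's current loader.

```rust
/// Minimal three-column BED record (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq)]
struct BedInterval {
    chrom: String,
    start: u64,
    end: u64,
}

/// Parse one tab-separated BED line, rejecting empty and inverted intervals
/// with an error instead of silently skipping them.
fn parse_bed_line(line: &str, line_no: usize) -> Result<BedInterval, String> {
    let mut fields = line.trim_end().split('\t');
    let chrom = fields
        .next()
        .filter(|c| !c.is_empty())
        .ok_or_else(|| format!("line {line_no}: missing chrom"))?;
    let start: u64 = fields
        .next()
        .ok_or_else(|| format!("line {line_no}: missing start"))?
        .parse()
        .map_err(|e| format!("line {line_no}: invalid start: {e}"))?;
    let end: u64 = fields
        .next()
        .ok_or_else(|| format!("line {line_no}: missing end"))?
        .parse()
        .map_err(|e| format!("line {line_no}: invalid end: {e}"))?;

    // Half-open [start, end): end must be strictly greater than start.
    if end <= start {
        return Err(format!("line {line_no}: invalid interval {chrom}:{start}-{end} (end <= start)"));
    }
    Ok(BedInterval { chrom: chrom.to_string(), start, end })
}

fn main() {
    for (i, line) in ["chr1\t100\t200", "chr1\t300\t300", "chr2\t50\t40"].iter().enumerate() {
        match parse_bed_line(line, i + 1) {
            Ok(interval) => println!("ok: {interval:?}"),
            Err(message) => eprintln!("error: {message}"),
        }
    }
}
```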
