genemancer 0.2.5

Rust CLI toolkit for niche-optimized genomics file processing and target-based variant workflows

Genemancer

Genemancer is a Rust CLI toolkit for genomics file processing, built primarily on the noodles ecosystem, with optional GPU acceleration (wgpu or CUDA) for target-based variant aggregation.

Toolkit

Current subcommands:

  • merge-bam (implemented): merge multiple coordinate-sorted, indexed BAM files into one BAM, with optional BED filtering (all|strict|trim), read-group filtering, output index writing, and configurable compression level.
  • gff-to-gtf (implemented): convert GFF3 annotations to GTF (stdin/stdout supported).
  • gtf-to-introns (implemented): extract transcript intron intervals from GTF annotations and write GFF3 output. Accepts .gtf.gz input, and the default output name follows the reference script (for example, hg38.ncbiRefSeq.gtf.gz produces hg38.ncbiRefSeq.introns.gff). Introns are currently derived as the gaps between a transcript's exons; full parity with Bioconductor intronicParts() is still pending for complex overlapping transcript models.
  • call-targets (implemented): call simple SNVs from BAM inputs over BED target intervals and write bgzipped VCF output (.vcf.gz) with index (csi default, optional tbi).
  • call-targets-gpu (implemented): same pipeline as call-targets, but attempts GPU initialization and falls back to CPU unless --require-gpu is set.
  • split-bam (implemented): split one or more coordinate-sorted BAM files into per-region BAMs from a BED file, with optional unassigned-read output and optional output indexing.
  • pod5 (implemented): namespace for POD5 operations exposed as genemancer pod5 <operation> (validate and subsample are implemented; inspect is scaffolded).
  • vcf (in progress): namespace for VCF comparison workflows; genemancer vcf diff currently validates multisample-vs-multi-file input semantics and set definitions, but record loading and set-difference computation are still scaffolded.
  • cnloh (in progress): SubChrom-inspired CNV/cnLOH detect pipeline with marker-filtered variant evidence, BAM coverage summary, and a single chromosome-colored CNV.png plot.

Global options:

  • -v/--verbose (repeatable) for log verbosity.
  • -t/--threads to control worker threads.
  • --log-file <FILE> to mirror stderr logs to a file.

Installation

Install from crates.io:

cargo install genemancer

Or install from the local repository checkout:

cargo install --path .

Build And Run From Source

  1. Install a Rust toolchain with edition 2024 support (the 2024 edition is stable since Rust 1.85).
  2. Build:
    cargo build
    
  3. Show CLI help:
    cargo run -- --help
    

You can inspect any command with:

cargo run -- <subcommand> --help

Installed Binary Usage

After installing with cargo install genemancer or cargo install --path ., run:

genemancer --help
genemancer <subcommand> --help

If $HOME/.cargo/bin is not on your PATH, use:

~/.cargo/bin/genemancer --help

Usage Examples

Examples below assume you provide your own inputs and use a locally installed binary. In this repository, *.bam and /test_data are gitignored. If you are running from source instead, prefix with cargo run -- (or cargo run --features cuda -- for CUDA-enabled builds).

Merge two BAMs into one BAM with index output:

genemancer merge-bam \
  -i /path/to/input1.bam \
  -i /path/to/input2.bam \
  -o test_data/merged.bam \
  --index
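
Combining coordinate-sorted inputs is commonly done as a k-way merge over a min-heap. The sketch below shows that general technique on plain sort keys; it is not genemancer's actual implementation, and the `Key` alias and `merge_sorted` function are hypothetical names introduced only for illustration.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical sort key: (reference sequence index, 0-based start position).
type Key = (usize, u64);

/// Merge several already-sorted key lists into one sorted list.
/// Ties are broken by input order (lower input index first).
fn merge_sorted(inputs: &[Vec<Key>]) -> Vec<Key> {
    // Heap entries: (key, input index, offset within that input).
    // `Reverse` turns the max-heap into a min-heap.
    let mut heap = BinaryHeap::new();
    for (i, input) in inputs.iter().enumerate() {
        if let Some(&key) = input.first() {
            heap.push(Reverse((key, i, 0usize)));
        }
    }

    let total = inputs.iter().map(Vec::len).sum::<usize>();
    let mut merged = Vec::with_capacity(total);
    while let Some(Reverse((key, i, offset))) = heap.pop() {
        merged.push(key);
        // Refill the heap from the input that was just consumed.
        if let Some(&next) = inputs[i].get(offset + 1) {
            heap.push(Reverse((next, i, offset + 1)));
        }
    }
    merged
}
```

A real BAM merge would also need to place unmapped reads after all mapped reads and reconcile headers, which this sketch leaves out.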

Convert GFF3 to GTF:

genemancer gff-to-gtf \
  -i input.gff3 \
  -o output.gtf

Extract introns from a GTF or gzipped GTF:

genemancer gtf-to-introns /path/to/hg38.ncbiRefSeq.gtf.gz
# writes /path/to/hg38.ncbiRefSeq.introns.gff by default
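
The exon-gap idea behind the current intron output can be illustrated with a short sketch. This is an assumption-laden illustration rather than the tool's code: `Interval` and `exon_gap_introns` are hypothetical names, and coordinates are treated as half-open for simplicity (GTF itself is 1-based and closed).

```rust
/// Half-open interval [start, end); a hypothetical stand-in for the exon
/// records of a single transcript.
type Interval = (u64, u64);

/// Return the uncovered gaps between the exons of one transcript.
fn exon_gap_introns(exons: &[Interval]) -> Vec<Interval> {
    let mut sorted = exons.to_vec();
    sorted.sort_unstable();

    let mut iter = sorted.into_iter();
    // No exons means no introns.
    let Some((_, mut covered_end)) = iter.next() else {
        return Vec::new();
    };

    let mut introns = Vec::new();
    for (start, end) in iter {
        // A gap exists only when the next exon starts past everything
        // covered so far; touching or overlapping exons produce nothing.
        if start > covered_end {
            introns.push((covered_end, start));
        }
        covered_end = covered_end.max(end);
    }
    introns
}
```

For example, exons at (100, 200), (300, 400), and (500, 600) give introns (200, 300) and (400, 500). Merging introns across overlapping isoforms, as intronicParts() does, is a separate step not shown here.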

Call SNVs on target regions (CPU/streaming path):

genemancer call-targets \
  -i /path/to/bams_or_directory \
  -r /path/to/reference.fa.gz \
  -T /path/to/targets.bed \
  --rg-map references/rg_map.txt \
  -o test_data/out.vcf.gz
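
The targets file passed with -T is a BED file, whose intervals are 0-based and half-open. The TODO list notes that rows with end <= start are currently skipped rather than rejected; the following sketch shows what strict validation could look like. `BedInterval` and `parse_bed_line` are hypothetical names, and the sketch assumes tab-separated rows without header lines.

```rust
/// A BED interval: 0-based, half-open [start, end).
#[derive(Debug, PartialEq)]
struct BedInterval {
    chrom: String,
    start: u64,
    end: u64,
}

/// Parse one tab-separated BED row, returning an error for malformed,
/// empty, or inverted intervals instead of skipping them.
fn parse_bed_line(line: &str) -> Result<BedInterval, String> {
    let mut fields = line.split('\t');
    let chrom = fields
        .next()
        .filter(|c| !c.is_empty())
        .ok_or("missing chrom")?;
    let start: u64 = fields
        .next()
        .ok_or("missing start")?
        .trim()
        .parse()
        .map_err(|e| format!("bad start: {e}"))?;
    let end: u64 = fields
        .next()
        .ok_or("missing end")?
        .trim()
        .parse()
        .map_err(|e| format!("bad end: {e}"))?;

    // Half-open coordinates make end == start a zero-length interval.
    if end <= start {
        return Err(format!("empty or inverted interval {chrom}:{start}-{end}"));
    }
    Ok(BedInterval { chrom: chrom.to_string(), start, end })
}
```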

Run the GPU-enabled path (falls back to CPU by default):

genemancer call-targets-gpu \
  -i /path/to/bams_or_directory \
  -r /path/to/reference.fa.gz \
  -T /path/to/targets.bed \
  --rg-map references/rg_map.txt \
  --gpu-backend auto \
  -o test_data/out.vcf.gz
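
The fallback rule (use the GPU when initialization succeeds, otherwise fall back to CPU unless --require-gpu is set) can be written as a small decision function. This is only a sketch of the documented behavior; `Backend` and `select_backend` are hypothetical names, and the real initialization path is not shown.

```rust
/// Which compute path ends up running the aggregation.
#[derive(Debug, PartialEq)]
enum Backend {
    Gpu,
    Cpu,
}

/// Decide the backend from the outcome of a (hypothetical) GPU
/// initialization attempt and the --require-gpu setting.
fn select_backend(gpu_init: Result<(), String>, require_gpu: bool) -> Result<Backend, String> {
    match gpu_init {
        Ok(()) => Ok(Backend::Gpu),
        // With --require-gpu, a failed initialization is a hard error.
        Err(reason) if require_gpu => Err(format!("GPU required but unavailable: {reason}")),
        // Otherwise, report the problem and continue on the CPU path.
        Err(reason) => {
            eprintln!("GPU initialization failed ({reason}); falling back to CPU");
            Ok(Backend::Cpu)
        }
    }
}
```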

Run cnLOH/CNV detect with a single colored CNV plot:

genemancer cnloh detect \
  --sample sample_01 \
  --bam /path/to/sample_01_lane1.bam /path/to/sample_01_lane2.bam \
  --vcf /path/to/sample_01.vcf.gz \
  --vcf-sample sample_01 \
  --data-type WGS \
  --reference /path/to/reference.fa \
  --panel-bin WGS \
  --marker-dir /path/to/SNPmarker \
  --log-output test_data/cnloh/sample_01.cnloh.log \
  --plots true \
  --output test_data/cnloh

Get the SubChrom-compatible SNP marker databases (hg38/hg19):

curl -o SNPmarker_hg38.zip "https://zenodo.org/records/10155688/files/SNPmarker_hg38.zip?download=1" && \
  unzip SNPmarker_hg38.zip && rm SNPmarker_hg38.zip

curl -o SNPmarker_hg19.zip "https://zenodo.org/records/10155688/files/SNPmarker_hg19.zip?download=1" && \
  unzip SNPmarker_hg19.zip && rm SNPmarker_hg19.zip

Dockerfile form:

RUN curl -o SNPmarker_hg38.zip "https://zenodo.org/records/10155688/files/SNPmarker_hg38.zip?download=1" && \
    unzip SNPmarker_hg38.zip && rm SNPmarker_hg38.zip
RUN curl -o SNPmarker_hg19.zip "https://zenodo.org/records/10155688/files/SNPmarker_hg19.zip?download=1" && \
    unzip SNPmarker_hg19.zip && rm SNPmarker_hg19.zip

cnloh detect defaults to canonical chromosomes (chr1-22, chrX, chrY) for output/plot rows. Use --include-noncanonical to keep all contigs.
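
As an illustration of the canonical set described above, a name check could look like the sketch below. `is_canonical` is a hypothetical helper, and it assumes UCSC-style names with a chr prefix; whether genemancer also accepts unprefixed names such as 1 or X is not stated here.

```rust
/// True for chr1..chr22, chrX, and chrY (chr-prefixed names only).
fn is_canonical(contig: &str) -> bool {
    match contig.strip_prefix("chr") {
        Some("X") | Some("Y") => true,
        Some(rest) => {
            // Require plain digits without a leading zero, so that forms
            // like "chr01" or "chr+1" are rejected before parsing.
            !rest.starts_with('0')
                && rest.bytes().all(|b| b.is_ascii_digit())
                && matches!(rest.parse::<u8>(), Ok(1..=22))
        }
        None => false,
    }
}
```

Under this sketch, chrM and alternate or random contigs such as chr1_KI270706v1_random are treated as non-canonical.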

Build/install with CUDA support and force CUDA backend:

cargo install --path . --features cuda --force
genemancer call-targets-gpu \
  -i /path/to/bams_or_directory \
  -r /path/to/reference.fa.gz \
  -T /path/to/targets.bed \
  --rg-map references/rg_map.txt \
  --gpu-backend cuda \
  --cuda-device 0 \
  --require-gpu \
  -o test_data/out.vcf.gz

Tune GPU behavior explicitly (optional overrides on top of auto/hybrid tuning):

genemancer call-targets-gpu \
  -i /path/to/bams_or_directory \
  -r /path/to/reference.fa.gz \
  -T /path/to/targets.bed \
  --rg-map references/rg_map.txt \
  --gpu-backend auto \
  --tuning-mode hybrid \
  --tuning-profile throughput \
  --tuning-scale-percent 120 \
  --wgpu-matrix-utilization-percent 96 \
  --wgpu-upload-utilization-percent 98 \
  --max-obs-upload 64000000 \
  --stream-matrix-budget-mib 1024 \
  --defer-cuda-aggregation \
  -o test_data/out.vcf.gz

Split multiple BAMs by BED regions into an output folder:

genemancer split-bam \
  -i /path/to/input1.bam \
  -i /path/to/input2.bam \
  --bed /path/to/targets.bed \
  --out-dir test_data/splits \
  --output-prefix panel \
  --write-indices \
  --unassigned test_data/splits/unassigned.bam
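
Conceptually, splitting by region means testing each read's span against the BED intervals and routing reads with no match to the unassigned output. The sketch below shows only that overlap test; `Region` and `overlapping_regions` are hypothetical names, and how split-bam actually handles reads that overlap several regions is not specified here (the TODO list marks multi-overlap behavior as still needing validation).

```rust
/// Half-open target region [start, end) on a named contig (hypothetical type).
struct Region {
    contig: String,
    start: u64,
    end: u64,
}

/// Indices of all regions that overlap the read span [read_start, read_end).
/// An empty result means the read would be "unassigned".
fn overlapping_regions(
    regions: &[Region],
    contig: &str,
    read_start: u64,
    read_end: u64,
) -> Vec<usize> {
    regions
        .iter()
        .enumerate()
        // Two half-open intervals overlap when each starts before the other ends.
        .filter(|(_, r)| r.contig == contig && read_start < r.end && r.start < read_end)
        .map(|(i, _)| i)
        .collect()
}
```

The linear scan keeps the example short; a sorted sweep or interval tree would be the usual choice for large target panels.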

Run POD5 operations:

# inspect is currently scaffolded; validate and subsample are implemented
genemancer pod5 inspect -i /path/to/reads.pod5
genemancer pod5 validate -i /path/to/reads.pod5
genemancer pod5 subsample \
  --input /path/to/run_a.pod5 \
  /path/to/run_b.pod5 \
  --percent 10 \
  --output /path/to/subsampled_outputs

pod5 subsample accepts both repeated and multi-value --input, so shell glob expansion works: --input *.pod5.

If the POD5 shared library is not auto-detected in your environment, set:

export GENEMANCER_POD5_LIB=/path/to/lib_pod5/pod5_format_pybind*.so

Repository Data

  • references: helper scripts and a tracked sample RG map (references/rg_map.txt).
  • tests/data: tracked .bai files only.
  • Local working datasets are expected under test_data/ (ignored by git).

Ignored Paths

  • *.bam
  • /test_data

Notes

  • call-targets may prepare a sorted/indexed BGZF FASTA companion (*.sorted.fa.gz plus indexes) when the provided reference is not already in an indexed form suitable for random access.

cnloh Current State

  • Variant evidence input precedence is --vcf > --snp > BAM marker-site pileup (with a warning when both --vcf and --snp are provided).
  • Multiple --bam inputs are aggregated as one sample in v1.
  • BAM scanning is multithreaded, and the merged outputs are deterministic.
  • By default, read filtering skips duplicate, secondary, and supplementary alignments (opt-in include flags are available).
  • Plot generation emits one combined chromosome-colored CNV/cnLOH figure (*.CNV.png) with coverage/CN/cnLOH panels.
  • Marker filtering is strict in v1; when marker filtering yields zero variants, cnloh detect exits with an error.
  • Canonical-chromosome filtering is enabled by default; use --include-noncanonical to disable it.
  • --variant-mode broad and --sample-mode rg are scaffolded for later versions but not implemented in v1.
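
The evidence precedence in the first bullet above (--vcf, then --snp, then BAM marker-site pileup) can be sketched as a single match. `Evidence` and `choose_evidence` are hypothetical names used for illustration; the boolean in the result stands in for the warning emitted when both --vcf and --snp are given.

```rust
/// Where variant evidence comes from (hypothetical type).
#[derive(Debug, PartialEq)]
enum Evidence<'a> {
    Vcf(&'a str),
    Snp(&'a str),
    BamPileup,
}

/// Pick the evidence source by precedence: --vcf > --snp > BAM pileup.
/// The second tuple field is true when a "both provided" warning applies.
fn choose_evidence<'a>(vcf: Option<&'a str>, snp: Option<&'a str>) -> (Evidence<'a>, bool) {
    match (vcf, snp) {
        (Some(v), Some(_)) => (Evidence::Vcf(v), true),
        (Some(v), None) => (Evidence::Vcf(v), false),
        (None, Some(s)) => (Evidence::Snp(s), false),
        (None, None) => (Evidence::BamPileup, false),
    }
}
```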

TODO

| Area | Task | Status | Notes |
|---|---|---|---|
| split-bam | Add end-to-end fixture coverage for overlap edge cases | TODO | Validate multi-overlap and boundary behavior |
| call-targets | Add end-to-end integration tests on small fixture set | TODO | Validate VCF content + index generation |
| call-targets | Reject invalid BED intervals (end <= start) with a hard error | TODO | Current loader silently skips these rows |
| call-targets-gpu | Expand GPU backend validation matrix | TODO | Cover Vulkan/Metal/DX12 fallback behavior |
| call-targets-gpu | Honor --threads for scan worker fanout | TODO | Current streaming path uses one worker per input BAM |
| merge-bam | Add CRAM input/output support | TODO | Current implementation is BAM-focused |
| merge-bam | Align --index-path docs with implementation or add CSI writing | TODO | CLI/docs say BAI-or-CSI but writer is BAI-only |
| merge-bam | Reject zero-length BED intervals (end == start) | TODO | Current validation only rejects end < start |
| gtf-to-introns | Align Rust intron extraction semantics with Bioconductor intronicParts() | TODO | Current implementation emits transcript exon-gap introns; needs fixture-level parity checks for overlapping isoforms/shared exons |
| cnloh | Chunk 1: Freeze SubChrom parity spec + fixture baselines | DONE | See docs/cnloh_parity_spec.md, docs/cnloh_baseline_fixtures.md, and tests/cnloh_parity_baseline.rs |
| cnloh | Chunk 2: Add SubChrom-like marker/VAF preprocessing (minCOV, minMAC) | DONE | Emits vaf_preprocessed.tsv + marker_chrom_stats.tsv and summary keys |
| cnloh | Chunk 3: Implement VAF/MAF + ROH segmentation parity passes | DONE | Emits maf.tsv, vaf_segments.tsv, roh_segments.tsv, and vaf_roh_segments.tsv |
| cnloh | Chunk 4: Align coverage segmentation merge rules with VAF/ROH segments | DONE | Emits coverage_vaf_segments.tsv and uses unified coverage+VAF marker gating in event filtering |
| cnloh | Chunk 5: Align event classification + TF estimation with SubChrom intent | TODO | Include allele-specific event summary fields and tolerance-based tests |
| cnloh | Chunk 6: Finalize CNV visualization parity and end-to-end validation | TODO | Snapshot plot metadata and add parity regression tests |
| cnloh | Refactor monolithic cnloh implementation into smaller modules | TODO | Split src/cnloh.rs into focused units (I/O, preprocessing, segmentation, events, plotting) with clear interfaces |
| cnloh | Improve inline comments and developer-facing documentation | TODO | Add targeted code comments, module docs, and output-file docs for maintainability |
| cnloh | Implement --variant-mode broad | TODO | CLI mode is scaffolded but currently hard-fails |
| cnloh | Implement RG-aware --sample-mode rg workflow | TODO | v1 aggregates all BAMs as one sample |
| cnloh | Add fixture-level integration tests for strict marker-overlap behavior | TODO | Validate hard-fail path when marker overlap is zero |
| cnloh | Add event-level segmentation/calling outputs (cnLOH/CN event table) | TODO | Current output is coverage/variant summaries + plot |
| cnloh | Expand marker database fixtures beyond minimal toy set | TODO | Improves realistic plot coverage in tracked tests |
| Docs | Add example outputs and expected file artifacts per command | TODO | Make quick verification easier for users |