
Genemancer
Genemancer is a Rust CLI toolkit for genomics file processing, built primarily on the noodles ecosystem, with optional GPU acceleration (wgpu or CUDA) for target-based variant aggregation.
Toolkit
Current subcommands:
- merge-bam (implemented): merge multiple coordinate-sorted, indexed BAM files into one BAM, with optional BED filtering (all|strict|trim), read-group filtering, output index writing, and configurable compression level.
- gff-to-gtf (implemented): convert GFF3 annotations to GTF (stdin/stdout supported).
- gtf-to-introns (implemented): extract transcript intron intervals from GTF annotations and write GFF3 output, with .gtf.gz input support and reference-script-style default output naming. Current behavior derives transcript exon-gap introns; full Bioconductor intronicParts() parity is still pending for complex overlapping transcript models.
- call-targets (implemented): call simple SNVs from BAM inputs over BED target intervals and write bgzipped VCF output (.vcf.gz) with index (csi default, optional tbi).
- call-targets-gpu (implemented): same pipeline as call-targets, but attempts GPU initialization and falls back to CPU unless --require-gpu is set.
- split-bam (implemented): split one or more coordinate-sorted BAM files into per-region BAMs from a BED file, with optional unassigned-read output and optional output indexing.
- pod5 (implemented): namespace for POD5 operations exposed as genemancer pod5 <operation> (validate and subsample are implemented; inspect is scaffolded).
- vcf (in progress): namespace for VCF comparison workflows; genemancer vcf diff currently validates multisample-vs-multi-file input semantics and set definitions, but record loading and set-difference computation are still scaffolded.
- cnloh (in progress): SubChrom-inspired CNV/cnLOH detect pipeline with marker-filtered variant evidence, BAM coverage summary, and a single chromosome-colored CNV.png plot.
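The exon-gap behavior described for gtf-to-introns can be pictured with a small sketch. This is an illustration only, not the project's code: the Interval type and the exon_gap_introns function are hypothetical names, and coordinates are assumed to be 1-based and inclusive, as in GTF/GFF3.

```rust
/// A 1-based, inclusive genomic interval, as used by GTF/GFF3.
/// Hypothetical type for illustration; not genemancer's actual API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Interval {
    pub start: u64,
    pub end: u64,
}

/// Derive introns as the gaps between consecutive exons of ONE transcript.
/// Exons may be given in any order; overlapping or adjacent exons yield no gap.
pub fn exon_gap_introns(exons: &[Interval]) -> Vec<Interval> {
    let mut sorted = exons.to_vec();
    sorted.sort_by_key(|e| (e.start, e.end));

    let mut introns = Vec::new();
    // Furthest exon end seen so far; None until the first exon is visited.
    let mut covered_end: Option<u64> = None;

    for exon in sorted {
        match covered_end {
            Some(end) => {
                // A gap exists only if the next exon starts past end + 1.
                if exon.start > end + 1 {
                    introns.push(Interval {
                        start: end + 1,
                        end: exon.start - 1,
                    });
                }
                covered_end = Some(end.max(exon.end));
            }
            None => covered_end = Some(exon.end),
        }
    }
    introns
}
```

Because this works one transcript at a time, introns from different isoforms of the same gene are reported independently. That is the main way an exon-gap approach can differ from a gene-level view of intronic regions when transcript models overlap.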
Global options:
- -v/--verbose (repeatable) for log verbosity.
- -t/--threads to control worker threads.
- --log-file <FILE> to mirror stderr logs to a file.
Installation
Install from crates.io with cargo install genemancer. To install from a local repository checkout instead, run cargo install --path . from the repository root.
Build And Run From Source
- Install a Rust toolchain with edition 2024 support (Rust 1.85 or newer).
- Build:
- Show CLI help:
You can inspect any command with:
Installed Binary Usage
After installing with cargo install genemancer or cargo install --path ., run:
If $HOME/.cargo/bin is not on your PATH, use:
Usage Examples
Examples below assume you provide your own inputs and use a locally installed binary. In this repository, *.bam and /test_data are gitignored.
If you are running from source instead, prefix with cargo run -- (or cargo run --features cuda -- for CUDA-enabled builds).
Merge two BAMs into one BAM with index output:
Convert GFF3 to GTF:
Extract introns from a GTF or gzipped GTF:
# writes /path/to/hg38.ncbiRefSeq.introns.gff by default
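The default output name shown above suggests a simple rule: drop a trailing .gz, drop a trailing .gtf, then append .introns.gff next to the input. The sketch below assumes that rule and assumes an input such as /path/to/hg38.ncbiRefSeq.gtf.gz; default_introns_output is a hypothetical helper, not the actual implementation.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: guess the default output path for gtf-to-introns.
/// Assumed rule (not confirmed by the source): strip ".gz", then ".gtf",
/// then append ".introns.gff" in the same directory as the input.
pub fn default_introns_output(input: &Path) -> PathBuf {
    let name = input.file_name().and_then(|n| n.to_str()).unwrap_or("");
    // Remove the compression suffix first, if present.
    let stem = name.strip_suffix(".gz").unwrap_or(name);
    // Then remove the annotation suffix, if present.
    let stem = stem.strip_suffix(".gtf").unwrap_or(stem);
    input.with_file_name(format!("{stem}.introns.gff"))
}
```

Under this assumed rule, plain .gtf and gzipped .gtf.gz inputs map to the same output name.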
Call SNVs on target regions (CPU/streaming path):
Run the GPU-enabled path (falls back to CPU by default):
Run cnLOH/CNV detect with a single colored CNV plot:
Get the SubChrom-compatible SNP marker databases (hg38/hg19):
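curl -o SNPmarker_hg38.zip "https://zenodo.org/records/10155688/files/SNPmarker_hg38.zip?download=1" && \
unzip SNPmarker_hg38.zip && rm SNPmarker_hg38.zip
curl -o SNPmarker_hg19.zip "https://zenodo.org/records/10155688/files/SNPmarker_hg19.zip?download=1" && \
unzip SNPmarker_hg19.zip && rm SNPmarker_hg19.zip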
Dockerfile form:
RUN curl -o SNPmarker_hg38.zip https://zenodo.org/records/10155688/files/SNPmarker_hg38.zip?download=1 && \
unzip SNPmarker_hg38.zip && rm SNPmarker_hg38.zip
RUN curl -o SNPmarker_hg19.zip https://zenodo.org/records/10155688/files/SNPmarker_hg19.zip?download=1 && \
unzip SNPmarker_hg19.zip && rm SNPmarker_hg19.zip
cnloh detect defaults to canonical chromosomes (chr1-22, chrX, chrY) for output/plot rows.
Use --include-noncanonical to keep all contigs.
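As a rough picture of the canonical-chromosome default, the sketch below accepts only chr1 through chr22, chrX, and chrY. It is an assumption-laden illustration: is_canonical and keep_contig are hypothetical names, and whether the real filter also accepts unprefixed names such as 1 or X is not stated here, so this version rejects them.

```rust
/// Hypothetical check: is this contig one of chr1-22, chrX, chrY?
/// Assumes UCSC-style "chr" prefixes; unprefixed names are rejected.
pub fn is_canonical(contig: &str) -> bool {
    let Some(rest) = contig.strip_prefix("chr") else {
        return false;
    };
    match rest {
        "X" | "Y" => true,
        _ => {
            // Require plain digits without a leading zero, in the range 1..=22.
            !rest.starts_with('0')
                && rest.bytes().all(|b| b.is_ascii_digit())
                && matches!(rest.parse::<u8>(), Ok(1..=22))
        }
    }
}

/// Hypothetical filter mirroring the effect of --include-noncanonical:
/// when the flag is set, every contig is kept.
pub fn keep_contig(contig: &str, include_noncanonical: bool) -> bool {
    include_noncanonical || is_canonical(contig)
}
```

With this rule, contigs such as chrM or chrUn_KI270302v1 are dropped from output and plot rows unless the include flag is set.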
Build/install with CUDA support and force CUDA backend:
Tune GPU behavior explicitly (optional overrides on top of auto/hybrid tuning):
Split multiple BAMs by BED regions into an output folder:
Run POD5 operations:
pod5 subsample accepts both repeated and multi-value --input, so shell glob expansion works: --input *.pod5.
If the POD5 shared library is not auto-detected in your environment, set:
Repository Data
- references: helper scripts and a tracked sample RG map (references/rg_map.txt).
- tests/data: tracked .bai files only.
- Local working datasets are expected under test_data/ (ignored by git).
Ignored Paths
Notes
call-targets may prepare a sorted/indexed BGZF FASTA companion (*.sorted.fa.gz plus indexes) when the provided reference is not already in an indexed form suitable for random access.
cnloh Current State
- Variant evidence input precedence is --vcf > --snp > BAM marker-site pileup (with a warning when both --vcf and --snp are provided).
- Multiple --bam inputs are aggregated as one sample in v1.
- BAM scanning is multithreaded and deterministic in merged outputs.
- Read filtering defaults skip duplicate, secondary, and supplementary alignments (opt-in include flags are available).
- Plot generation emits one combined chromosome-colored CNV/cnLOH figure (*.CNV.png) with coverage/CN/cnLOH panels.
- Marker filtering is strict in v1; when marker filtering yields zero variants, cnloh detect exits with an error.
- Canonical-chromosome filtering is enabled by default; use --include-noncanonical to disable it.
- --variant-mode broad and --sample-mode rg are scaffolded for later versions but not implemented in v1.
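The default read filtering can be expressed as a mask over SAM flag bits. The bit values below come from the SAM specification (0x100 secondary, 0x400 duplicate, 0x800 supplementary); the ReadFilter type and its field names are hypothetical and only stand in for whatever opt-in include flags the CLI exposes.

```rust
// SAM flag bits as defined by the SAM specification.
const FLAG_SECONDARY: u16 = 0x100;
const FLAG_DUPLICATE: u16 = 0x400;
const FLAG_SUPPLEMENTARY: u16 = 0x800;

/// Hypothetical filter settings; all default to false, so the default
/// behavior skips duplicate, secondary, and supplementary alignments.
#[derive(Debug, Clone, Copy, Default)]
pub struct ReadFilter {
    pub include_duplicates: bool,
    pub include_secondary: bool,
    pub include_supplementary: bool,
}

impl ReadFilter {
    /// Build the mask of flag bits that cause a read to be skipped.
    fn skip_mask(&self) -> u16 {
        let mut mask = 0;
        if !self.include_duplicates {
            mask |= FLAG_DUPLICATE;
        }
        if !self.include_secondary {
            mask |= FLAG_SECONDARY;
        }
        if !self.include_supplementary {
            mask |= FLAG_SUPPLEMENTARY;
        }
        mask
    }

    /// Keep a read only if none of the skipped bits are set in its flags.
    pub fn keep(&self, flags: u16) -> bool {
        (flags & self.skip_mask()) == 0
    }
}
```

Computing a single mask keeps the per-read check to one bitwise AND, which matters when scanning whole BAM files.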
TODO
| Area | Task | Status | Notes |
|---|---|---|---|
| split-bam | Add end-to-end fixture coverage for overlap edge cases | TODO | Validate multi-overlap and boundary behavior |
| call-targets | Add end-to-end integration tests on small fixture set | TODO | Validate VCF content + index generation |
| call-targets | Reject invalid BED intervals (end <= start) with a hard error | TODO | Current loader silently skips these rows |
| call-targets-gpu | Expand GPU backend validation matrix | TODO | Cover Vulkan/Metal/DX12 fallback behavior |
| call-targets-gpu | Honor --threads for scan worker fanout | TODO | Current streaming path uses one worker per input BAM |
| merge-bam | Add CRAM input/output support | TODO | Current implementation is BAM-focused |
| merge-bam | Align --index-path docs with implementation or add CSI writing | TODO | CLI/docs say BAI-or-CSI but writer is BAI-only |
| merge-bam | Reject zero-length BED intervals (end == start) | TODO | Current validation only rejects end < start |
| gtf-to-introns | Align Rust intron extraction semantics with Bioconductor intronicParts() | TODO | Current implementation emits transcript exon-gap introns; needs fixture-level parity checks for overlapping isoforms/shared exons |
| cnloh | Chunk 1: Freeze SubChrom parity spec + fixture baselines | DONE | See docs/cnloh_parity_spec.md, docs/cnloh_baseline_fixtures.md, and tests/cnloh_parity_baseline.rs |
| cnloh | Chunk 2: Add SubChrom-like marker/VAF preprocessing (minCOV, minMAC) | DONE | Emits vaf_preprocessed.tsv + marker_chrom_stats.tsv and summary keys |
| cnloh | Chunk 3: Implement VAF/MAF + ROH segmentation parity passes | DONE | Emits maf.tsv, vaf_segments.tsv, roh_segments.tsv, and vaf_roh_segments.tsv |
| cnloh | Chunk 4: Align coverage segmentation merge rules with VAF/ROH segments | DONE | Emits coverage_vaf_segments.tsv and uses unified coverage+VAF marker gating in event filtering |
| cnloh | Chunk 5: Align event classification + TF estimation with SubChrom intent | TODO | Include allele-specific event summary fields and tolerance-based tests |
| cnloh | Chunk 6: Finalize CNV visualization parity and end-to-end validation | TODO | Snapshot plot metadata and add parity regression tests |
| cnloh | Refactor monolithic cnloh implementation into smaller modules | TODO | Split src/cnloh.rs into focused units (I/O, preprocessing, segmentation, events, plotting) with clear interfaces |
| cnloh | Improve inline comments and developer-facing documentation | TODO | Add targeted code comments, module docs, and output-file docs for maintainability |
| cnloh | Implement --variant-mode broad | TODO | CLI mode is scaffolded but currently hard-fails |
| cnloh | Implement RG-aware --sample-mode rg workflow | TODO | v1 aggregates all BAMs as one sample |
| cnloh | Add fixture-level integration tests for strict marker-overlap behavior | TODO | Validate hard-fail path when marker overlap is zero |
| cnloh | Add event-level segmentation/calling outputs (cnLOH/CN event table) | TODO | Current output is coverage/variant summaries + plot |
| cnloh | Expand marker database fixtures beyond minimal toy set | TODO | Improves realistic plot coverage in tracked tests |
| Docs | Add example outputs and expected file artifacts per command | TODO | Make quick verification easier for users |
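
Two of the TODO rows above concern BED interval validation (rejecting end <= start instead of silently skipping the row). A minimal sketch of what a hard-error check might look like follows. BedInterval, BedError, and parse_bed_line are hypothetical names, not the current loader; the sketch reads only the first three whitespace-separated fields and does not handle track, browser, or comment lines.

```rust
/// Hypothetical parsed BED record (0-based start, half-open end).
#[derive(Debug, PartialEq, Eq)]
pub struct BedInterval {
    pub chrom: String,
    pub start: u64,
    pub end: u64,
}

/// Hypothetical error type; each variant carries the 1-based line number.
#[derive(Debug, PartialEq, Eq)]
pub enum BedError {
    TooFewFields { line: usize },
    BadCoordinate { line: usize },
    EmptyOrInverted { line: usize, start: u64, end: u64 },
}

/// Parse one BED line and reject empty (end == start) or inverted
/// (end < start) intervals with an error rather than skipping them.
pub fn parse_bed_line(line_no: usize, line: &str) -> Result<BedInterval, BedError> {
    let mut fields = line.split_whitespace();
    let (Some(chrom), Some(start), Some(end)) = (fields.next(), fields.next(), fields.next())
    else {
        return Err(BedError::TooFewFields { line: line_no });
    };

    let start: u64 = start
        .parse()
        .map_err(|_| BedError::BadCoordinate { line: line_no })?;
    let end: u64 = end
        .parse()
        .map_err(|_| BedError::BadCoordinate { line: line_no })?;

    // BED intervals are half-open, so end == start covers zero bases.
    if end <= start {
        return Err(BedError::EmptyOrInverted { line: line_no, start, end });
    }

    Ok(BedInterval { chrom: chrom.to_string(), start, end })
}
```

Returning a typed error with the line number lets the caller decide whether to abort or report, and makes the zero-length and inverted cases easy to cover with fixture tests.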