varforge 0.1.0

Synthetic cancer sequencing test data generator
# VarForge

[![CI](https://github.com/varforge/varforge/actions/workflows/ci.yml/badge.svg)](https://github.com/varforge/varforge/actions)
[![Crates.io](https://img.shields.io/crates/v/varforge.svg)](https://crates.io/crates/varforge)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

VarForge is a fast, single-binary Rust tool for generating synthetic cancer sequencing test data with controlled ground truth. It produces realistic FASTQ and BAM files with known mutations, tumour parameters, UMI barcodes, structural variants, and cfDNA fragment profiles for benchmarking bioinformatics pipelines.

---

## Features

| Feature | VarForge | BAMSurgeon | ART | NEAT |
|---------|----------|------------|-----|------|
| Single binary, no runtime deps | Yes | No | No | No |
| Somatic mutations (SNV/indel/MNV) | Yes | Yes | No | Partial |
| Structural variants (DEL/INS/INV/DUP/TRA) | Yes | No | No | No |
| SV signatures (HRD, TDP, chromothripsis) | Yes | No | No | No |
| COSMIC SBS signature weighting | Yes | No | No | No |
| Tumour purity / clonal architecture | Yes | No | No | No |
| Paired tumour-normal simulation | Yes | Partial | No | No |
| Germline variant simulation | Yes | No | No | Yes |
| cfDNA fragment model | Yes | No | No | No |
| Long-read fragment model | Yes | No | No | No |
| Duplex UMI barcodes | Yes | No | No | No |
| FFPE / oxoG artefacts | Yes | No | No | No |
| Longitudinal / multi-sample series | Yes | No | No | No |
| Copy number alterations | Yes | No | No | No |
| GC bias model | Yes | No | Partial | No |
| Hybrid-capture / amplicon model | Yes | No | No | No |
| Microsatellite instability (MSI) | Yes | No | No | No |
| Truth VCF output | Yes | Partial | No | Yes |
| YAML configuration | Yes | No | No | No |

---

## Installation

### From crates.io

```
cargo install varforge
```

### From source

```
git clone https://github.com/varforge/varforge
cd varforge
cargo build --release
./target/release/varforge --help
```

Requires Rust 1.74 or later. No C libraries are needed. The entire dependency stack is pure Rust.

---

## Quickstart

The following example generates a 30x WGS tumour sample with 5000 random somatic mutations.

**1. Create a config file (`quickstart.yaml`):**

```yaml
reference: /data/ref/hg38.fa

output:
  directory: out/quickstart
  fastq: true
  truth_vcf: true

sample:
  name: TUMOUR
  coverage: 30.0

tumour:
  purity: 0.70

mutations:
  random:
    count: 5000
    vaf_min: 0.05
    vaf_max: 0.60
    snv_fraction: 0.80
    indel_fraction: 0.15
    mnv_fraction: 0.05

seed: 42
```

**2. Validate the config:**

```
varforge validate --config quickstart.yaml
```

**3. Run the simulation:**

```
varforge simulate --config quickstart.yaml
```

**4. Inspect the output:**

```
out/quickstart/
  TUMOUR_R1.fastq.gz       # Read 1 FASTQ
  TUMOUR_R2.fastq.gz       # Read 2 FASTQ
  truth.vcf.gz             # Ground-truth VCF with all injected variants
  manifest.tsv             # Sample metadata (name, coverage, purity, paths)
```

**CLI overrides** allow the most common config values (coverage, purity, seed, read length, fragment size, output directory) to be changed at the command line:

```
varforge simulate --config quickstart.yaml \
    --coverage 60 --purity 0.5 --seed 99
```

**Variable substitution** lets configs use placeholders resolved at runtime:

```yaml
reference: ${reference}
```

```
varforge simulate --config quickstart.yaml --set reference=/data/ref/hg38.fa
```
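
To make the placeholder behaviour concrete, the sketch below mimics `${key}` substitution with Python's standard `string.Template`. It is an illustration only and not VarForge's implementation: the `resolve_placeholders` helper is hypothetical, and raising an error on an undefined placeholder is an assumption of this sketch.

```python
from string import Template


def resolve_placeholders(yaml_text: str, variables: dict[str, str]) -> str:
    """Replace ${key} placeholders; raises KeyError if a key is undefined (assumed behaviour)."""
    return Template(yaml_text).substitute(variables)


# Equivalent in spirit to: --set reference=/data/ref/hg38.fa
config_text = "reference: ${reference}\nseed: 42\n"
resolved = resolve_placeholders(config_text, {"reference": "/data/ref/hg38.fa"})
print(resolved)
```

Keeping machine-specific paths out of the YAML this way lets the same config file be shared between environments.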

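**Presets** apply a named bundle of defaults for common scenarios. `--config` is still required, and any field set explicitly in the YAML takes precedence over the preset: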

```
varforge simulate --config quickstart.yaml --preset wgs
varforge simulate --config quickstart.yaml --preset cancer:melanoma
```

---

## CLI Reference

```
varforge [OPTIONS] <COMMAND>

Commands:
  simulate         Run a simulation from a YAML config
  validate         Validate a YAML config without running
  edit             Spike variants into an existing BAM file
  learn-profile    Learn an error/quality profile from a real BAM file
  benchmark-suite  Run a VAF x coverage benchmark grid

Global options:
  -t, --threads <N>          Number of threads (default: all available cores)
      --log-level <LEVEL>    error | warn | info | debug | trace (default: info)
```

### `simulate`

```
varforge simulate --config <FILE> [OPTIONS]

Options:
  -c, --config <FILE>           Path to YAML configuration file (required)
  -o, --output-dir <DIR>        Override output directory
      --seed <N>                Override random seed
      --coverage <F>            Override coverage depth (x)
      --read-length <N>         Override read length (bp)
      --purity <F>              Override tumour purity (0.0-1.0)
      --fragment-mean <F>       Override fragment mean length (bp)
      --fragment-sd <F>         Override fragment length standard deviation (bp)
      --random-mutations <N>    Generate N random mutations (no VCF needed)
      --vaf-range <MIN-MAX>     VAF range for random mutations (e.g. 0.001-0.05)
      --preset <NAME>           Apply a named preset (see Presets section)
      --set <KEY=VALUE>         Set config variables; replaces ${key} in YAML (repeatable)
      --list-presets            List all available presets and exit
      --dry-run                 Validate config and estimate output size only
```

### `validate`

```
varforge validate --config <FILE>
```

Parses the YAML config and checks all fields for consistency. Exits with status 0 if the config is valid; otherwise exits with a non-zero status and prints a descriptive error message.

### `edit`

```
varforge edit --bam <IN.bam> --vcf <VARIANTS.vcf> --output <OUT.bam>
```

Spikes variants from a VCF directly into an existing BAM file without re-simulating reads. Useful for adding a handful of known mutations to a real or previously simulated dataset.

### `learn-profile`

```
varforge learn-profile --bam <BAM> --output <PROFILE.json>
```

Learns an empirical base-quality and error profile from a real BAM file. The resulting JSON can be referenced from the `quality.profile_path` config field to produce reads with a realistic, data-driven quality model instead of the parametric default.

### `benchmark-suite`

```
varforge benchmark-suite --config <FILE> --vafs 0.01,0.05,0.1 --coverages 100,500,1000
```

Runs a grid of simulations across specified VAF and coverage values. Each combination produces its own output directory. Useful for generating sensitivity curves and limit-of-detection analyses.
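
As a rough sketch of what the grid looks like, the snippet below enumerates the VAF x coverage combinations from the command above. The directory names it prints are hypothetical; the actual naming scheme used by `benchmark-suite` is not specified here.

```python
from itertools import product

vafs = [0.01, 0.05, 0.1]
coverages = [100, 500, 1000]

# One simulation per (VAF, coverage) pair: 3 x 3 = 9 runs.
grid = list(product(vafs, coverages))

for vaf, coverage in grid:
    # Hypothetical directory naming, for illustration only.
    print(f"vaf_{vaf}_cov_{coverage}")
```

The number of runs grows multiplicatively with each list, so it is worth checking the grid size before launching high-coverage combinations.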

---

## Configuration Reference

All simulation parameters are specified in a YAML file. Only `reference` and `output.directory` are required. Everything else has a sensible default.

### Top-level fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `reference` | path | (required) | Path to FASTA reference genome |
| `output` | OutputConfig | (required) | Output format and directory |
| `sample` | SampleConfig | see below | Read generation parameters |
| `fragment` | FragmentConfig | see below | Insert size distribution |
| `quality` | QualityConfig | see below | Base quality model |
| `tumour` | TumourConfig | null | Tumour purity and clonal architecture |
| `mutations` | MutationConfig | null | Somatic mutation injection |
| `umi` | UmiConfig | null | UMI barcode configuration |
| `artifacts` | ArtifactConfig | null | Sequencing artefact simulation |
| `copy_number` | list of CopyNumberConfig | null | Copy number alterations |
| `gc_bias` | GcBiasConfig | null | GC content coverage bias |
| `capture` | CaptureConfig | null | Hybrid-capture or amplicon enrichment model |
| `germline` | GermlineConfig | null | Germline SNP/indel simulation |
| `paired` | PairedConfig | null | Matched tumour-normal pair mode |
| `samples` | list of SampleEntry | null | Multi-sample / longitudinal series |
| `chromosomes` | list of strings | null (all) | Restrict simulation to named chromosomes |
| `regions_bed` | path | null | Restrict simulation to BED file regions |
| `vafs` | list of floats | null | Batch mode: one run per VAF value |
| `preset` | string | null | Chemistry preset name (applied before YAML values) |
| `performance` | PerformanceConfig | see below | Streaming pipeline tuning |
| `seed` | integer | null (random) | Random seed for reproducibility |
| `threads` | integer | null (all cores) | Worker thread count |

---

### `output`

```yaml
output:
  directory: out/my_run   # required
  fastq: true             # write gzip-compressed FASTQ files (default: true)
  bam: false              # write coordinate-sorted BAM (default: false)
  truth_vcf: true         # write ground-truth VCF (default: true)
  germline_vcf: true      # write germline truth VCF when germline is enabled (default: true)
  manifest: true          # write manifest.tsv (default: true)
  single_read_bam: false  # single-read BAM for long-read platforms (default: false)
  mapq: 60                # mapping quality for BAM records (default: 60)
```

---

### `sample`

```yaml
sample:
  name: TUMOUR_01       # sample name used in file names and read headers (default: SAMPLE)
  read_length: 150      # read length in bp (default: 150)
  coverage: 30.0        # mean target coverage depth (default: 30.0)
  platform: illumina    # sequencing platform tag; written to BAM @RG header (optional)
```

---

### `fragment`

Controls the insert size (fragment length) distribution.

```yaml
fragment:
  model: normal    # normal | cfda (default: normal)
  mean: 300.0      # mean fragment length in bp (default: 300.0)
  sd: 50.0         # standard deviation in bp (default: 50.0)
```

**Fragment models:**

- `normal`: Gaussian distribution. Suitable for standard library prep from fresh-frozen tissue or cell lines.
- `cfda`: Short, nucleosome-phased distribution reflecting cell-free DNA in plasma. Typical mean ~167 bp with mononucleosomal and dinucleosomal peaks and 10 bp periodicity from nucleosome positioning.

**cfDNA-specific options** (only apply when `model: cfda`):

```yaml
fragment:
  model: cfda
  ctdna_fraction: 0.05     # fraction of tumour-derived shorter fragments (default: derived from purity)
  mono_sd: 20.0            # SD of mononucleosomal peak in bp (default: 20.0)
  di_sd: 30.0              # SD of dinucleosomal peak in bp (default: 30.0)
  end_motif_model: plasma  # plasma cfDNA 4-mer end motif rejection sampling (optional)
```

**Long-read fragment model** (PacBio, Nanopore):

```yaml
fragment:
  long_read:
    mean: 15000    # mean length in bp (default: 15000)
    sd: 5000       # standard deviation in bp (default: 5000)
    min_len: 1000  # minimum fragment length (default: 1000)
    max_len: 100000  # maximum fragment length (default: 100000)
```

When `long_read` is set, the log-normal sampler is used instead of the normal or cfDNA sampler. Combine with `output.single_read_bam: true` for realistic long-read BAM output.
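
The sketch below shows one way a log-normal sampler with these four parameters could work. It assumes `mean` and `sd` describe the fragment lengths on the linear (bp) scale and that out-of-range draws are redrawn; both are assumptions of this example, not documented VarForge behaviour, and the function names are hypothetical.

```python
import math
import random


def lognormal_params(mean: float, sd: float) -> tuple[float, float]:
    """Convert a linear-scale mean and SD into log-normal mu and sigma."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    mu = math.log(mean) - sigma2 / 2.0
    return mu, math.sqrt(sigma2)


def sample_fragment_length(rng: random.Random, mean: float = 15000.0,
                           sd: float = 5000.0, min_len: int = 1000,
                           max_len: int = 100000) -> int:
    """Draw one fragment length, redrawing until it falls inside the bounds."""
    mu, sigma = lognormal_params(mean, sd)
    while True:
        length = int(round(rng.lognormvariate(mu, sigma)))
        if min_len <= length <= max_len:
            return length


rng = random.Random(42)
lengths = [sample_fragment_length(rng) for _ in range(10_000)]
print(min(lengths), max(lengths), sum(lengths) / len(lengths))
```

With the defaults, draws outside the `min_len` to `max_len` window are rare, so the redraw step barely shifts the sample mean.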

---

### `quality`

```yaml
quality:
  mean_quality: 36        # mean Phred quality score for the first cycle (default: 36)
  tail_decay: 0.003       # per-cycle quality decay rate (default: 0.003)
  profile_path: null      # optional path to empirical profile JSON from learn-profile
```

If `profile_path` is set, the empirical profile overrides `mean_quality` and `tail_decay`.

---

### `tumour`

```yaml
tumour:
  purity: 0.70    # fraction of cells that are tumour (0.0-1.0; default: 1.0)
  ploidy: 2       # tumour ploidy (default: 2)
  msi: false      # microsatellite instability mode (default: false)
  clones:         # optional list of clones for subclonal architecture
    - id: trunk
      ccf: 1.0           # cancer cell fraction (0.0-1.0)
    - id: subclone_a
      ccf: 0.40
      parent: trunk      # parent clone ID (optional; omit for founding clone)
```

When `clones` is omitted or empty, all mutations are assigned to a single clonal population at the specified `purity`. When clones are defined, mutations are distributed across the clone tree and their effective VAF is:

```
VAF = purity x CCF / ploidy
```
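
As a worked example of the formula above, the snippet below evaluates it for the two clones in the sample config (purity 0.70, ploidy 2). The `expected_vaf` helper is hypothetical and simply restates the formula.

```python
def expected_vaf(purity: float, ccf: float, ploidy: int = 2) -> float:
    """VAF = purity x CCF / ploidy, as given above."""
    return purity * ccf / ploidy


clones = {"trunk": 1.0, "subclone_a": 0.40}
for clone_id, ccf in clones.items():
    print(clone_id, round(expected_vaf(0.70, ccf), 4))
```

Here the trunk mutations land at a VAF of about 0.35 and the `subclone_a` mutations at about 0.14, which is a quick sanity check against the VAFs reported in the truth VCF.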

When `msi: true`, indel rates at homopolymer and dinucleotide repeat loci are elevated to simulate MSI-high tumours.

---

### `mutations`

```yaml
mutations:
  vcf: /path/to/variants.vcf.gz   # optional: inject specific variants from VCF
  random:                          # optional: add random somatic mutations
    count: 5000                    # number of mutations to generate
    vaf_min: 0.001                 # minimum VAF (default: 0.001)
    vaf_max: 0.50                  # maximum VAF (default: 0.5)
    snv_fraction: 0.80             # fraction that are SNVs (default: 0.80)
    indel_fraction: 0.15           # fraction that are indels (default: 0.15)
    mnv_fraction: 0.05             # fraction that are MNVs (default: 0.05)
    signature: SBS7a               # COSMIC SBS signature for weighted base selection (optional)
  sv_signature: HRD                # SV signature: HRD, TDP, or CHROMOTHRIPSIS (optional)
  sv_count: 10                     # number of SVs to generate for the signature (default: 10)
  include_driver_mutations: false  # inject driver mutations from cancer preset (default: false)
```

`snv_fraction + indel_fraction + mnv_fraction` must sum to exactly 1.0.
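
A quick pre-flight check for this constraint might look like the sketch below. The helper name and the small floating-point tolerance are assumptions of this example; `varforge validate` remains the authoritative check.

```python
import math


def fractions_sum_to_one(snv: float, indel: float, mnv: float,
                         tol: float = 1e-9) -> bool:
    """Return True if the three mutation-type fractions sum to 1.0 within tol."""
    return math.isclose(snv + indel + mnv, 1.0, abs_tol=tol)


print(fractions_sum_to_one(0.80, 0.15, 0.05))  # the documented defaults
print(fractions_sum_to_one(0.80, 0.15, 0.10))  # sums to 1.05
```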

Both `vcf` and `random` may be specified simultaneously. VCF variants are injected first, then random mutations are added at non-overlapping positions.

**SV signatures** generate structural variants with biologically realistic size and type distributions:

- `HRD`: large deletions (100 kbp to 10 Mbp) characteristic of homologous recombination deficiency.
- `TDP`: short tandem duplications (1 kbp to 10 kbp) characteristic of the tandem duplicator phenotype.
- `CHROMOTHRIPSIS`: clustered rearrangements (deletions, inversions, duplications) on a single chromosome.

**COSMIC SBS signatures** weight the alternate base selection by trinucleotide context probabilities from the COSMIC catalogue. For example, `signature: SBS7a` produces UV-type C>T mutations in dipyrimidine contexts typical of melanoma.

---

### `umi`

```yaml
umi:
  length: 8              # UMI barcode length in bases (default: 8)
  duplex: false          # enable duplex (double-stranded) UMI mode (default: false)
  pcr_cycles: 10         # number of PCR amplification cycles (default: 10)
  family_size_mean: 3.0  # mean read family size (default: 3.0)
  family_size_sd: 1.5    # standard deviation of family size (default: 1.5)
  inline: true           # prepend UMI to read sequence (default: false)
```

When `inline: true`, the UMI is prepended to the read sequence (e.g. for fgbio `ExtractUmisFromBam`). When `inline: false`, the UMI is written into the read name as a `:UMI=` suffix (e.g. `@TUMOUR:chr7:55191822:1:UMI=ACGTACGT`).

When `duplex: true`, each molecule is tagged with a strand-specific UMI pair supporting duplex consensus calling tools such as fgbio `CallDuplexConsensusReads`.

---

### `artifacts`

```yaml
artifacts:
  ffpe_damage_rate: 0.02    # C>T deamination rate (0.0-1.0; null = disabled)
  oxog_rate: 0.01           # 8-oxoG C>A transversion rate (0.0-1.0; null = disabled)
  duplicate_rate: 0.15      # PCR duplicate fraction (0.0-1.0; null = disabled)
  pcr_error_rate: 0.001     # PCR substitution error rate per base (null = disabled)
```

All fields are optional. Omit the entire `artifacts` block (or set individual rates to `null`) to disable artefact simulation.

---

### `copy_number`

```yaml
copy_number:
  - region: "chr7:55000000-55200000"   # chrom:start-end (1-based, inclusive)
    tumor_cn: 4                        # tumour copy number (default: 2)
    normal_cn: 2                       # normal copy number (default: 2)
    major_cn: 3                        # major allele CN for LOH modelling (optional)
    minor_cn: 1                        # minor allele CN for LOH modelling (optional)
```

Multiple entries may be listed. Overlapping regions are applied in order (last wins). Read depth in each region is scaled proportionally to `tumor_cn / normal_cn`.
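
The "last wins" rule and the depth scaling can be sketched as follows. This is an illustration under stated assumptions: the `CopyNumberEntry` class and helper functions are hypothetical, the second overlapping entry is invented for the example, positions outside every region are assumed to keep a scale of 1.0, and purity is ignored.

```python
from dataclasses import dataclass


@dataclass
class CopyNumberEntry:
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    tumor_cn: int = 2
    normal_cn: int = 2


def parse_region(region: str) -> tuple[str, int, int]:
    """Split 'chrom:start-end' into its three parts."""
    chrom, span = region.rsplit(":", 1)
    start, end = span.split("-")
    return chrom, int(start), int(end)


def depth_scale(entries: list[CopyNumberEntry], chrom: str, pos: int) -> float:
    """Return tumor_cn / normal_cn for the last entry covering the position."""
    scale = 1.0  # assumed default outside any listed region
    for entry in entries:  # later entries overwrite earlier ones
        if entry.chrom == chrom and entry.start <= pos <= entry.end:
            scale = entry.tumor_cn / entry.normal_cn
    return scale


entries = [
    CopyNumberEntry(*parse_region("chr7:55000000-55200000"), tumor_cn=4),
    # Hypothetical nested loss, listed second so it overrides the gain above.
    CopyNumberEntry(*parse_region("chr7:55100000-55150000"), tumor_cn=1),
]
print(depth_scale(entries, "chr7", 55050000))
print(depth_scale(entries, "chr7", 55120000))
print(depth_scale(entries, "chr1", 1000))
```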

---

### `gc_bias`

```yaml
gc_bias:
  enabled: true      # apply GC bias model (default: true when block is present)
  model: default     # default | flat | custom (default: "default")
  severity: 1.0      # bias multiplier: 0 = none, 1 = realistic, 2 = extreme (default: 1.0)
```

The `default` model applies an empirical coverage reduction at GC extremes (< 30 % or > 70 % GC). Setting `severity: 0` disables the effect while keeping the block present.

---

### `capture`

```yaml
capture:
  enabled: true                             # activate capture model (default: true)
  mode: panel                               # panel | amplicon (default: panel)
  targets_bed: /data/panels/panel.bed       # path to capture target BED (optional)
  off_target_fraction: 0.20                 # fraction of reads mapping off-target (default: 0.2)
  coverage_uniformity: 0.30                 # per-target LogNormal sigma (0 = uniform; default: 0.3)
  edge_dropoff_bases: 50                    # exponential dropoff at target edges in bp (default: 50)
  primer_trim: 0                            # bases to trim from read ends in amplicon mode (default: 0)
  coverage_cv_target: 0.25                  # warn if achieved CV exceeds this (optional)
  on_target_fraction_target: 0.95           # warn if on-target fraction falls below this (optional)
```

In `panel` mode, reads are distributed across targets with off-target spillover. In `amplicon` mode, fragments exactly span each target region with no off-target reads. When `targets_bed` is omitted, the capture model distributes reads uniformly across whichever chromosomes or regions are active.

---

### `germline`

```yaml
germline:
  het_snp_density: 0.6     # heterozygous SNPs per kbp (default: 0.6)
  hom_snp_density: 0.3     # homozygous SNPs per kbp (default: 0.3)
  het_indel_density: 0.05   # heterozygous indels per kbp (default: 0.05)
  vcf: /path/to/germline.vcf  # use specific germline variants instead of random (optional)
```

Germline variants are assigned VAF 0.5 (heterozygous) or 1.0 (homozygous). They appear in the separate `germline_truth.vcf` output.

---

### `paired`

```yaml
paired:
  normal_coverage: 30.0                   # coverage for the normal sample (default: 30.0)
  normal_sample_name: NORMAL              # sample name for normal output (default: NORMAL)
  tumour_contamination_in_normal: 0.0     # tumour contamination in normal (0.0-1.0; default: 0.0)
```

When present, VarForge runs two simulations: one tumour sample (with all somatic and germline variants) and one normal sample (germline only). Outputs are written to `tumour/` and `normal/` sub-directories under `output.directory`.

---

### `samples` (multi-sample / longitudinal)

When `samples` is present, VarForge generates one output sub-directory per entry and a combined `manifest.tsv`. Each entry shares the top-level `reference`, `mutations`, `tumour`, and `fragment` settings but can override coverage, tumour fraction, and fragment model independently.

```yaml
samples:
  - name: timepoint_1
    coverage: 1000.0
    tumour_fraction: 0.05        # ctDNA fraction for this sample (default: 1.0)
    fragment_model: cfda         # override fragment model (optional)
    clonal_shift:                # per-clone CCF adjustments at this timepoint (optional)
      subclone_a: 0.10
  - name: timepoint_2
    coverage: 1000.0
    tumour_fraction: 0.002
    fragment_model: cfda
```

---

### `performance`

```yaml
performance:
  output_buffer_regions: 64    # max region batches buffered in streaming channel (default: 64)
```

Higher values use more memory but provide better overlap between compute and I/O. Lower values reduce peak memory.

---

## Presets Reference

Presets are named configuration bundles that set sensible defaults for common scenarios. A preset is applied before the YAML config values, so explicit YAML fields always win.

```
varforge simulate --config base.yaml --preset <NAME>
varforge simulate --list-presets
```
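
The precedence rule can be pictured as a layered merge in which the preset forms the base and the YAML config is applied on top. The sketch below is illustrative only: `deep_merge` is a hypothetical helper, and the preset contents are abbreviated values chosen for the example rather than the full definition of any built-in preset.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override applied; nested dicts are merged recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Abbreviated, illustrative preset values.
preset = {
    "sample": {"coverage": 30.0, "read_length": 150},
    "mutations": {"random": {"count": 5000}},
}
# The user's YAML only sets coverage explicitly.
yaml_config = {"sample": {"coverage": 60.0}}

effective = deep_merge(preset, yaml_config)
print(effective)
```

Only the explicitly set field changes; everything else falls through from the preset.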

### Built-in presets

| Preset | Coverage | Fragment | Mutations | UMI | Notes |
|--------|----------|----------|-----------|-----|-------|
| `small` | 1x | normal | 100 random (chr22 only) | no | Smoke test; completes in ~30 s |
| `panel` | 500x | normal | 50 random | inline 8-mer | Targeted panel benchmarking |
| `wgs` | 30x | normal | 5 000 random | no | Whole-genome variant calling |
| `cfdna` | 200x | cfda (167 bp) | 200 random, VAF 0.1-5% | duplex | Liquid biopsy simulation |
| `ffpe` | 30x | normal | 500 random | no | FFPE artefacts enabled |
| `umi` | 1 000x | normal | 50 random | duplex 9-mer | High-depth duplex consensus |

### Cancer-type presets

Cancer presets are accessed with the `cancer:` namespace prefix. Each preset sets biologically realistic mutation counts, VAF ranges, tumour purity, and mutation-type fractions based on published COSMIC mutational signatures.

```
varforge simulate --config base.yaml --preset cancer:melanoma
```

| Preset | Cancer | Dominant Signature | Typical TMB | Purity |
|--------|--------|--------------------|-------------|--------|
| `cancer:lung_adeno` | Lung adenocarcinoma | SBS4 (smoking, C>A) | ~8 mut/Mb | 60% |
| `cancer:colorectal` | Colorectal (MSS) | SBS1/SBS5 (aging, C>T) | ~5 mut/Mb | 65% |
| `cancer:breast_tnbc` | Triple-negative breast | SBS3 (HRD, flat) | ~5 mut/Mb | 55% |
| `cancer:melanoma` | Cutaneous melanoma | SBS7a/b (UV, C>T) | ~30 mut/Mb | 70% |
| `cancer:aml` | Acute myeloid leukaemia | SBS1/SBS5 (aging) | ~1 mut/Mb | 80% |
| `cancer:prostate` | Prostate adenocarcinoma | SBS1/SBS5 (aging) | ~2 mut/Mb | 50% |
| `cancer:pancreatic` | Pancreatic ductal | SBS1/SBS5 (aging) | ~3 mut/Mb | 25% |
| `cancer:glioblastoma` | Glioblastoma (IDH-wt) | SBS1/SBS5 (aging) | ~4 mut/Mb | 65% |

---

## Example Configs

Ready-to-use example configs are in the `examples/` directory.

| File | Use case |
|------|----------|
| [`examples/minimal.yaml`](examples/minimal.yaml) | Simplest possible simulation (defaults only) |
| [`examples/wgs_30x.yaml`](examples/wgs_30x.yaml) | Standard 30x WGS tumour with random mutations |
| [`examples/panel_umi.yaml`](examples/panel_umi.yaml) | Targeted panel with inline 8-mer UMI |
| [`examples/cfdna_monitoring.yaml`](examples/cfdna_monitoring.yaml) | cfDNA longitudinal series (4 timepoints) |
| [`examples/ffpe_artifacts.yaml`](examples/ffpe_artifacts.yaml) | FFPE-damaged tumour sample |
| [`examples/tumor_normal.yaml`](examples/tumor_normal.yaml) | Matched tumour/normal pair |
| [`examples/subclonal.yaml`](examples/subclonal.yaml) | Four-clone tumour with copy number alterations |
| [`examples/high_depth.yaml`](examples/high_depth.yaml) | 1000x duplex UMI for low-VAF detection |
| [`examples/custom_mutations.yaml`](examples/custom_mutations.yaml) | Inject specific variants from a VCF file |
| [`examples/twist_duplex_benchmark.yaml`](examples/twist_duplex_benchmark.yaml) | Twist duplex capture panel with HRD SVs |

---

## Output Formats

### FASTQ headers

Read names follow the format:

```
@{SAMPLE}:{CHROM}:{POS}:{READ_NUM}[:UMI={BARCODE}]
```

Example: `@TUMOUR:chr7:55191822:1:UMI=ACGTACGT`

The UMI suffix is only present when a `umi` block is configured and `inline: false`.
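
For downstream scripts that need the origin of each read, the name format above can be parsed with a regular expression. The sketch below is a minimal example: `parse_read_name` is a hypothetical helper, and it assumes that sample and chromosome names contain no colons and that the barcode uses only the letters A, C, G, T, and N.

```python
import re

# Assumes no ':' inside sample or chromosome names.
READ_NAME = re.compile(
    r"^@(?P<sample>[^:]+):(?P<chrom>[^:]+):(?P<pos>\d+):(?P<read_num>\d+)"
    r"(?::UMI=(?P<umi>[ACGTN]+))?$"
)


def parse_read_name(name: str) -> dict:
    """Split a VarForge-style read name into its fields; umi is None if absent."""
    match = READ_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised read name: {name!r}")
    fields = match.groupdict()
    fields["pos"] = int(fields["pos"])
    fields["read_num"] = int(fields["read_num"])
    return fields


print(parse_read_name("@TUMOUR:chr7:55191822:1:UMI=ACGTACGT"))
print(parse_read_name("@TUMOUR:chr7:55191822:2"))
```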

### BAM tags

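When BAM output is enabled (`output.bam: true`), the following tags are written. `RX` and `MI` are standard tags from the SAM optional-fields specification; the lowercase `tp` and `cl` tags are VarForge-specific: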

| Tag | Type | Description |
|-----|------|-------------|
| `RX` | Z | Raw UMI sequence |
| `MI` | Z | Molecule ID (read family identifier) |
| `tp` | i | 1 if the read carries a simulated somatic variant, 0 otherwise |
| `cl` | Z | Clone ID the variant was assigned to |

### Truth VCF fields

The truth VCF written to `truth.vcf.gz` uses the following INFO fields:

| Field | Description |
|-------|-------------|
| `VAF` | Target variant allele frequency |
| `CLONE` | Clone ID the variant was assigned to |
| `CCF` | Cancer cell fraction of the assigned clone |
| `TYPE` | Variant type: `SNV`, `INDEL`, `MNV`, or `SV` |
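
A small sketch of reading these INFO fields without a VCF library is shown below. The `parse_info` helper and the example INFO string are hypothetical; a real truth VCF may order the keys differently or carry additional fields.

```python
def parse_info(info: str) -> dict:
    """Parse a semicolon-separated INFO column into a dict, converting VAF and CCF to floats."""
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    for key in ("VAF", "CCF"):
        if key in fields:
            fields[key] = float(fields[key])
    return fields


# Hypothetical INFO column for a subclonal SNV.
example = "VAF=0.14;CLONE=subclone_a;CCF=0.4;TYPE=SNV"
record = parse_info(example)
print(record)
```

Grouping called variants by `CLONE` or binning them by `VAF` from this dict is usually enough for a first-pass sensitivity breakdown.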

---

## Use Case Recipes

### Benchmarking a somatic variant caller

```yaml
# Generate matched tumour/normal with known variants.
reference: /data/ref/hg38.fa
output:
  directory: out/caller_bench
  bam: true
  truth_vcf: true
samples:
  - name: TUMOUR
    coverage: 60.0
    tumour_fraction: 1.0
  - name: NORMAL
    coverage: 30.0
    tumour_fraction: 0.0
tumour:
  purity: 0.65
mutations:
  random:
    count: 1000
    vaf_min: 0.05
    vaf_max: 0.60
    snv_fraction: 0.80
    indel_fraction: 0.15
    mnv_fraction: 0.05
seed: 42
```

Run your caller against `TUMOUR/` with `NORMAL/` as the matched normal. Evaluate with:

```
bcftools stats --apply-filters PASS \
    caller_output.vcf.gz truth.vcf.gz
```

### Benchmarking UMI deduplication (fgbio)

Run `examples/panel_umi.yaml` with `inline: true` and BAM output enabled (`output.bam: true`), then process the resulting BAM with fgbio:

```
fgbio ExtractUmisFromBam \
    --input out/panel_umi/PANEL_UMI.bam \
    --output extracted.bam \
    --read-structure 8M+T 8M+T

fgbio GroupReadsByUmi \
    --input extracted.bam \
    --output grouped.bam \
    --strategy paired

fgbio CallMolecularConsensusReads \
    --input grouped.bam \
    --output consensus.bam \
    --min-reads 1
```

### Liquid biopsy sensitivity curve

Generate cfDNA samples at multiple tumour fractions and measure detection rate at each:

```
for TF in 0.10 0.05 0.01 0.005 0.001; do
    varforge simulate --config examples/cfdna_monitoring.yaml \
        --output-dir out/tf_${TF} \
        --purity ${TF} \
        --seed 42
done
```

### SV signature benchmarking

Generate data with HRD-type structural variants for SV caller evaluation:

```yaml
mutations:
  random:
    count: 500
    snv_fraction: 0.80
    indel_fraction: 0.15
    mnv_fraction: 0.05
  sv_signature: HRD
  sv_count: 20
```

### FFPE artefact filter development

Use `examples/ffpe_artifacts.yaml` to generate data with realistic FFPE damage, then evaluate your artefact filter:

- True positives: variants present in `truth.vcf.gz`
- Artefacts: C>T calls not in the truth VCF with strand bias (use the `tp` BAM tag to distinguish)

---

## Performance

VarForge uses a streaming architecture: rayon workers generate reads in parallel and send them through a bounded crossbeam channel to a dedicated writer thread. Memory scales with channel depth, not dataset size.

### Thread count

```
varforge simulate --config cfg.yaml --threads 16
```

Or set in the config:

```yaml
threads: 16
```

### Restricting scope

For development and testing, restrict to one or a few chromosomes:

```yaml
chromosomes:
  - chr22
```

Or to a BED file of target regions:

```yaml
regions_bed: /data/panels/hotspot_panel.bed
```

### Approximate runtimes (8-core laptop, hg38)

| Scenario | Coverage | Mutations | Time |
|----------|----------|-----------|------|
| `small` preset | 1x, chr22 | 100 | ~30 s |
| Panel (chr7, chr12, chr17) | 500x | 50 | ~2 min |
| WGS 30x | 30x, all | 5 000 | ~25 min |
| WGS 60x | 60x, all | 5 000 | ~50 min |
| Ultra-deep panel 1000x | 1000x, 3 chroms | 50 | ~8 min |

### Memory usage

Peak memory is approximately:

```
2 x read_length x threads x (coverage / 30) MB
```

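For 30x WGS with 150 bp reads on 8 threads, the formula gives 2 x 150 x 8 x (30 / 30) = 2400 MB. Runs restricted to a few regions stay well below this estimate: a 1000x panel on 8 threads peaks at around 200 MB.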

---

## Comparison with Other Tools

| Scenario | Recommended tool |
|----------|-----------------|
| Realistic Illumina base-quality profiles | VarForge (parametric or `learn-profile`) |
| Controlled somatic variant spike-in | VarForge or BAMSurgeon |
| Spike into a real patient BAM | BAMSurgeon (preserves real read background) |
| Simple read generation, no mutations | ART or NanoSim |
| Whole-genome de novo simulation | NEAT or VarForge |
| cfDNA / liquid biopsy data | VarForge (the only tool compared here with a native cfDNA model) |
| UMI-tagged duplex sequencing | VarForge (the only tool compared here with a native duplex model) |
| FFPE artefact simulation | VarForge (the only tool compared here with an FFPE + oxoG model) |
| Multi-sample longitudinal series | VarForge (the only tool compared here with native time-series support) |
| Structural variant signatures | VarForge (the only tool compared here with HRD/TDP/chromothripsis models) |
| Long-read data | VarForge (log-normal fragment model) or NanoSim |

VarForge is the right choice when you need a complete, reproducible, ground-truth dataset for pipeline benchmarking and do not have access to real patient sequencing data. BAMSurgeon is the better choice when you need to insert a small number of variants into an existing real-data BAM while preserving the authentic read background.

---

## Licence

[MIT](LICENSE)