# fgumi simulate CLI Reference
Generate synthetic sequencing data for testing and benchmarking the fgumi pipeline.
**Requires:** `cargo build --features simulate`
## Commands Overview
| Command | Output | Purpose | `--reference` |
|---------|--------|---------|---------------|
| `fgumi simulate fastq-reads` | R1/R2 FASTQ.gz | Input for `extract` | Required |
| `fgumi simulate mapped-reads` | Template-coord sorted BAM | Input for `group` | Required |
| `fgumi simulate grouped-reads` | Template-coord sorted BAM with MI tags | Input for `simplex`/`duplex`/`codec` | Required |
| `fgumi simulate consensus-reads` | Mapped BAM with consensus tags | Input for `filter` | Required |
| `fgumi simulate correct-reads` | Unmapped BAM + includelist | Input for `correct` | Not used |
**Note:** All simulate subcommands (except `correct-reads`) require `--reference` / `-r` pointing to a reference FASTA file. Positions are sampled from real chromosomes, template sequences are extracted from the reference, and BAM headers contain actual contig names and lengths. Read orientations are a 50/50 mix of F1R2 and R1F2 (strand coin flip per molecule).
---
## fgumi simulate fastq-reads
Generate paired-end FASTQ files with UMI sequences for input to `fgumi extract`.
### Usage
```bash
fgumi simulate fastq-reads \
--r1 output_R1.fastq.gz \
--r2 output_R2.fastq.gz \
--truth truth.tsv \
--reference ref.fa \
[OPTIONS]
```
### Required Arguments
| Argument | Type | Description |
|----------|------|-------------|
| `-1, --r1` | PATH | Output R1 FASTQ.gz file |
| `-2, --r2` | PATH | Output R2 FASTQ.gz file |
| `--truth` | PATH | Output truth TSV file (for validation) |
| `-r, --reference` | PATH | Reference FASTA file (sequences sampled from here) |
### Simulation Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `-n, --num-molecules` | INT | 1000 | Number of unique molecules to simulate |
| `-l, --read-length` | INT | 150 | Length of each read in bases |
| `-u, --umi-length` | INT | 8 | Length of UMI sequence in bases |
| `--read-structure-r1` | STRING | `8M+T` | Read structure for R1 (fgbio notation) |
| `--read-structure-r2` | STRING | `+T` | Read structure for R2 (fgbio notation) |
| `--seed` | INT | (random) | Random seed for reproducibility |
### Quality Model Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--warmup-bases` | INT | 10 | Number of bases before peak quality is reached |
| `--warmup-quality` | INT | 25 | Starting quality score during warmup phase |
| `--peak-quality` | INT | 37 | Peak quality score (Phred) |
| `--decay-start` | INT | 100 | Position where quality decay begins |
| `--decay-rate` | FLOAT | 0.08 | Quality drop per base after decay starts |
| `--quality-noise` | FLOAT | 2.0 | Standard deviation of quality noise |
| `--r2-quality-offset` | INT | -2 | Quality offset for R2 reads (typically negative) |
### Family Size Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--family-size-dist` | STRING | `lognormal` | Distribution: `lognormal`, `negbin`, or path to histogram |
| `--family-size-mean` | FLOAT | 3.0 | Mean family size (for lognormal) |
| `--family-size-stddev` | FLOAT | 2.0 | Family size standard deviation (for lognormal) |
| `--family-size-r` | FLOAT | 2.0 | r parameter (for negative binomial) |
| `--family-size-p` | FLOAT | 0.5 | p parameter (for negative binomial) |
| `--min-family-size` | INT | 1 | Minimum reads per family |
### Insert Size Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--insert-size-mean` | FLOAT | 300.0 | Mean insert size |
| `--insert-size-stddev` | FLOAT | 50.0 | Insert size standard deviation |
| `--insert-size-min` | INT | 50 | Minimum insert size |
| `--insert-size-max` | INT | 800 | Maximum insert size |
### Methylation Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--methylation-mode` | `em-seq` or `taps` | (disabled) | Methylation chemistry mode; disabled by default |
| `--cpg-methylation-rate` | FLOAT | 0.75 | Fraction of CpG cytosines that are methylated [0.0-1.0] |
| `--conversion-rate` | FLOAT | 0.98 | Enzymatic conversion efficiency for target cytosines [0.0-1.0] |
### Truth File Format
The truth TSV file contains ground truth for validation:
| Column | Description |
|--------|-------------|
| `read_name` | Read name (matches FASTQ header) |
| `true_umi` | The true UMI sequence (before any errors) |
| `molecule_id` | Unique molecule identifier |
| `family_id` | Family within the molecule |
| `strand` | Strand (A or B) |
| `chrom` | Chromosome/contig name |
| `pos` | 0-based genomic position |
### Example
```bash
# Generate 10,000 molecules with 8bp UMIs
fgumi simulate fastq-reads \
--r1 sim_R1.fastq.gz \
--r2 sim_R2.fastq.gz \
--truth sim_truth.tsv \
--reference hg38.fa \
--num-molecules 10000 \
--umi-length 8 \
--read-structure-r1 "8M142T" \
--read-structure-r2 "150T" \
--seed 42
```
---
## fgumi simulate mapped-reads
Generate template-coordinate sorted BAM with paired alignments for input to `fgumi group`.
### Usage
```bash
fgumi simulate mapped-reads \
--output output.bam \
--truth truth.tsv \
--reference ref.fa \
[OPTIONS]
```
### Required Arguments
| Argument | Type | Description |
|----------|------|-------------|
| `-o, --output` | PATH | Output BAM file (template-coordinate sorted) |
| `--truth` | PATH | Output truth TSV file (for validation) |
| `-r, --reference` | PATH | Reference FASTA file (sequences sampled from here) |
### Simulation Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `-n, --num-molecules` | INT | 1000 | Number of unique molecules to simulate |
| `-l, --read-length` | INT | 150 | Length of each read in bases |
| `-u, --umi-length` | INT | 8 | Length of UMI sequence in bases |
| `--seed` | INT | (random) | Random seed for reproducibility |
| `-t, --threads` | INT | 1 | Number of writer threads |
Positions are sampled from real chromosomes (weighted by length), read sequences are extracted from the reference, and the BAM header contains actual contig names and lengths.
### Alignment Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--mapq` | INT | 60 | Mapping quality for aligned reads |
| `--unmapped-fraction` | FLOAT | 0.0 | Fraction of reads to leave unmapped |
### Position Distribution Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--num-positions` | INT | (derived) | Number of genomic positions to use (default: num-molecules) |
| `--umis-per-position` | INT | 1 | Number of unique UMIs per position |
By default, each molecule gets a unique position. For high-depth benchmarking (testing MIH optimization in `group`), use fewer positions with many UMIs per position.
### Quality Model Options
(Same as fastq-reads)
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--warmup-bases` | INT | 10 | Number of bases before peak quality |
| `--warmup-quality` | INT | 25 | Starting quality during warmup |
| `--peak-quality` | INT | 37 | Peak quality score |
| `--decay-start` | INT | 100 | Position where decay begins |
| `--decay-rate` | FLOAT | 0.08 | Quality drop per base |
| `--quality-noise` | FLOAT | 2.0 | Quality noise std dev |
| `--r2-quality-offset` | INT | -2 | R2 quality offset |
### Family Size Options
(Same as fastq-reads)
### Insert Size Options
(Same as fastq-reads)
### Methylation Options
(Same as fastq-reads)
### Output Tags
| Tag | Type | Description |
|-----|------|-------------|
| `RX` | String | Raw UMI sequence |
| `RG` | String | Read group (default: "A") |
### Truth File Format
| Column | Description |
|--------|-------------|
| `read_name` | Read name (matches BAM QNAME) |
| `true_umi` | The true UMI sequence (before any errors) |
| `molecule_id` | Unique molecule identifier |
| `chrom` | Chromosome/contig name |
| `position` | 0-based genomic position |
| `strand` | Strand (`+` or `-`) |
### Example
```bash
# Generate mapped reads from a reference FASTA
fgumi simulate mapped-reads \
--output sim_mapped.bam \
--truth sim_truth.tsv \
--reference hg38.fa \
--num-molecules 5000 \
--seed 42 \
--threads 4
# High-depth mode: many UMIs at few positions (for MIH/group benchmarking)
fgumi simulate mapped-reads \
--output sim_high_depth.bam \
--truth sim_truth.tsv \
--reference hg38.fa \
--num-molecules 50000 \
--num-positions 100 \
--umis-per-position 500 \
--seed 42
```
---
## fgumi simulate grouped-reads
Generate template-coordinate sorted BAM with MI (molecule ID) tags for input to consensus callers (`simplex`, `duplex`, `codec`).
### Usage
```bash
fgumi simulate grouped-reads \
--output output.bam \
--truth truth.tsv \
--reference ref.fa \
[OPTIONS]
```
### Required Arguments
| Argument | Type | Description |
|----------|------|-------------|
| `-o, --output` | PATH | Output BAM file (template-coordinate sorted) |
| `--truth` | PATH | Output truth TSV file (for validation) |
| `-r, --reference` | PATH | Reference FASTA file (sequences sampled from here) |
### Simulation Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `-n, --num-molecules` | INT | 1000 | Number of unique molecules to simulate |
| `-l, --read-length` | INT | 150 | Length of each read in bases |
| `-u, --umi-length` | INT | 8 | Length of UMI sequence in bases |
| `--seed` | INT | (random) | Random seed for reproducibility |
| `-t, --threads` | INT | 1 | Number of writer threads |
### Duplex Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--duplex` | FLAG | false | Generate duplex-style MI tags (e.g., "1/A", "1/B") |
| `--strand-alpha` | FLOAT | 5.0 | Beta distribution alpha for A/B strand ratio |
| `--strand-beta` | FLOAT | 5.0 | Beta distribution beta for A/B strand ratio |
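To get a feel for how `--strand-alpha` and `--strand-beta` shape the A/B split, the mean and standard deviation of a Beta(alpha, beta) distribution can be computed directly. This is a minimal sketch: `beta_summary` is a hypothetical helper, and reading the sampled value as the A-strand share of a family is an assumption, not something this reference states.

```bash
# beta_summary ALPHA BETA: mean and standard deviation of Beta(ALPHA, BETA).
# Hypothetical helper for illustration; not part of fgumi.
beta_summary() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    mean = a / (a + b)
    sd = sqrt(a * b / ((a + b) * (a + b) * (a + b + 1)))
    printf "mean=%.3f sd=%.3f\n", mean, sd
  }'
}

beta_summary 5 5   # the documented defaults: symmetric around 0.5
beta_summary 8 2   # skewed toward one strand
```

Equal values keep the expected split at 50/50; raising both values together narrows the spread around that mean.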
### Quality Model Options
(Same as fastq-reads)
### Family Size Options
(Same as fastq-reads)
### Insert Size Options
(Same as fastq-reads)
### Methylation Options
(Same as fastq-reads)
### Output Tags
| Tag | Type | Description |
|-----|------|-------------|
| `RX` | String | Raw UMI sequence |
| `MI` | String | Molecule ID (integer for simplex; integer with a strand suffix for duplex, e.g., "1/A" or "1/B") |
| `RG` | String | Read group (default: "A") |
### Truth File Format
| Column | Description |
|--------|-------------|
| `read_name` | Read name (matches BAM QNAME) |
| `true_umi` | The true UMI sequence (before any errors) |
| `molecule_id` | Unique molecule identifier |
| `expected_mi` | Expected MI tag value after grouping |
| `chrom` | Chromosome/contig name |
| `position` | 0-based genomic position |
| `strand` | Strand (`+` or `-`) |
### Example
```bash
# Generate simplex grouped reads
fgumi simulate grouped-reads \
--output sim_grouped.bam \
--truth sim_truth.tsv \
--reference hg38.fa \
--num-molecules 5000 \
--seed 42
# Generate duplex grouped reads with strand bias
fgumi simulate grouped-reads \
--output sim_duplex_grouped.bam \
--truth sim_truth.tsv \
--reference hg38.fa \
--num-molecules 5000 \
--duplex \
--strand-alpha 5.0 \
--strand-beta 5.0 \
--seed 42
```
---
## fgumi simulate consensus-reads
Generate mapped BAM with consensus tags (cD, cM, cE, etc.) for input to `fgumi filter`.
### Usage
```bash
fgumi simulate consensus-reads \
--output output.bam \
--reference ref.fa \
[OPTIONS]
```
### Required Arguments
| Argument | Type | Description |
|----------|------|-------------|
| `-o, --output` | PATH | Output BAM file (mapped) |
| `-r, --reference` | PATH | Reference FASTA file (sequences sampled from here) |
### Simulation Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `-n, --num-reads` | INT | 1000 | Number of consensus read pairs to generate |
| `-l, --read-length` | INT | 150 | Length of each read in bases |
| `--seed` | INT | (random) | Random seed for reproducibility |
| `-t, --threads` | INT | 1 | Number of writer threads |
### Consensus Tag Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--min-depth` | INT | 1 | Minimum consensus depth (cM tag) |
| `--max-depth` | INT | 10 | Maximum consensus depth (cD tag) |
| `--depth-mean` | FLOAT | 5.0 | Mean depth for sampling |
| `--depth-stddev` | FLOAT | 2.0 | Depth standard deviation |
| `--error-rate-mean` | FLOAT | 0.01 | Mean error rate (cE tag) |
| `--error-rate-stddev` | FLOAT | 0.005 | Error rate standard deviation |
### Duplex Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--duplex` | FLAG | false | Generate duplex consensus tags (aD, bD, aM, bM, aE, bE) |
| `--strand-alpha` | FLOAT | 5.0 | Beta distribution alpha for A/B depth ratio |
| `--strand-beta` | FLOAT | 5.0 | Beta distribution beta for A/B depth ratio |
### Quality Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--consensus-quality` | INT | 40 | Base quality for consensus reads |
### Methylation Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--methylation-mode` | `em-seq` or `taps` | (disabled) | Methylation chemistry mode; disabled by default |
| `--cpg-methylation-rate` | FLOAT | 0.75 | Fraction of CpG cytosines that are methylated [0.0-1.0] |
| `--conversion-rate` | FLOAT | 0.98 | Enzymatic conversion efficiency for target cytosines [0.0-1.0] |
| `--methylation-depth-mean` | FLOAT | 5.0 | Mean depth for methylation count sampling (cu + ct per position) |
### Output Tags (Simplex)
| Tag | Type | Description |
|-----|------|-------------|
| `cD` | Int | Maximum per-base depth |
| `cM` | Int | Minimum per-base depth |
| `cE` | Float | Consensus error rate |
| `cd` | IntArray | Per-base depth array |
| `ce` | IntArray | Per-base error count array |
### Output Tags (Duplex, in addition to above)
| Tag | Type | Description |
|-----|------|-------------|
| `aD` | Int | A-strand maximum depth |
| `aM` | Int | A-strand minimum depth |
| `aE` | Float | A-strand error rate |
| `bD` | Int | B-strand maximum depth |
| `bM` | Int | B-strand minimum depth |
| `bE` | Float | B-strand error rate |
| `ad` | IntArray | A-strand per-base depth |
| `bd` | IntArray | B-strand per-base depth |
### Example
```bash
# Generate simplex consensus reads
fgumi simulate consensus-reads \
--output sim_consensus.bam \
--reference hg38.fa \
--num-reads 10000 \
--min-depth 2 \
--max-depth 20 \
--seed 42
# Generate duplex consensus reads
fgumi simulate consensus-reads \
--output sim_duplex_consensus.bam \
--reference hg38.fa \
--num-reads 10000 \
--duplex \
--seed 42
```
---
## fgumi simulate correct-reads
Generate unmapped BAM and UMI includelist for input to `fgumi correct`.
### Usage
```bash
fgumi simulate correct-reads \
--output output.bam \
--includelist umis.txt \
--truth truth.tsv \
[OPTIONS]
```
### Required Arguments
| Argument | Type | Description |
|----------|------|-------------|
| `-o, --output` | PATH | Output BAM file (unmapped) |
| `-i, --includelist` | PATH | Output UMI includelist file (one UMI per line) |
| `--truth` | PATH | Output truth TSV file (for validation) |
### Simulation Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `-n, --num-reads` | INT | 10000 | Number of reads to generate |
| `--num-umis` | INT | 1000 | Number of unique UMIs in includelist |
| `-u, --umi-length` | INT | 8 | Length of UMI sequence in bases |
| `--read-length` | INT | 100 | Length of template sequence |
| `--seed` | INT | (random) | Random seed for reproducibility |
| `-t, --threads` | INT | 1 | Number of writer threads |
### Error Distribution Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--exact-fraction` | FLOAT | 0.4 | Fraction with exact UMI match (0 edits) |
| `--edit1-fraction` | FLOAT | 0.3 | Fraction with 1 edit distance |
| `--edit2-fraction` | FLOAT | 0.2 | Fraction with 2 edit distance |
| `--multi-fraction` | FLOAT | 0.1 | Fraction with 3+ edits (should not correct) |
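The four defaults sum to 1.0. This reference does not say whether fgumi rejects or renormalizes other totals, so a quick pre-flight check can be useful before launching a large run. The sketch below uses a hypothetical helper, `check_fractions`, that only does the arithmetic; it does not call fgumi.

```bash
# check_fractions EXACT EDIT1 EDIT2 MULTI: verify the four fractions sum to 1.0.
# Hypothetical helper for illustration; not part of fgumi.
check_fractions() {
  awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" 'BEGIN {
    sum = a + b + c + d
    diff = sum - 1.0
    if (diff < 0) diff = -diff
    # Allow for floating-point rounding in the sum.
    if (diff > 1e-9) { printf "fractions sum to %g, not 1.0\n", sum; exit 1 }
    print "ok"
  }'
}

check_fractions 0.4 0.3 0.2 0.1   # the documented defaults
```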
### Quality Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `--quality` | INT | 30 | Base quality for all bases |
### Output Tags
| Tag | Type | Description |
|-----|------|-------------|
| `RX` | String | Observed (possibly erroneous) UMI sequence |
### Includelist Format
The includelist is a plain text file with one UMI per line, sorted alphabetically:
```
AAAACCCC
AAAACCCT
AAAACCGG
...
```
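A generated includelist can be sanity-checked against this format with standard tools. The sketch below assumes UMIs contain only A, C, G, and T and that "sorted alphabetically" means byte order (`LC_ALL=C`); `check_includelist` is a hypothetical helper, not an fgumi command.

```bash
# check_includelist FILE UMI_LENGTH: hypothetical helper, not part of fgumi.
check_includelist() {
  local file="$1" len="$2"
  # Reject any line that is not exactly UMI_LENGTH bases of A/C/G/T.
  if grep -Evq "^[ACGT]{${len}}\$" "$file"; then
    echo "malformed UMI line"; return 1
  fi
  # sort -c exits non-zero when the input is not already sorted.
  if ! LC_ALL=C sort -c "$file" 2>/dev/null; then
    echo "not sorted"; return 1
  fi
  echo "ok"
}

# Toy includelist reusing the three UMIs shown above.
umis=$(mktemp)
printf '%s\n' AAAACCCC AAAACCCT AAAACCGG > "$umis"
check_includelist "$umis" 8
```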
### Truth File Format
| Column | Description |
|--------|-------------|
| `read_name` | Read name (matches BAM QNAME) |
| `true_umi` | The correct UMI from the includelist |
| `observed_umi` | The (possibly erroneous) UMI in the RX tag |
| `expected_correction` | Expected UMI after correction |
| `edit_distance` | Edit distance between observed and true |
| `error_type` | Type: `exact`, `edit1`, `edit2`, or `multi` |
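One way to use this file is to tally `error_type` and compare the counts with the requested fractions. The sketch below assumes the TSV is tab-delimited with a header row naming the columns above; the rows are made up for illustration, and `count_error_types` is a hypothetical helper.

```bash
# Toy truth file with made-up rows; header row and tab delimiters are assumed.
truth=$(mktemp)
{
  printf 'read_name\ttrue_umi\tobserved_umi\texpected_correction\tedit_distance\terror_type\n'
  printf 'read1\tAAAACCCC\tAAAACCCC\tAAAACCCC\t0\texact\n'
  printf 'read2\tAAAACCCC\tAAAACCCA\tAAAACCCC\t1\tedit1\n'
  printf 'read3\tAAAACCGG\tAAAACCGG\tAAAACCGG\t0\texact\n'
} > "$truth"

# count_error_types FILE: tally the error_type column, located by header name.
count_error_types() {
  awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "error_type") col = i; next }
    col     { count[$col]++ }
    END     { for (t in count) printf "%s\t%d\n", t, count[t] }
  ' "$1" | LC_ALL=C sort
}

count_error_types "$truth"
```

Dividing each count by the total number of reads gives observed fractions to set against `--exact-fraction`, `--edit1-fraction`, and the other error distribution options.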
### Example
```bash
# Generate correction test data
fgumi simulate correct-reads \
--output sim_correct.bam \
--includelist sim_umis.txt \
--truth sim_truth.tsv \
--num-reads 100000 \
--num-umis 5000 \
--umi-length 12 \
--exact-fraction 0.5 \
--edit1-fraction 0.3 \
--edit2-fraction 0.15 \
--multi-fraction 0.05 \
--seed 42
```
---
## Shared Concepts
### Read Structure Notation
Uses fgbio-style read structure notation:
- `M` = Molecular barcode (UMI)
- `T` = Template
- `S` = Skip (ignored bases)
- `B` = Sample barcode
- `+` = Variable length (consumes remaining bases)
Examples:
- `8M142T` = 8bp UMI followed by 142bp template
- `8M+T` = 8bp UMI followed by variable-length template
- `+T` = All template (no UMI)
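The notation can be split into segments with a small shell function. This is a sketch covering only the four segment types listed above, with `+` allowed on the final segment only; `parse_read_structure` is a hypothetical helper, not part of fgumi or fgbio.

```bash
# parse_read_structure STRUCTURE: print one "<length><type>" segment per line.
# Hypothetical helper for illustration.
parse_read_structure() {
  local rs="$1"
  # Fixed-length segments, then a final segment that may use "+" as its length.
  if ! printf '%s\n' "$rs" | grep -Eq '^([0-9]+[MTSB])*([0-9]+|\+)[MTSB]$'; then
    echo "invalid read structure: $rs" >&2
    return 1
  fi
  printf '%s\n' "$rs" | grep -oE '([0-9]+|\+)[MTSB]'
}

parse_read_structure 8M142T   # prints 8M, then 142T
parse_read_structure 8M+T     # prints 8M, then +T
```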
### Family Size Distributions
Three distribution types are supported:
1. **Log-normal** (default): Natural for PCR amplification
- Parameters: `--family-size-mean`, `--family-size-stddev`
2. **Negative binomial**: Alternative PCR model
- Parameters: `--family-size-r`, `--family-size-p`
3. **Empirical**: Load from `fgumi group -f` histogram output
- Parameter: `--family-size-dist /path/to/histogram.tsv`
### Quality Score Model
Quality scores follow a three-phase model:
1. **Warmup** (positions 0 to `warmup-bases`): Quality ramps from `warmup-quality` to `peak-quality`
2. **Peak** (positions `warmup-bases` to `decay-start`): Quality stays at `peak-quality`
3. **Decay** (positions after `decay-start`): Quality decreases by `decay-rate` per base
R2 reads have an additional offset (`r2-quality-offset`, typically -2) applied.
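The three phases can be sketched as a function of read position. The defaults below mirror the documented option defaults; the linear warmup ramp, the omission of `quality-noise`, and the floor at Q2 are assumptions of this sketch rather than confirmed fgumi behavior, and `expected_quality` is a hypothetical helper.

```bash
# expected_quality POS [IS_R2]: noise-free quality at a 0-based read position.
# Hypothetical helper; linear ramp and Q2 floor are assumptions.
expected_quality() {
  awk -v pos="$1" -v r2="${2:-0}" \
      -v warmup_bases=10 -v warmup_q=25 -v peak_q=37 \
      -v decay_start=100 -v decay_rate=0.08 -v r2_offset=-2 'BEGIN {
    if (pos < warmup_bases)       # warmup: ramp from warmup_q up to peak_q
      q = warmup_q + (peak_q - warmup_q) * pos / warmup_bases
    else if (pos < decay_start)   # peak: hold at peak_q
      q = peak_q
    else                          # decay: lose decay_rate per base
      q = peak_q - decay_rate * (pos - decay_start)
    if (r2) q += r2_offset        # R2 reads are shifted by the offset
    if (q < 2) q = 2
    printf "%.2f\n", q
  }'
}

expected_quality 5       # mid-warmup
expected_quality 50      # peak
expected_quality 149     # last base of a 150 bp R1
expected_quality 149 1   # same position on R2
```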
### Template-Coordinate Sorting
For `mapped-reads` and `grouped-reads`, output is sorted by template coordinate:
- Primary sort: 5' position of the leftmost read in the pair
- Secondary sort: Read name (for determinism)
BAM header includes: `SO:unsorted`, `GO:query`, `SS:template-coordinate`
### High-Depth Benchmarking
To benchmark the MIH (Multiple Identical Hits) optimization in `fgumi group`, use the position distribution options to create data with many UMIs at few positions:
```bash
# 100 positions, each with 500 unique UMIs, ~10 reads per UMI = 500,000 reads
fgumi simulate mapped-reads \
--output high_depth.bam \
--truth high_depth_truth.tsv \
--reference hg38.fa \
--num-molecules 50000 \
--num-positions 100 \
--umis-per-position 500 \
--family-size-mean 10 \
--seed 42
```
This creates the kind of position-clustered data that stresses the UMI assignment algorithm and tests the MIH optimization path in `group`.
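The sizing arithmetic in the example comment can be checked before a run. This sketch treats the mean family size as the expected number of reads per UMI, which is an approximation since family sizes are sampled from a distribution; the variable names are illustrative.

```bash
# Sizing for a high-depth run: positions x UMIs per position x mean family size.
positions=100
umis_per_position=500
mean_family_size=10

molecules=$(( positions * umis_per_position ))
reads=$(( molecules * mean_family_size ))
echo "molecules=$molecules approx_reads=$reads"
```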