Holodeck
Modern NGS read simulator written in Rust.
Visit us at Fulcrum Genomics to learn more about how we can power your Bioinformatics with holodeck and beyond.
Overview
Holodeck generates simulated sequencing reads from a reference genome with optional variants from a VCF file. It supports Illumina-style paired-end and single-end reads with a position-dependent error model, targeted sequencing with BED files, and arbitrary ploidy with per-contig overrides.
Simulated reads include ground-truth information encoded in read names, and optionally in a golden BAM file, enabling downstream evaluation of alignment and variant-calling accuracy.
Installation
From source
Requires Rust 1.94.0 or later.
# Binary is at target/release/holodeck
Commands
| Command | Description |
|---|---|
simulate |
Generate simulated reads from a reference genome with optional VCF variants |
mutate |
Generate a random VCF of mutations from a reference genome |
eval |
Evaluate alignment accuracy of simulated reads against truth |
Quick Start
# Index your reference (if not already done)
# Generate random mutations
# Simulate 30x paired-end reads with ground-truth BAM
# Outputs: sim.r1.fastq.gz, sim.r2.fastq.gz, sim.golden.bam
# Align reads and evaluate accuracy
|
# Output: eval_results.eval.txt
Simulate
Generate paired-end or single-end FASTQ files from a reference genome with optional variants from a VCF. Reads are sampled with a position-dependent Illumina error model where the error rate ramps from a minimum at the start of reads to a maximum at the end.
# Whole-genome 30x with variants
# Targeted sequencing (exome/panel)
# Single-end, no errors, with golden BAM
# Custom fragment size distribution and compression threads
Key options:
| Option | Default | Description |
|---|---|---|
-r, --reference |
required | Indexed FASTA reference |
-v, --vcf |
none | VCF with variants to apply |
-b, --targets |
none | BED file of target regions |
-o, --output |
required | Output prefix |
-c, --coverage |
30 | Mean coverage depth |
-l, --read-length |
150 | Read length in bases |
-d, --fragment-mean |
300 | Mean fragment size |
-s, --fragment-stddev |
50 | Fragment size standard deviation |
--min-error-rate |
0.001 | Error rate at start of reads |
--max-error-rate |
0.01 | Error rate at end of reads |
--golden-bam |
off | Write ground-truth BAM |
--single-end |
off | Generate SE instead of PE reads |
--simple-names |
off | Use holodeck::N names instead of encoded truth |
--compression |
1 | BGZF compression level (0-12) |
-t, --threads |
4 | Threads for BGZF compression |
--seed |
auto | Random seed (deterministic by default) |
Mutate
Generate a VCF of random mutations from a reference genome. The output can be fed directly to holodeck simulate. Supports independent control of SNP, indel, and MNP rates with configurable ploidy.
# Default rates
# Realistic human-like rates (~1 variant per 1000bp, 10:1 SNP:indel)
# Male sample with haploid chrX/chrY
# Restrict mutations to target regions
Key options:
| Option | Default | Description |
|---|---|---|
--snp-rate |
0.001 | SNP rate per base |
--indel-rate |
0.0001 | Indel rate per base |
--mnp-rate |
0.00005 | MNP rate per base |
--het-hom-ratio |
2.0 | Ratio of heterozygous to homozygous variants |
--ploidy |
2 | Default ploidy |
--ploidy-override |
none | Per-contig/region ploidy (e.g. chrX=1) |
--indel-length-param |
0.7 | Geometric distribution parameter for indel lengths |
Eval
Evaluate alignment accuracy by comparing mapped positions against truth positions encoded in read names. Reports accuracy stratified by MAPQ bin.
Key options:
| Option | Default | Description |
|---|---|---|
-m, --mapped |
required | BAM file of mapped reads |
-o, --output |
required | Output prefix (writes .eval.txt) |
--wiggle |
5 | Max distance (bp) for a correct mapping |
Features
- Position-dependent error model -- error rate ramps across the read, with R2 having higher rates than R1 (configurable multiplier)
- Multi-sample VCF support -- select a sample with
--samplefrom multi-sample VCFs - Arbitrary ploidy -- phased and unphased genotypes with any ploidy; per-contig and per-region overrides for sex chromosomes and PAR regions
- BED target regions -- efficient padded-interval sampling for exome/panel simulations
- Adapter simulation -- configurable adapter sequences appended when fragment < read length
- Encoded read names -- truth coordinates embedded in read names for downstream evaluation without needing the golden BAM
- Golden BAM output -- ground-truth alignments with correct positions, CIGARs (including variant-induced indels), and adapter soft-clips
- Multi-threaded compression -- BGZF output compression parallelized across configurable threads via
pooled-writer - Deterministic by default -- same parameters always produce the same output via hash-derived seeding
- Sparse haplotype representation -- variants stored in interval trees overlaid on the reference; no full-genome copies
Read Name Format
Read names encode ground-truth alignment information for use by holodeck eval and other tools.
Encoded format (default):
PE: @holodeck::READ_NUM::FRAG_LEN::CONTIG::POS1+STRAND::POS2+STRAND::HAP::ERRS1::ERRS2
SE: @holodeck::READ_NUM::FRAG_LEN::CONTIG::POS+STRAND::HAP::ERRS
Example:
@holodeck::42::600::chr1::10000F::10450R::0::2::1
Fields: read number, source fragment length (bp), contig, R1 position (1-based) + strand (F/R), R2 position + strand, haplotype index, R1 errors, R2 errors.
FRAG_LEN is the length of the originating template. When it is smaller than the read length, the remainder of each read (read_length - FRAG_LEN bases) is adapter sequence (possibly padded with N if the configured adapter is shorter than that remainder). This lets consumers locate the adapter boundary in both PE and SE reads without the golden BAM.
Fields are separated by :: (double colon) so contig names that legally contain single : characters (e.g. HLA alleles like HLA-A*01:01:01:01) parse unambiguously. Contig names must not contain @ (the FASTQ header prefix) or ::; both are rejected by a debug assertion in the read-name formatter.
Simple format (--simple-names):
@holodeck::42
Performance
Holodeck is designed for high throughput. Typical performance on Apple M-series hardware:
| Scenario | Reads/sec |
|---|---|
| FASTQ only | ~330K read pairs/sec |
| FASTQ + golden BAM | ~170K read pairs/sec |
Performance scales linearly with read count. Adding realistic variant density (~1 per 1,000bp) has no measurable impact on throughput.