# Compare guide
The `aardvark compare` command is used to compare (or benchmark) a "query" call set against a "truth" call set.
The `compare` command will load the query and truth variant sets, group them into smaller clusters that are interrogated independently, and then perform the comparisons of the sets.
The main output is a summary statistic file reporting precision, recall, and F1 score measuring how similar and different the "query" and "truth" are.
Table of contents:
* [Quickstart](#quickstart)
* [Output files](#output-files)
* [Summary statistics file](#summary-statistics-file)
* [Labeled VCF files](#labeled-vcf-files)
* [Debug folder](#debug-folder)
* [Complex inputs](#complex-inputs)
* [Frequently asked questions](#frequently-asked-questions)
# Quickstart
The following command uses recommended default settings for comparing small variants only.
See our [recommended settings](./recommended_settings.md) for recommendations comparisons with other variant types.
```bash
aardvark compare \
--reference <FASTA> \
--truth-vcf <VCF> \
--query-vcf <VCF> \
--regions <BED> \
--output-dir <DIR>
```
Required parameters:
* `--reference <FASTA>` - Input reference genome FASTA file, gzip allowed
* `--truth-vcf <VCF>` - Input truth variant call set, which must be tabix-indexed
* `--query-vcf <VCF>` - Input query variant call set, which must be tabix-indexed
* `--regions <BED>` - Input regions to perform comparison in. Any variants from either truth or query that are not **fully-contained** within these regions will be excluded from the entire analysis.
* `--output-dir <DIR>` - Primary output folder containing the [summary statistics file](#summary-statistics-file) and [labeled VCF files](#labeled-vcf-files)
## Additional options
* `--threads <THREADS>` - The number of compute threads to use for the comparison step
* `--output-debug <DIR>` - Creates a [debug folder](#debug-folder) and populates it with more detailed statistics from the comparison
* `--truth-sample <SAMPLE>` and `--query-sample <SAMPLE>` - Allows for the specification of the truth or query sample name in the provided VCF. This is usually only needed when they are multi-sample VCFs.
* `--stratifications` - The root TSV file for a stratification. See [stratifications](#stratifications) for details.
# Output files
## Summary statistics file
The summary statistics file (`summary.tsv`) is a TSV output containing high-level summary metrics for the comparison including variant counts and distance metrics.
### Fields
* `compare_label` - A user-provided comparison label that is just passed through to this output. Specified via `--compare-label` option.
* `comparison` - The comparison type, which will be one of `GT` (genotype), `BASEPAIR` (sequence/basepair level), or one of the optional metrics. See [methods](./methods.md#comparison-types) for more details on each comparison type.
* `filter` - Indicates if any filter was applied.
* `region_label` - The region label from stratification inputs. By default, only `ALL` is provided which contains all variants analyzed. If [stratifications](#stratifications) are provided, then additional rows for each stratification label will be added and this column will contain the label.
* `variant_type` - The type of variant that the assessment corresponds to.
* `Snv` - Single Nucleotide Variant, requires both REF and ALT to be exactly 1 bp
* `Insertion` - Insertion variant, required REF to be exactly 1 bp and ALT to be >1 bp
* `Deletion` - Deletion variant, requires REF to be >1 bp and Alt to be exactly 1 bp
* `Indel` - Insertion/deletion variant, requires both REF and ALT to be >1 bp
* `JointIndel` - Group category that includes `Insertion`, `Deletion`, and `Indel` types
* `TrExpansion` - Tandem repeat expansion variant, requires TRID tag and ALT >= REF
* `TrContraction` - Tandem repeat contraction variant, requires TRID tag and ALT < REF
* `JointTandemRepeat` - Group category that includes `TrExpansion` and `TrContraction` types
* `SvInsertion` - Structural variant insertion, requires SVTYPE=INS tag and ALT >= REF
* `SvDeletion` - Structural variant deletion, requires SVTYPE=DEL tag and ALT <= REF
* `JointStructuralVariant` - Group category that includes `SvInsertion` and `SvDeletion` types
* `ALL` - Group category that includes all variants types
* `truth_total` - The total number of truth values that were assessed.
* `truth_tp` - The number of truth values that were also identified in the query set, indicating a true positive (TP).
* `truth_fn` - The number of truth values that were not identified in the query set, indicating a false negative (FN).
* `query_total` - The total number of query values that were assessed.
* `query_tp` - The number of query values that were matched to the truth set, indicating TP. For `comparison` mode `BASEPAIR`, this value should be equal to `truth_tp`. Other modes may be different due to variant representation.
* `query_fp` - The number of query values that were not matched to the truth set, indicating a false positive (FP).
* `metric_recall` - The recall metric, which is calculated as `truth_tp / truth_total`.
* `metric_precision` - The precision metric, which is calculated as `query_tp / (query_tp + query_fp)`.
* `metric_f1` - The F1 score metric, which is calculated as the harmonic mean of recall and precision. It is often used as an overall summary metric for comparisons.
* `truth_fn_gt` - The number of `truth_fn` where the allelic sequence was matched in both inputs, but with the wrong genotype (e.g., 1/1 in truth, 0/1 in query). This value is only populated if the `comparison` is `GT`.
* `query_fp_gt` - The number of `query_fp` where the allelic sequence was matched in both inputs, but with the wrong genotype (e.g., 0/1 in truth, 1/1 in query). This value is only populated if the `comparison` is `GT`.
### Example
This snippet of the summary file shows the `GT` and `BASEPAIR` comparison types, filtered to `ALL`, `Snv`, and `JointIndel` variant types.
```
compare_label comparison region_label filter variant_type truth_total truth_tp truth_fn query_total query_tp query_fp metric_recall metric_precision metric_f1 truth_fn_gt query_fp_gt
compare GT ALL ALL ALL 3751311 3745038 6273 3755783 3748198 7585 0.9983277846065016 0.9979804477521731 0.9981540859628386 2062 674
compare GT ALL ALL Snv 3256086 3254704 1382 3260867 3257726 3141 0.999575564036085 0.9990367592422505 0.9993060890111242 528 115
compare GT ALL ALL JointIndel 495225 490334 4891 494916 490472 4444 0.9901236811550306 0.9910206984619612 0.9905719867339353 1534 559
compare BASEPAIR ALL ALL ALL 12661284 12643818 17466 12658838 12643818 15020 0.9986205190563611 0.9988134771927724 0.9987169888043983
compare BASEPAIR ALL ALL Snv 9087804 9085371 2433 9097196 9091342 5854 0.9997322785570639 0.9993565050153915 0.9995443564686981
compare BASEPAIR ALL ALL JointIndel 3587078 3571059 16019 3577802 3568620 9182 0.9955342482098243 0.9974336198593438 0.9964830289490779
```
## Labeled VCF files
The output folder will also contain two VCF files with labeled variants, which are based on the genotype (GT) comparison mode.
Index files (.tbi) are automatically generated for each VCF.
* `truth.vcf.gz` - Contains variants from the truth call set that are labeled as true positives (TP) or false negatives (FN)
* `query.vcf.gz`- Contains variants from the query call set that are labeled as true positives (TP) or false positives (FP)
### VCF details
Each VCF file is constructed using either the truth or query VCF as a source.
The header of each VCF is adjusted to add the Aardvark version (`aardvark_version`), Aardvark command that was used (`aardvark_command`), and any new VCF fields.
Variants inside the VCF have been reformatted to match the internal aardvark representation.
Typically, this just means that multi-allelic sites have been split into multiple entries.
Additionally, Aardvark reports the following fields for each variant:
* `GT` - The GenoType from the input VCF file
* `BD` - The Benchmark Decision for the variant, which will be one of TP (true positive), FP (false positive), or FN (false negative)
* `EA` - The Expected Allele count, which is 2 for a homozygous variant and 1 for a heterozygous variant
* `OA` - The Observed Allele count. All entries in a TP file will have `EA==OA`, all entries in a FN file will have `EA > OA`, and all entries in a FP file will have `EA < OA`.
* `RI` - The Region Id the variant was grouped into for calculating all values. These will correspond to the `region_id` of the [region summary file](#region-summary-file).
### Example
```
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG001
chr1 783006 . A G . . . GT:BD:EA:OA:RI 1/1:TP:2:2:0
chr1 783175 . T C . . . GT:BD:EA:OA:RI 1/1:TP:2:2:1
chr1 784860 . T C . . . GT:BD:EA:OA:RI 1/1:TP:2:2:2
chr1 785417 . G A . . . GT:BD:EA:OA:RI 1/1:TP:2:2:3
chr1 797392 . G A . . . GT:BD:EA:OA:RI 0/1:TP:1:1:4
chr1 798618 . C T . . . GT:BD:EA:OA:RI 1/1:TP:2:2:5
chr1 798662 . G A . . . GT:BD:EA:OA:RI 1/1:TP:2:2:5
chr1 800046 . G A . . . GT:BD:EA:OA:RI 1/1:TP:2:2:6
chr1 801142 . A T . . . GT:BD:EA:OA:RI 1/1:TP:2:2:7
...
```
## Debug folder
The debug folder (`--debug-folder`) contains many files that may be useful for debugging errors from aardvark, or for getting greater details around specific comparisons that were performed.
The following files are currently created when this option is specified:
* `cli_settings.json` - JSON dump of the exact parameters that were used to run Aardvark
* `region_sequences.tsv.gz` - A [region-specific sequence file](#region-sequence-file) containing the Aardvark-constructed haplotype sequences for each sub-problem that was created as part of the Aardvark process. This file is gzip-compressed to reduce storage.
* `region_summary.tsv.gz` - A [region-specific summary file](#region-summary-file), containing summary statistics for each sub-problem that was created as part of the Aardvark process. This file is gzip-compressed to reduce storage.
### Region sequence file
This file captures the constructed haplotype sequences for each region.
Fields:
* `region_id` - A unique region identifier corresponding to the `RI` field of the [debug VCFs](#debug-vcf-details)
* `coordinates` - The coordinates of the corresponding region
* `ref_seq` - The reference sequence within the region, which is copied from the provided reference file
* `truth_seq1` - The first constructed truth haplotype sequence
* `truth_seq2` - The second constructed truth haplotype sequence
* `query_seq1` - The first constructed query haplotype sequence. In an exact match, this is expected to be identical to `truth_seq1`.
* `query_seq2` - The second constructed query haplotype sequence. In an exact match, this is expected to be identical to `truth_seq2`.
Partial example:
```
region_id coordinates ref_seq truth_seq1 truth_seq2 query_seq1 query_seq2
0 chr1:782956-783056 TAATTTTTTATATTGATTGTATACTGCAGTGATAATATTTTGGATGTATCAGGTTAAATAAAATTGACTGATTTCACCTTTTTCCTATTTTAAAAGTGGCT TAATTTTTTATATTGATTGTATACTGCAGTGATAATATTTTGGATGTATCGGGTTAAATAAAATTGACTGATTTCACCTTTTTCCTATTTTAAAAGTGGCT TAATTTTTTATATTGATTGTATACTGCAGTGATAATATTTTGGATGTATCGGGTTAAATAAAATTGACTGATTTCACCTTTTTCCTATTTTAAAAGTGGCT TAATTTTTTATATTGATTGTATACTGCAGTGATAATATTTTGGATGTATCGGGTTAAATAAAATTGACTGATTTCACCTTTTTCCTATTTTAAAAGTGGCT TAATTTTTTATATTGATTGTATACTGCAGTGATAATATTTTGGATGTATCGGGTTAAATAAAATTGACTGATTTCACCTTTTTCCTATTTTAAAAGTGGCT
1 chr1:783125-783225 GTTGATGAAAAATATTGTTGGTGAGCTCTGCTTAGGTAATATATAGGACATGAGCAGAGAGGAGGCACGTGAACAGTTCTGGCCTGGAGTAGGCTTCATTG GTTGATGAAAAATATTGTTGGTGAGCTCTGCTTAGGTAATATATAGGACACGAGCAGAGAGGAGGCACGTGAACAGTTCTGGCCTGGAGTAGGCTTCATTG GTTGATGAAAAATATTGTTGGTGAGCTCTGCTTAGGTAATATATAGGACACGAGCAGAGAGGAGGCACGTGAACAGTTCTGGCCTGGAGTAGGCTTCATTG GTTGATGAAAAATATTGTTGGTGAGCTCTGCTTAGGTAATATATAGGACACGAGCAGAGAGGAGGCACGTGAACAGTTCTGGCCTGGAGTAGGCTTCATTG GTTGATGAAAAATATTGTTGGTGAGCTCTGCTTAGGTAATATATAGGACACGAGCAGAGAGGAGGCACGTGAACAGTTCTGGCCTGGAGTAGGCTTCATTG
...
```
### Region summary file
This file captures summary metrics for each discrete region, similar to the overall summary statistics file.
Fields:
* `region_id` - A unique region identifier corresponding to the `RI` field of the [debug VCFs](#debug-vcf-details)
* `coordinates` - The coordinates of the corresponding region
* `comparison` - The comparison type for the metrics on the row, which will be one of `GT`, `BASEPAIR`, or one of the optional metrics. This is a summary for `ALL` variant types.
* `truth_total`, `truth_tp`, `truth_fn`, `query_tp`, `query_fp`, `metric_recall`, `metric_precision`, `metric_f1`, `truth_fn_gt`, `query_fp_gt` - See definitions for the [summary statistics file](#summary-statistics-file)
Partial example:
```
region_id coordinates comparison truth_total truth_tp truth_fn query_total query_tp query_fp metric_recall metric_precision metric_f1 truth_fn_gt query_fp_gt
0 chr1:782956-783056 GT 1 1 0 1 1 0 1.0 1.0 1.0 0 0
0 chr1:782956-783056 BASEPAIR 4 4 0 4 4 0 1.0 1.0 1.0
1 chr1:783125-783225 GT 1 1 0 1 1 0 1.0 1.0 1.0 0 0
1 chr1:783125-783225 BASEPAIR 4 4 0 4 4 0 1.0 1.0 1.0
...
```
# Complex inputs
## Stratifications
Sometimes it is useful to stratify the results into known regions for further analysis.
Aardvark accepts the stratification format [supported by Hap.py](https://github.com/Illumina/hap.py/blob/master/doc/happy.md#stratification-via-bed-regions) and distributed by [Genome in a Bottle](https://github.com/genome-in-a-bottle/genome-stratifications).
In short, the `--stratifications` parameter accepts the "root" TSV from the stratification, where each row contains a label and corresponding BED file.
This file-of-files may contain paths relative to the root TSV, and those BED files may be gzip-compressed.
A short example is in our [test files](../test_data/example_stratification/strat.tsv), and a real example is available through [GIAB](https://github.com/genome-in-a-bottle/genome-stratifications/blob/master/GRCh38/v3.1-GRCh38-all-stratifications.tsv).
Aardvark takes a haplotype-centric approach to all analysis, including stratification.
This means Aardvark compares the entire region against the stratifications when determining labels.
Most variants in a typical benchmark are isolated, so for those variants Aardvark will just check if the reference bases are fully-contained within the stratification regions.
If two or more variants are located in the same Aardvark region, it instead checks if the bases from the start of the first variant through the end of the last variant are fully-contained.
As a result, stratification regions that are small and co-located with increased variation may be under-reported in the stratifications.
# Frequently asked questions
## Which comparison mode is most like existing benchmark tools?
Other tools, like [hap.py](https://github.com/Illumina/hap.py) or [rtg vcfeval](https://github.com/RealTimeGenomics/rtg-tools), typically use the genotype accuracy for benchmarking.
Aardvark's GenoType (`GT`) comparison type is most similar to these other approaches, requiring that the alternate sequences exactly match while not allowing for partial credit.