vcf-reformatter 0.3.0

Fast VCF file parser and reformatter with VEP and SnpEff annotation support which can output to MAF
# VCF Reformatter: What is it?

Ever had a VCF file and wanted to look at the data as you would a normal table? `VCF Reformatter` is here to the rescue!

A Rust command-line tool for parsing and reformatting VCF (Variant Call Format) files, with support for VEP (Variant Effect Predictor) and SnpEff annotations. This tool flattens complex VCF files into tab-separated values (TSV) format for easier downstream analysis.
It is also very handy for quick checks of your data!

# VCF Reformatter

<div align="center">

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Rust](https://img.shields.io/badge/rust-1.70+-blue.svg)](https://www.rust-lang.org)
[![Build Status](https://img.shields.io/badge/build-passing-brightgreen.svg)]()
[![Performance](https://img.shields.io/badge/performance-10k--30k%20variants%2Fsec-green.svg)]()
[![Release](https://img.shields.io/github/v/release/flalom/vcf-reformatter)](https://github.com/flalom/vcf-reformatter/releases)

[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-purple.svg?style=flat)](https://anaconda.org/bioconda/vcf-reformatter)
[![Conda](https://anaconda.org/bioconda/vcf-reformatter/badges/version.svg)](https://anaconda.org/bioconda/vcf-reformatter)

**Transform complex VCF files into clean, analyzable tables with ease**

*A high-performance Rust tool for flattening VCF files with intelligent VEP and SnpEff annotation handling*

</div>

---

## 🚀 Quick Start

```` bash
# Download binary from releases (easiest! You download and use it)
wget https://github.com/flalom/vcf-reformatter/releases/latest/download/vcf-reformatter-v0.3.0-linux-x86_64
chmod +x vcf-reformatter-v0.3.0-linux-x86_64

# Transform your VCF file  
./vcf-reformatter-v0.3.0-linux-x86_64 sample.vcf.gz

# Generate MAF output ⚠️ (in beta!)
./vcf-reformatter-v0.3.0-linux-x86_64 sample.vcf.gz --output-format maf
````
OR install via Bioconda:
```bash
conda install -c bioconda vcf-reformatter
# or
# mamba install vcf-reformatter -c bioconda
```
OR install from [crates.io](https://crates.io/crates/vcf-reformatter):
```bash
cargo install vcf-reformatter
```
OR build from source (you need Rust toolchain):
```` bash
git clone https://github.com/flalom/vcf-reformatter.git
cd vcf-reformatter
cargo build --release
./target/release/vcf-reformatter sample.vcf.gz
````
## ⚠️ Experimental MAF support
**MAF output is currently in beta testing (v0.3.0). Known limitations:**

- VAF calculation needs refinement for some genotype patterns
- Multi-sample handling requires validation
- Use with caution in production workflows
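
For reference, VAF (variant allele frequency) is commonly derived from the per-sample `AD` (allelic depth) field as alt reads divided by ref plus alt reads. The Python sketch below shows that convention for the simple biallelic case only. The function name is hypothetical, and this is not necessarily how VCF Reformatter computes VAF.

```python
def vaf_from_ad(ad):
    """Return alt / (ref + alt) for a biallelic 'ref,alt' AD value, else None.

    Assumption: AD holds exactly two comma-separated read counts.
    Missing ('.'), multi-allelic, or malformed values yield None.
    """
    parts = ad.split(",")
    if len(parts) != 2 or not all(p.isdigit() for p in parts):
        return None
    ref_reads, alt_reads = int(parts[0]), int(parts[1])
    total = ref_reads + alt_reads
    # Avoid dividing by zero when no reads cover the site.
    return alt_reads / total if total else None


print(vaf_from_ad("30,10"))  # 10 alt reads out of 40 total
```

Multi-allelic sites and missing depths are the kind of genotype patterns where a simple ratio like this breaks down, so it is worth spot-checking those rows in beta MAF output.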

**Memory considerations for MAF:**
- Files >100K variants: Monitor memory usage
- Files >1M variants: Ensure adequate RAM (16GB+)


## 🎯 Why VCF Reformatter?

**The Problem:** VCF files are notoriously difficult to analyze. Complex nested annotations, semicolon-separated INFO fields, and multi-transcript VEP annotations make downstream analysis a nightmare.

**The Solution:** VCF Reformatter flattens everything into clean, readable TSV format that works seamlessly with Excel, R, Python, and any analysis tool (⚠️ beware of Excel's automatic conversion of values, e.g. gene symbols turned into dates!).

### Before & After

**Before (Raw VCF):**
```
chr1  69511  .  A  G  1294.53  .  DP=65;AF=1;CSQ=G|missense_variant|MODERATE|OR4F5|ENSG00000186092...
```
**After (Reformatted TSV):**
```
CHROM  POS    REF  ALT  QUAL     INFO_DP  INFO_AF  CSQ_Allele  CSQ_Consequence      CSQ_SYMBOL
chr1   69511  A    G    1294.53  65       1        G           missense_variant     OR4F5
```
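
To make the idea of "flattening" concrete, here is a minimal Python sketch of the transformation shown above. It is an illustration only, not the tool's Rust implementation: the function names are hypothetical, the `"true"` value for flag-style INFO entries is an assumption, and the CSQ field names are assumed to come from the `Format:` description in the VCF header.

```python
def flatten_info(info, prefix="INFO_"):
    """Turn 'DP=65;AF=1' into {'INFO_DP': '65', 'INFO_AF': '1'}."""
    columns = {}
    for entry in info.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        # Assumption: flag entries without '=' are recorded as "true".
        columns[prefix + key] = value if sep else "true"
    return columns


def flatten_csq(csq, field_names):
    """Pair the pipe-separated values of one transcript with their field names."""
    return {"CSQ_" + name: value for name, value in zip(field_names, csq.split("|"))}


row = flatten_info("DP=65;AF=1")
row.update(flatten_csq("G|missense_variant|MODERATE|OR4F5",
                       ["Allele", "Consequence", "IMPACT", "SYMBOL"]))
print(row)
```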

## ✨ Key Features

| Feature                                 | Description                                      | Benefit                                              |
|-----------------------------------------|--------------------------------------------------|------------------------------------------------------|
| 🧬 **VEP/SnpEff Annotation Parsing**    | Intelligent handling of CSQ/ANN annotations      | No more manual parsing of complex VEP/SnpEff output  |
| 👀 **Automatic Annotation Recognition** | Automatic detection of CSQ/ANN annotations       | Saving even more time now for both VEP and SnpEff    |
| 🔀 **Smart Transcript Handling**        | Most severe, first only, or split transcripts    | Choose the analysis approach that fits your needs    |
| 🚀 **Parallel Processing**              | Multi-threaded processing up to 30k variants/sec | Process large cohorts in minutes, not hours          |
| 📁 **Native Compression**               | Direct `.vcf.gz` reading & gzip output           | Seamless workflow with compressed/uncompressed files |
| 🎯 **Production Ready**                 | Comprehensive error handling & logging           | Reliable for automated pipelines                     |
| 🐳 **Container Support**                | Docker & Singularity ready                       | Deploy anywhere, from laptops to HPC clusters        |

---

## 📦 Installation

### Option 1: Download Pre-compiled Binaries (Easiest!)
**No Rust installation required** - just download and run:

1. **Go to [Releases](https://github.com/flalom/vcf-reformatter/releases/latest)**
2. **Download the binary for your platform:**
    - `vcf-reformatter-v0.3.0-linux-x86_64` → **Linux** (most users)
    - `vcf-reformatter-v0.3.0-linux-x86_64-static` → **HPC clusters** (works everywhere)
    - `vcf-reformatter-v0.3.0-windows-x86_64.exe` → **Windows**
    - `vcf-reformatter-v0.3.0-macos-x86_64` → **Intel Mac**
    - `vcf-reformatter-v0.3.0-macos-arm64` → **Apple Silicon Mac** (M1/M2/M3/M4)

3. **Make executable and run:**
````bash
# Linux/Mac
chmod +x vcf-reformatter-*
./vcf-reformatter-* --help

# Windows
# Just double-click or run from command prompt
# C++ might be required, if not already installed
````

### Option 2: **Build from Source**
````bash
git clone https://github.com/flalom/vcf-reformatter.git
cd vcf-reformatter
cargo build --release
````

### Option 3: Docker
```shell script
# Build the container
docker build -t vcf-reformatter .

# Run with your data
docker run --rm -v $(pwd):/data vcf-reformatter /data/sample.vcf.gz
```
### Option 4: Singularity
```shell script
# Build Singularity image
singularity build vcf-reformatter.sif Singularity

# Run on HPC cluster
singularity run --bind $PWD:/data vcf-reformatter.sif /data/sample.vcf.gz -j 16
```

## 🛠️ Usage

### Basic Usage
```shell script
# Simple conversion
vcf-reformatter input.vcf.gz

# Most severe consequence only (recommended for analysis)
vcf-reformatter input.vcf.gz -t most-severe

# All transcripts in separate rows (comprehensive)
vcf-reformatter input.vcf.gz -t split
```
### Annotation Type Detection
```shell script
# Auto-detect annotation type (recommended)
vcf-reformatter input.vcf.gz -a auto

# Force VEP processing
vcf-reformatter vep_annotated.vcf.gz -a vep -t most-severe

# Force SnpEff processing  
vcf-reformatter snpeff_annotated.vcf.gz -a snpeff -t most-severe
```
### Advanced Usage
```shell script
# High-performance processing with compression
vcf-reformatter large_cohort.vcf.gz \
  --transcript-handling most-severe \
  --threads 0 \
  --compress \
  --output-dir results/ \
  --prefix my_analysis \
  --verbose

# Optimized for HPC environments
vcf-reformatter huge_dataset.vcf.gz -t most-severe -j 32 -o /scratch/results/ -c -v
```
### Complete Options
```
Usage: vcf-reformatter [OPTIONS] <INPUT_FILE>

Arguments:
  <INPUT_FILE>  Input VCF file (supports .vcf.gz)

Options:
  --output-format <FORMAT>     Output format [default: tsv] 
                               [values: tsv, maf]
  --center <CENTER>            Sequencing center for MAF output  
  --ncbi-build <BUILD>         Genome build 
                               [default: GRCh38]
  --sample-barcode <BARCODE>   Sample identifier for MAF output
  -t, --transcript-handling <MODE>  How to handle multiple transcripts
                                   [default: first]
                                   [values: most-severe, first, split]
  -a, --annotation-type <N>        Which annotations to parse VEP/SnpEff
                                   [default: auto]
                                   [values: snpeff, vep, auto]
  -j, --threads <N>                Thread count (0 = auto-detect) [default: 1]
  -o, --output-dir <DIR>           Output directory [default: current]
  -p, --prefix <PREFIX>            Output file prefix [default: input filename]
  -c, --compress                   Compress output with gzip
  -v, --verbose                    Detailed performance statistics
  -h, --help                       Show help
  -V, --version                    Show version
```

## 🧬 Transcript Handling Modes

VCF files with VEP annotations often contain multiple transcript annotations per variant. Choose the strategy that fits your analysis:

### 🎯 Most Severe (`--transcript-handling most-severe`)
**Best for:** Clinical analysis, variant prioritization
```shell script
vcf-reformatter input.vcf.gz -t most-severe

# for maf output
vcf-reformatter input.vcf.gz -t most-severe --output-format maf
```
Selects the transcript with the most severe consequence (stop_gained > missense_variant > synonymous_variant, etc.).
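
The selection logic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's actual code: the `SEVERITY` list is abridged to the three terms named above, unknown terms are treated as least severe, and the function names are hypothetical.

```python
# Abridged, illustrative ranking (lower index = more severe).
SEVERITY = ["stop_gained", "missense_variant", "synonymous_variant"]


def severity_rank(consequence):
    """Rank one transcript by its most severe term.

    VEP joins multiple consequence terms for a transcript with '&'.
    Assumption: terms missing from SEVERITY rank below all listed terms.
    """
    ranks = [SEVERITY.index(term) if term in SEVERITY else len(SEVERITY)
             for term in consequence.split("&")]
    return min(ranks)


def most_severe(transcripts):
    """Pick the transcript dict with the lowest rank; ties keep annotation order."""
    return min(transcripts, key=lambda t: severity_rank(t["Consequence"]))


transcripts = [
    {"SYMBOL": "OR4F5", "Consequence": "synonymous_variant"},
    {"SYMBOL": "OR4F5", "Consequence": "missense_variant&splice_region_variant"},
]
print(most_severe(transcripts)["Consequence"])
```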

### ⚡ First Only (`--transcript-handling first`) *[Default]*
**Best for:** Quick analysis, performance-critical workflows
```shell script
vcf-reformatter input.vcf.gz  # Uses first transcript by default
```

Processes only the first transcript annotation (fastest option).

### 📊 Split All (`--transcript-handling split`)
**Best for:** Comprehensive analysis, transcript-level studies
```shell script
vcf-reformatter input.vcf.gz -t split
```
Creates a separate row for each transcript (most detailed output).
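
As a rough picture of what split mode produces, the Python sketch below expands one variant with two VEP transcript entries into two rows. It is a simplified illustration: the function name, the example transcript IDs, and the shortened field list are assumptions, and real CSQ strings carry many more fields.

```python
def split_transcripts(variant, csq, field_names):
    """Return one row per transcript, repeating the shared variant columns."""
    rows = []
    # VEP separates per-transcript entries with commas and fields with pipes.
    for entry in csq.split(","):
        row = dict(variant)  # copy so each row is independent
        row.update({"CSQ_" + name: value
                    for name, value in zip(field_names, entry.split("|"))})
        rows.append(row)
    return rows


variant = {"CHROM": "chr1", "POS": "69511", "REF": "A", "ALT": "G"}
csq = "G|missense_variant|OR4F5|TX_A,G|synonymous_variant|OR4F5|TX_B"  # made-up IDs
for row in split_transcripts(variant, csq, ["Allele", "Consequence", "SYMBOL", "Feature"]):
    print(row)
```

Because the variant columns are repeated on every row, remember to de-duplicate on `CHROM`/`POS`/`REF`/`ALT` before counting variants in split output.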

## 📈 Performance

### Benchmarks
- **Small files** (< 1K variants): ~5,000 variants/sec
- **Medium files** (1K-10K variants): ~15,000 variants/sec
- **Large files** (10K+ variants): ~30,000 variants/sec
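
As a back-of-the-envelope aid, the approximate rates above can be turned into a runtime estimate. The Python sketch below is only that, an estimate: the function name is hypothetical, files of exactly 10K variants are assumed to fall in the "large" tier, and real throughput depends on hardware, thread count, and annotation complexity.

```python
def estimated_seconds(n_variants):
    """Estimate runtime from the approximate throughput tiers listed above."""
    if n_variants < 1_000:
        rate = 5_000       # small files, variants/sec
    elif n_variants < 10_000:
        rate = 15_000      # medium files
    else:
        rate = 30_000      # large files
    return n_variants / rate


# 1,000,000 variants / 30,000 variants per second
print(f"{estimated_seconds(1_000_000):.1f} s")
```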

### Optimization Tips
```shell script
# Auto-detect optimal thread count
vcf-reformatter input.vcf.gz -j 0

# For files > 10K variants, use parallel processing
vcf-reformatter input.vcf.gz -t most-severe -j 0 -v

# Combine with compression for large outputs
vcf-reformatter input.vcf.gz -t split -j 0 -c -v
```

## 📊 Output Format

### File Structure
VCF Reformatter generates two files:
- `{prefix}_header.txt` - Original VCF header and metadata
- `{prefix}_reformatted.tsv` - Flattened tabular data

### Column Types
1. **Standard VCF**: `CHROM`, `POS`, `ID`, `REF`, `ALT`, `QUAL`, `FILTER`
2. **INFO Fields**: `INFO_DP`, `INFO_AF`, `INFO_AC`, etc.
3. **VEP Annotations**: `CSQ_Allele`, `CSQ_Consequence`, `CSQ_SYMBOL`, `CSQ_Gene`, etc.
4. **SnpEff Annotations**: `ANN_Allele`, `ANN_Annotation_Impact`, `ANN_Gene_Name`, `ANN_Distance`, etc.
5. **Sample Data**: `SAMPLE1_GT`, `SAMPLE1_DP`, `SAMPLE1_AD`, etc.
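
Because the column families are distinguished by prefix, downstream scripts can sort a header into groups with simple string checks. The Python sketch below shows one way to do this. The function name is hypothetical, and treating every unmatched column as sample data is an assumption that holds only if sample names do not start with `INFO_`, `CSQ_`, or `ANN_`.

```python
STANDARD = {"CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER"}
PREFIXES = {"INFO_": "info", "CSQ_": "vep", "ANN_": "snpeff"}


def group_columns(columns):
    """Sort TSV column names into the families described above."""
    groups = {"standard": [], "info": [], "vep": [], "snpeff": [], "sample": []}
    for name in columns:
        if name in STANDARD:
            groups["standard"].append(name)
            continue
        for prefix, group in PREFIXES.items():
            if name.startswith(prefix):
                groups[group].append(name)
                break
        else:
            # Assumption: anything left over is a per-sample column such as SAMPLE1_GT.
            groups["sample"].append(name)
    return groups


header = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
          "INFO_DP", "CSQ_Consequence", "CSQ_SYMBOL", "SAMPLE1_GT"]
print(group_columns(header))
```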

### Example Output VEP
```
CHROM  POS    ID     REF  ALT  QUAL     FILTER  INFO_DP  CSQ_Consequence      CSQ_SYMBOL  SAMPLE1_GT
chr1   69511  .      A    G    1294.53  PASS    65       missense_variant     OR4F5       1/1
chr1   69761  rs123  C    T    892.15   PASS    42       synonymous_variant   OR4F5       0/1
```

### Example Output SnpEff
```
CHROM  POS     ID     REF  ALT  QUAL  FILTER  INFO_DP  ANN_Annotation       ANN_Gene_Name  SAMPLE1_GT
chr1   69761   rs587  C    T    730   PASS    214      synonymous_variant   OR4F5          0/1
chr1   924024  .      A    G    53    PASS    409      5_prime_UTR_variant  SAMD11         1/1
```

## 🔧 Integration Examples

### With R
```r
# Read compressed output directly (reading .gz requires the R.utils package)
library(data.table)
data <- fread("output_reformatted.tsv.gz")

# Quick variant summary: count variants per consequence
table(data$CSQ_Consequence)
```

### With Python
```python
import pandas as pd

# Load and analyze
df = pd.read_csv("output_reformatted.tsv.gz", sep="\t", compression="gzip")
df['CSQ_Consequence'].value_counts()
```
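
If the output is too large to load into a DataFrame, the same count can be computed by streaming the file with the Python standard library. This is a minimal sketch: the function name is hypothetical, and the file name and column in the usage note are taken from the example above.

```python
import csv
import gzip
from collections import Counter


def count_values(path, column):
    """Count the values of one column in a gzip-compressed TSV, row by row."""
    counts = Counter()
    # Text mode with newline="" is what the csv module expects.
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            counts[row[column]] += 1
    return counts
```

For example, `count_values("output_reformatted.tsv.gz", "CSQ_Consequence").most_common(10)` lists the ten most frequent consequences while holding only one row in memory at a time.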

### In Workflows
```shell script
# Nextflow pipeline
vcf-reformatter ${vcf} -t most-severe -j ${task.cpus} -o results/ -c

# Snakemake rule
shell: "vcf-reformatter {input.vcf} -t most-severe -j {threads} -o {params.outdir} -c"
```
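
When no workflow manager is involved, the tool can also be driven from a plain Python script. The sketch below only assembles flags documented under Complete Options. The helper names are hypothetical, and it assumes that `vcf-reformatter` is on your `PATH` and that it exits with a non-zero status on failure.

```python
import subprocess


def build_command(vcf, transcript_handling="most-severe", threads=0,
                  output_dir=None, compress=False, output_format="tsv"):
    """Assemble a vcf-reformatter command line from documented flags."""
    cmd = ["vcf-reformatter", vcf,
           "-t", transcript_handling,
           "-j", str(threads),
           "--output-format", output_format]
    if output_dir:
        cmd += ["-o", output_dir]
    if compress:
        cmd.append("-c")
    return cmd


def run_reformatter(vcf, **options):
    """Run the tool; check=True raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(vcf, **options), check=True)


print(" ".join(build_command("sample.vcf.gz", threads=4, output_dir="results/", compress=True)))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.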

## 🐳 Container Usage

### Docker
```shell script
# Build once
docker build -t vcf-reformatter .

# Run anywhere
docker run --rm \
  -v $(pwd):/data \
  vcf-reformatter \
  /data/input.vcf.gz \
  -t most-severe -j 4 -o /data/results/ -c
```

### Singularity (HPC)
```shell script
# On HPC cluster
singularity run \
  --bind $PWD:/data \
  --bind /scratch:/scratch \
  vcf-reformatter.sif \
  /data/large_cohort.vcf.gz \
  -t most-severe -j 16 -o /scratch/results/ -c -v
```
## 🧪 Use Cases

| Use Case | Command | Why It Works |
|----------|---------|--------------|
| **Clinical Variant Review** | `vcf-reformatter variants.vcf.gz -t most-severe` | Prioritizes clinically relevant consequences |
| **Population Analysis** | `vcf-reformatter cohort.vcf.gz -t first -j 0 -c` | Fast processing of large cohorts |
| **Transcript Studies** | `vcf-reformatter genes.vcf.gz -t split -v` | Comprehensive transcript-level analysis |
| **Quick Data Exploration** | `vcf-reformatter sample.vcf.gz` | Simple, fast conversion for immediate analysis |
| **HPC Batch Processing** | `vcf-reformatter huge.vcf.gz -t most-severe -j 32 -c` | Optimized for high-performance computing |

## 🚀 What's New in v0.3.0
- ✅ **MAF Output Support (in Beta ⚠️)** - Direct conversion to Mutation Annotation Format
- ✅ **Auto-metadata Detection (in Beta ⚠️)** - Extracts center/sample info from VCF headers for MAF
- ✅ **Memory-Efficient Processing (streaming)** - Chunked streaming for large files (>>100K variants)
- ✅ **Enhanced Error Handling** - Better processing of malformed files
- ✅ **Comprehensive Testing** - 70+ test cases ensure reliability

## Previous Releases
### 🚀 What's New in v0.2.0
- ✅ **SnpEff Support** - Full ANN field parsing with intelligent detection
- ✅ **Smart Auto-Detection** - Automatically identifies VEP vs SnpEff annotations
- ✅ **Enhanced Error Handling** - Better processing of malformed or headerless files

## TODOs
- ~~Add SnpEff support ✅~~
- ~~Output MAF format option ✅~~
- Add `stdin` to combine with other tools, such as `bcftools`
- Support for multi-sample VCF files in MAF output

## 🤝 Contributing

We welcome contributions! Here's how to get started:

1. **Fork** the repository
2. **Create** a feature branch: `git checkout -b feature-name`
3. **Add tests** for new functionality
4. **Commit** your changes: `git commit -am 'Add feature'`
5. **Push** to the branch: `git push origin feature-name`
6. **Submit** a pull request

### Development Setup
```shell script
git clone https://github.com/flalom/vcf-reformatter.git
cd vcf-reformatter
cargo test  # Run the test suite
cargo run -- data/sample.vcf.gz -v  # Test with sample data
```

## 📝 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

## 🙏 Acknowledgments

- **VCF Format Contributors** - For the standard that enables genomic data sharing
- **VEP Team** - For the powerful variant annotation framework
- **Rust Community** - For the incredible ecosystem that makes this possible
- **Bioinformatics Community** - For feedback and feature requests

---

## Frequently Asked Questions

### Q: Which transcript handling mode should I use?
- **Clinical analysis**: `--transcript-handling most-severe`
- **Quick exploration**: `--transcript-handling first`
- **Comprehensive analysis**: `--transcript-handling split`

### Q: How does this compare to other VCF tools?
VCF Reformatter is specifically designed for:
- Converting complex VEP/SnpEff annotations to tabular format
- Handling multiple transcripts intelligently
- High-performance parallel processing
- Easy integration with R/Python workflows

### Q: Can I use this in production pipelines?
Yes, for TSV output (MAF output is still in beta and should be used with caution). VCF Reformatter is designed for production use with:
- Comprehensive error handling
- Docker/Singularity support
- Automated testing
- Stable CLI

### Q: What's the difference between TSV and MAF output?
- **TSV**: Direct flattening of VCF fields (default)
- **MAF (beta)**: Standardized cancer genomics format for downstream tools

### Q: What if I get out-of-memory errors?
- Use TSV format instead of MAF: `vcf-reformatter file.vcf.gz -j 0 -c`
- Enable verbose mode to monitor: `vcf-reformatter file.vcf.gz -v`

___

## 📞 Support

- **📋 Issues**: [GitHub Issues](https://github.com/flalom/vcf-reformatter/issues)
- **📧 Email**: [fl@flaviolombardo.site](mailto:fl@flaviolombardo.site)

---

<div align="center">

**⭐ Star this repo if VCF Reformatter helps your research!**

Made with ❤️ by [Flavio Lombardo](https://github.com/flalom)

</div>