fgumi 0.2.0

High-performance tools for UMI-tagged sequencing data: extraction, grouping, and consensus calling
# Getting Started

This guide walks through a basic fgumi workflow from FASTQ files to filtered consensus reads.

## Prerequisites

- fgumi installed (see [Installation](../index.md#installation))
- A reference genome FASTA (with BWA index)
- Paired-end FASTQ files with UMI sequences
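
The prerequisites above can be checked up front with a short shell sketch. The helper names `require_tool` and `check_bwa_index` are hypothetical and not part of fgumi. The index check assumes the classic `bwa index` outputs (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`) sit next to the FASTA; other aligners such as bwa-mem2 write different files.

```bash
# Hypothetical helper: succeed only if the named tool is on PATH.
require_tool() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "required tool not found on PATH: $1" >&2
    return 1
  }
}

# Hypothetical helper: succeed only if the FASTA and every classic
# `bwa index` output file exist next to it.
check_bwa_index() {
  local ref=$1 ext
  [ -f "$ref" ] || { echo "missing reference FASTA: $ref" >&2; return 1; }
  for ext in amb ann bwt pac sa; do
    [ -f "$ref.$ext" ] || {
      echo "missing BWA index file: $ref.$ext" >&2
      return 1
    }
  done
}

# Example usage (paths are placeholders):
#   require_tool fgumi && require_tool bwa && check_bwa_index ref.fa
```

Failing early here is cheaper than discovering a missing index partway through the alignment step.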

## Basic Workflow

### 1. Extract UMIs from FASTQ

Extract UMIs from FASTQ reads and create an unmapped BAM. The `--read-structures` argument tells fgumi where UMI bases are located in each read. See [Read Structures](read-structures.md) for details.

```bash
fgumi extract \
  --inputs R1.fastq.gz R2.fastq.gz \
  --read-structures +T +M \
  --output unaligned.bam \
  --sample MySample \
  --library MyLibrary
```

### 2. (Optional) Correct UMIs

If your library uses a fixed set of known UMIs, correct sequencing errors in the observed UMI sequences:

```bash
fgumi correct \
  --input unaligned.bam \
  --output corrected.bam \
  --umi-files umis.txt \
  --min-distance 1
```

### 3. Align and Sort

Use fgumi's streaming pipeline to align with BWA and sort into template-coordinate order in a single pass:

```bash
fgumi fastq --input unaligned.bam \
  | bwa mem -p ref.fa - \
  | fgumi zipper --unmapped unaligned.bam \
  | fgumi sort --output sorted.bam --order template-coordinate
```

This pipes reads through:
1. `fastq` — converts unmapped BAM to interleaved FASTQ
2. `bwa mem` — aligns reads to the reference
3. `zipper` — merges aligned reads with original unmapped BAM to restore UMI tags
4. `sort` — sorts into template-coordinate order for grouping

> **Note:** `fgumi zipper` accepts SAM input (piped from the aligner) or a BAM file via `--input`.
> Piping SAM directly from the aligner is preferred for best performance; BAM input is
> functional but involves an extra decode step.
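
By default, a shell pipeline reports only the exit status of its last command, so a failure in `bwa mem` could go unnoticed if `fgumi sort` still exits cleanly. The sketch below wraps the four-stage pipeline from this section in a function and enables bash's `pipefail` option inside a subshell. The function name `align_and_sort` and its argument order are hypothetical; the fgumi and bwa flags are the ones already shown in this guide.

```bash
# Sketch only: `align_and_sort` is a hypothetical wrapper, not part of fgumi.
# Arguments: <unmapped BAM> <reference FASTA> <sorted output BAM>
align_and_sort() {
  local unmapped=$1 ref=$2 sorted=$3
  (
    # With pipefail, the pipeline fails if ANY stage exits non-zero.
    # The subshell keeps the option from leaking into the caller.
    set -o pipefail
    fgumi fastq --input "$unmapped" \
      | bwa mem -p "$ref" - \
      | fgumi zipper --unmapped "$unmapped" \
      | fgumi sort --output "$sorted" --order template-coordinate
  )
}

# Example usage:
#   align_and_sort unaligned.bam ref.fa sorted.bam || echo "pipeline failed" >&2
```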

For single-cell data, the `CB` cell barcode tag is automatically included in the
template-coordinate sort key, keeping templates from different cells at the same locus separate.
No extra flags are needed, so the command is the same as the one above:

```bash
fgumi fastq --input unaligned.bam \
  | bwa mem -p ref.fa - \
  | fgumi zipper --unmapped unaligned.bam \
  | fgumi sort --output sorted.bam --order template-coordinate
```

### 3b. (Optional) Merge Multiple BAMs

If you process multiple lanes or flowcells separately, merge the sorted BAMs before grouping:

```bash
fgumi merge \
  --order template-coordinate \
  --output merged.bam \
  lane1_sorted.bam lane2_sorted.bam lane3_sorted.bam
```

All inputs must be sorted in the same order. For large numbers of files, use `--input-list`:

```bash
fgumi merge \
  --order template-coordinate \
  --input-list bam_paths.txt \
  --output merged.bam
```
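
One way to build the list file is sketched below. It assumes `--input-list` expects one BAM path per line, which this guide does not state explicitly, so confirm the format against the tool's help output. The helper name `write_bam_list` is hypothetical.

```bash
# Hypothetical helper: write each path argument on its own line.
# Assumption: the list file format is one BAM path per line.
write_bam_list() {
  local list=$1
  shift
  printf '%s\n' "$@" > "$list"
}

# Example usage (the glob is expanded by the shell before the call):
#   write_bam_list bam_paths.txt lane*_sorted.bam
```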

For single-cell data, the `CB` cell barcode tag is automatically included in the merge key.

### 4. Group Reads by UMI

Group reads from the same original molecule together.

For duplex workflows, use the `paired` strategy:

```bash
fgumi group \
  --input sorted.bam \
  --output grouped.bam \
  --strategy paired
```

For simplex/codec workflows, use the `adjacency` strategy:

```bash
fgumi group \
  --input sorted.bam \
  --output grouped.bam \
  --strategy adjacency
```

To collect all grouping QC metrics under a single prefix:

```bash
fgumi group \
  --input sorted.bam \
  --output grouped.bam \
  --strategy adjacency \
  --metrics group_metrics
```

This writes `group_metrics.family_sizes.txt`, `group_metrics.grouping_metrics.txt`, and
`group_metrics.position_group_sizes.txt` in one step.
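
In a scripted pipeline it can help to confirm that all three metrics files were written before moving on. The following is a minimal sketch that uses the file names listed above; the helper name `check_group_metrics` is hypothetical.

```bash
# Hypothetical helper: succeed only if all three grouping metrics files
# exist under the given prefix and are non-empty.
check_group_metrics() {
  local prefix=$1 suffix
  for suffix in family_sizes grouping_metrics position_group_sizes; do
    [ -s "$prefix.$suffix.txt" ] || {
      echo "missing or empty metrics file: $prefix.$suffix.txt" >&2
      return 1
    }
  done
}

# Example usage:
#   check_group_metrics group_metrics
```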

See [UMI Grouping](umi-grouping.md) for details on grouping strategies.

### 5. Call Consensus Reads

Choose the consensus calling method based on your library preparation:

**Simplex consensus** (single-strand):
```bash
fgumi simplex \
  --input grouped.bam \
  --output consensus.bam
```

**Duplex consensus** (double-strand):
```bash
fgumi duplex \
  --input grouped.bam \
  --output duplex.bam
```

**CODEC consensus**:
```bash
fgumi codec \
  --input grouped.bam \
  --output codec_consensus.bam
```

See [Consensus Calling](consensus-calling.md) and [Duplex Consensus Calling](duplex-consensus-calling.md) for details.
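
The grouping strategy in step 4 and the consensus command in step 5 both follow from the library type. The sketch below records that pairing as described in this guide; the helper name `workflow_for` is hypothetical and not part of fgumi.

```bash
# Hypothetical helper: print "<group strategy> <consensus subcommand>"
# for a library type, following the pairings described in this guide.
workflow_for() {
  case $1 in
    duplex)  echo "paired duplex" ;;
    simplex) echo "adjacency simplex" ;;
    codec)   echo "adjacency codec" ;;
    *)
      echo "unknown library type: $1" >&2
      return 1
      ;;
  esac
}

# Example usage:
#   set -- $(workflow_for duplex)
#   fgumi group --input sorted.bam --output grouped.bam --strategy "$1"
#   fgumi "$2" --input grouped.bam --output consensus.bam
```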

### 6. (Optional) Collect QC Metrics

Collect QC metrics before filtering to understand your library.

**For simplex libraries**, use `simplex-metrics` on the grouped BAM:

```bash
fgumi simplex-metrics \
  --input grouped.bam \
  --output simplex_metrics
```

**For duplex libraries**, use `duplex-metrics` on the grouped BAM:

```bash
fgumi duplex-metrics \
  --input grouped.bam \
  --output duplex_metrics
```

Both commands write a set of metrics files under the given output prefix. See
[Working with Metrics](working-with-metrics.md) for details on interpreting the output.

### 7. Filter Consensus Reads

Filter consensus reads based on quality metrics. The `--min-reads` format depends on the
consensus type:

**For simplex consensus** (single integer):
```bash
fgumi filter \
  --input consensus.bam \
  --output filtered.bam \
  --ref ref.fa \
  --min-reads 1
```

**For duplex consensus** (three comma-separated values: duplex,AB,BA):
```bash
fgumi filter \
  --input duplex.bam \
  --output filtered.bam \
  --ref ref.fa \
  --min-reads 1,1,1
```
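
Because the expected `--min-reads` format differs between consensus types, a wrapper script may want to validate the value before starting a long filter job. The sketch below checks only the two formats described in this guide (one integer for simplex, three comma-separated integers for duplex). The helper name `valid_min_reads` is hypothetical, and fgumi itself may accept other forms.

```bash
# Hypothetical helper: check a --min-reads value against the formats
# described in this guide. Returns 0 if valid, 1 if not, 2 for an
# unrecognized consensus type.
valid_min_reads() {
  local kind=$1 value=$2 pattern
  case $kind in
    simplex) pattern='^[0-9]+$' ;;
    duplex)  pattern='^[0-9]+,[0-9]+,[0-9]+$' ;;
    *) return 2 ;;
  esac
  printf '%s\n' "$value" | grep -Eq "$pattern"
}

# Example usage:
#   valid_min_reads duplex 1,1,1 || echo "bad --min-reads value" >&2
```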

### 8. (Optional) Clip Overlapping Reads

Clip overlapping bases in read pairs to avoid double-counting evidence:

```bash
fgumi clip \
  --input filtered.bam \
  --output clipped.bam \
  --ref ref.fa
```

## What's Next

- [Best Practices](best-practices.md): recommended parameter settings and pipeline configuration
- [Performance Tuning](performance-tuning.md): threading, memory, and compression optimization
- [Working with Metrics](working-with-metrics.md): understanding fgumi's output metrics