fgumi 0.2.0

High-performance tools for UMI-tagged sequencing data: extraction, grouping, and consensus calling
Documentation
# Read Structures

## Overview

A *Read Structure* is a string that describes how the bases in a sequencing run should be allocated into logical reads. It serves a similar purpose to the `--use-bases-mask` in Illumina's bcl-convert, but provides additional capabilities.

A Read Structure is a sequence of `<number><operator>` pairs (called *segments*). The last segment may use `+` instead of a number to mean "whatever bases remain." fgumi uses the [`read-structure`](https://crates.io/crates/read-structure) crate for parsing and validation.

Read structures are used primarily in `fgumi extract` to specify where UMI bases, template bases, and other sequences are located in each FASTQ read.

## Operators

Five kinds of operator are supported:

| Operator | Name | Meaning |
|----------|------|---------|
| `T` | Template | Reads of template (e.g. genomic DNA, RNA) |
| `B` | Sample Barcode | Index sequence for sample identification |
| `M` | Molecular Barcode | UMI sequence for identifying the source molecule |
| `C` | Cell Barcode | Index sequence for identifying the cell (single-cell) |
| `S` | Skip | Bases to skip or ignore (e.g. monotemplate from library prep) |

## Rules

- Any number of segments >= 1 is valid
- The length of each segment must be a positive integer >= 1, or `+`
- Only the last segment in a read structure may use `+` for its length
- Adjacent segments may use the same operator (e.g. `6B6B+T` is valid if two sample indices are ligated separately)

## Examples

### Simple paired-end (2x150bp, no indices)

Per-read structures: `+T`, `+T`

### Paired-end with 8bp sample index

Per-read structures: `+T`, `8B`, `+T`

### Paired-end with inline 6bp UMI in R1

Per-read structures: `6M+T`, `8B`, `+T`

The first 6 bases of R1 are the UMI, followed by template.

### Duplex sequencing with dual barcoding and UMI + monotemplate

Per-read structures: `10M5S+T`, `8B`, `8B`, `10M5S+T`

Both R1 and R2 start with a 10bp UMI followed by 5bp of monotemplate (skipped), then template.

### Single-cell with cell barcodes and UMI

Per-read structures: `5C30S5C3S8M+T`, `8B`, `+T`

R1 contains two cell barcodes separated by linker sequences, then a UMI, then template.

## Formal Grammar

```text
<read-structure>     ::= <fixed-structure> <segment>
<fixed-structure>    ::= "" | <fixed-length> <operator> <fixed-structure>
<segment>            ::= <fixed-length> <operator> | <variable-length> <operator>
<operator>           ::= "T" | "B" | "M" | "C" | "S"
<fixed-length>       ::= <non-zero-digit>{<digit>}
<variable-length>    ::= "+"
<any-length>         ::= <fixed-length> | <variable-length>
<non-zero-digit>     ::= "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
<digit>              ::= "0" | <non-zero-digit>
```