
# Fastars

**Pure-Rust implementation of QC and trimming for short and long reads.**

Inspired by [fastp](https://github.com/OpenGene/fastp) and [fastplong](https://github.com/OpenGene/fastplong), fastars combines short-read and long-read processing in a single binary. It is designed for high-throughput servers and large-scale parallel processing, with a significantly smaller memory footprint than fastp and comparable performance.

> [!caution]
> This project is AI-aided.

> [!warning]
> It is still under development and has been tested on only a limited number of samples.

## Key Features

- **Unified Tool**: Process both short reads (Illumina) and long reads (PacBio/ONT) with one binary
- **Pure Rust**: No C/C++ dependencies in core logic, safe and portable
- **Memory Efficient**: Uses 40-98% less memory than fastp - ideal for shared servers
- **High Performance**: Matches or exceeds fastp speed at 4+ threads (up to 1.6x faster)
- **fastp/fastplong Compatible**: Familiar CLI interface for easy migration
- **Auto Mode Detection**: Automatically detects short/long reads based on read length

## Performance

Benchmarked against fastp v1.0.1 with SRR21931795 (538K paired-end reads, ~186MB compressed).
The metrics are the average of five runs following a warm-up phase.

| Threads | fastars Time | fastars Mem | fastp Time | fastp Mem | Speedup | Mem Saved |
|---------|--------------|-------------|------------|-----------|---------|-----------|
| 1 | 22.34s | **23MB** **\*** | 16.81s | 1,151MB | 0.75x | **98%** **\*** |
| 4 | 7.50s | 597MB | 7.66s | 1,253MB | **1.02x** | 52% |
| 8 | 4.94s | 250MB | 6.80s | 1,312MB | **1.38x** | 81% |
| **14** | **4.28s** | 215MB | 7.00s | 1,378MB | **1.64x** | 84% |
| 16 | 4.62s | 178MB | 7.05s | 1,411MB | **1.53x** | **87%** |

**\*** Single-thread mode behaves differently from the multi-threaded modes, which accounts for the starred values.


**Summary**:

- **4+ threads**: fastars matches or beats fastp
- **8-16 threads**: 1.4x-1.6x faster with 80%+ less memory
- **Best for**: Multi-core servers and memory-constrained environments
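
The Speedup and Mem Saved columns can be reproduced from the raw timings and memory figures. The sketch below assumes the usual definitions (speedup as fastp time divided by fastars time, memory saved as the relative reduction against fastp), which match the table values; the function names are illustrative only.

```rust
/// Speedup of fastars relative to fastp.
/// Values above 1.0 mean fastars finished faster.
fn speedup(fastars_secs: f64, fastp_secs: f64) -> f64 {
    fastp_secs / fastars_secs
}

/// Memory saved relative to fastp, as a percentage.
fn mem_saved_percent(fastars_mb: f64, fastp_mb: f64) -> f64 {
    (1.0 - fastars_mb / fastp_mb) * 100.0
}
```

For the 14-thread row, `speedup(4.28, 7.00)` is about 1.64 and `mem_saved_percent(215.0, 1378.0)` is about 84.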

## fastp Compatibility Verification

fastars v0.7.0 produces **100% identical output sequences** to fastp v1.0.1 when using the same trimming parameters.

### Verification Test

- **Dataset**: SRR29111767 (1.4M paired-end reads)
- **Parameters**: `-3 --cut_mean_quality 20 --disable_adapter_trimming -G`

| Metric | fastars | fastp | Match |
|--------|---------|-------|-------|
| Reads passed | 1,354,558 | 1,354,558 | ✓ |
| R1 sequences | 677,279 | 677,279 | **100%** |
| R2 sequences | 677,279 | 677,279 | **100%** |

**Sequence-level verification**: All 677,279 output sequences are byte-for-byte identical between fastars and fastp for both R1 and R2.

### Algorithm Compatibility

fastars implements fastp's exact trimming algorithms:
- **Sliding window quality trimming**: Identical window calculation and trim position logic
- **Trailing N removal**: After quality trimming, trailing N bases are removed (fastp behavior)
- **Leading N removal**: After front quality trimming, leading N bases are removed (fastp behavior)

This ensures that fastars can be used as a drop-in replacement for fastp with identical results.
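
To make the idea of 3' sliding-window trimming followed by trailing-N removal concrete, here is a minimal sketch. It is not the fastars or fastp source and does not claim to reproduce their exact trim positions; the function name and the handling of reads shorter than the window are assumptions. `window` and `mean_quality` correspond to `--cut_window_size` and `--cut_mean_quality`.

```rust
/// Returns how many leading bases to keep after 3' sliding-window trimming.
/// `qual` holds Phred+33 ASCII quality characters.
fn cut_tail_keep_len(seq: &[u8], qual: &[u8], window: usize, mean_quality: f64) -> usize {
    let mut end = seq.len().min(qual.len());
    if window == 0 {
        return end;
    }
    // Slide the window from the 3' end toward the 5' end, dropping one base
    // per step while the window mean stays below the threshold.
    // Assumption: stop once no full window fits; leftover bases are kept.
    while end >= window {
        let sum: u32 = qual[end - window..end]
            .iter()
            .map(|&q| u32::from(q.saturating_sub(33)))
            .sum();
        if f64::from(sum) / window as f64 >= mean_quality {
            break;
        }
        end -= 1;
    }
    // After quality trimming, remove any trailing N bases.
    while end > 0 && seq[end - 1] == b'N' {
        end -= 1;
    }
    end
}
```

With a window of 4 and a threshold of 20, a read `ACGTACGTAN` with qualities `IIIII#####` keeps its first 7 bases in this sketch, because the window covering two Q40 and two Q2 bases averages 21.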

## Installation

### From crates.io

```bash
# Default build (recommended - uses zlib-ng for fast gzip)
cargo install fastars

# Pure Rust build (slower gzip, but fully portable)
cargo install fastars --no-default-features --features rust_backend
```

### From source

```bash
git clone https://github.com/necoli1822/fastars
cd fastars
cargo build --release
./target/release/fastars --help
```

## Usage

### Auto Mode (Recommended)

```bash
# Automatically detects short or long read mode
fastars -i reads.fq.gz -o filtered.fq.gz
```

### Short-Read Mode (Illumina)

```bash
# Single-end
fastars -i reads.fq.gz -o filtered.fq.gz --mode short

# Paired-end
fastars -i R1.fq.gz -I R2.fq.gz -o out_R1.fq.gz -O out_R2.fq.gz

# With QC reports
fastars -i R1.fq.gz -I R2.fq.gz \
    -o out_R1.fq.gz -O out_R2.fq.gz \
    -j report.json -h report.html
```

### Long-Read Mode (PacBio/ONT)

```bash
# Basic long-read processing
fastars -i long_reads.fq.gz -o filtered.fq.gz --mode long

# With adapter trimming
fastars -i long_reads.fq.gz -o filtered.fq.gz \
    -s "ATCTCTCTCAACAACAACAAC" \
    -E "ATCTCTCTCAACAACAACAAC"

# Quality masking (replace low-quality regions with N)
fastars -i long_reads.fq.gz -o filtered.fq.gz -N

# Read breaking (split at low-quality regions)
fastars -i long_reads.fq.gz -o filtered.fq.gz -b
```

### Quality Trimming

```bash
# Sliding window trimming from both ends
fastars -i reads.fq.gz -o out.fq.gz -5 -3

# Custom quality threshold
fastars -i reads.fq.gz -o out.fq.gz -5 -3 --cut_mean_quality 20
```

### Adapter Trimming

```bash
# Auto-detect adapters for PE reads
fastars -i R1.fq.gz -I R2.fq.gz -o out1.fq.gz -O out2.fq.gz --detect_adapter_for_pe

# Custom adapter sequences
fastars -i R1.fq.gz -o out.fq.gz -a AGATCGGAAGAGC -A AGATCGGAAGAGC
```

### Poly-X Trimming

```bash
# Poly-G trimming (NextSeq/NovaSeq artifacts)
fastars -i reads.fq.gz -o out.fq.gz -g

# Poly-X trimming (any homopolymer)
fastars -i reads.fq.gz -o out.fq.gz -x
```

### UMI Processing Example

```bash
fastars -i reads.fq.gz -o out.fq.gz \
    -U --umi_loc read1 --umi_len 8 --umi_prefix UMI
```

### Paired-End Merging & Correction

```bash
# Merge overlapping PE reads
fastars -i R1.fq.gz -I R2.fq.gz \
    -m --merged_out merged.fq.gz

# Base correction via overlap
fastars -i R1.fq.gz -I R2.fq.gz \
    -o out1.fq.gz -O out2.fq.gz -c
```

### Deduplication

```bash
fastars -i reads.fq.gz -o out.fq.gz -D
```

### Output Splitting Options

```bash
# Split into 4 files
fastars -i reads.fq.gz -o out.fq.gz --split 4
```

## CLI Options (fastp/fastplong Compatible)

### Input/Output

| Option | Description |
|--------|-------------|
| `-i, --in1` | Read 1 input file (required) |
| `-I, --in2` | Read 2 input file (paired-end) |
| `--interleaved_in` | Input is interleaved paired-end data |
| `-o, --out1` | Read 1 output file |
| `-O, --out2` | Read 2 output file |
| `--stdout` | Stream output to stdout |
| `--stdin_format` | Input format for stdin (auto/gzip/plain) |
| `-j, --json` | JSON report output |
| `-h, --html` | HTML report output |
| `-R, --report_title` | Report title (default: "fastars report") |
| `--failed_out` | Failed reads output file |
| `--unpaired1_out` | Unpaired read 1 output file |
| `--unpaired2_out` | Unpaired read 2 output file |
| `--fix_mgi_id` | Fix MGI sequencer IDs to Illumina format |
| `--dont_overwrite` | Do not overwrite existing output files |
| `-w, --thread` | Worker threads (0 = auto) |
| `-z, --compression` | Gzip level 1-9 (default: 4) |

### Mode Selection

| Option | Description |
|--------|-------------|
| `--mode` | Processing mode: auto, short, long (default: auto) |
| `--mode_detect_sample` | Reads to sample for mode detection (default: 100) |
| `--mode_detect_threshold` | Length threshold for mode detection (default: 500bp) |
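
One way to picture auto mode detection is the sketch below. It assumes the decision compares the mean length of the sampled reads against the threshold; the statistic fastars actually uses (mean, median, or something else) is not documented here, so treat `detect_mode` and the `Mode` enum as hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum Mode {
    Short,
    Long,
}

/// Classifies input from the lengths of the first `sample` reads.
/// Assumption: the mean sampled length is compared against `threshold`.
fn detect_mode(read_lengths: &[usize], sample: usize, threshold: usize) -> Mode {
    let sampled = &read_lengths[..read_lengths.len().min(sample)];
    if sampled.is_empty() {
        // Nothing to sample: fall back to short-read mode (an assumption).
        return Mode::Short;
    }
    let mean = sampled.iter().sum::<usize>() as f64 / sampled.len() as f64;
    if mean > threshold as f64 {
        Mode::Long
    } else {
        Mode::Short
    }
}
```

With the defaults (`sample = 100`, `threshold = 500`), typical 150 bp Illumina reads land in short mode and multi-kilobase reads land in long mode.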

### Quality Trimming

| Option | Description |
|--------|-------------|
| `-5, --cut_front` | Trim from 5' end |
| `--cut_front_window_size` | Window size for cut_front |
| `--cut_front_mean_quality` | Mean quality for cut_front |
| `-3, --cut_tail` | Trim from 3' end |
| `--cut_tail_window_size` | Window size for cut_tail |
| `--cut_tail_mean_quality` | Mean quality for cut_tail |
| `--cut_right` | Scan from 5' to 3', trim when quality drops |
| `--cut_right_window_size` | Window size for cut_right |
| `--cut_right_mean_quality` | Mean quality for cut_right |
| `--cut_window_size` | Sliding window size (default: 4) |
| `--cut_mean_quality` | Quality threshold (default: 15) |

### Adapter Trimming

| Option | Description |
|--------|-------------|
| `-a, --adapter_sequence` | R1 adapter sequence |
| `-A, --adapter_sequence_r2` | R2 adapter sequence |
| `--adapter_fasta` | FASTA file with adapter sequences |
| `--detect_adapter_for_pe` | Auto-detect adapters |
| `--disable_adapter_trimming` | Disable adapter trimming |

### Long-Read Specific (fastplong compatible)

| Option | Description |
|--------|-------------|
| `-s, --start_adapter` | 5' adapter for long reads |
| `-E, --end_adapter` | 3' adapter for long reads |
| `-d, --distance_threshold` | Adapter distance threshold (default: 0.25) |
| `--trimming_extension` | Extend trimming past adapter (default: 10) |
| `-N, --mask` | Quality masking mode |
| `--mask_window_size` | Window size for masking (default: 50) |
| `--mask_mean_quality` | Mean quality for masking (default: 10) |
| `-b, --break_reads` | Break reads at low-quality regions |
| `--break_window_size` | Window size for breaking (default: 100) |
| `--break_mean_quality` | Mean quality for breaking (default: 10) |

### Quality Filtering

| Option | Description |
|--------|-------------|
| `-Q, --disable_quality_filtering` | Disable quality filtering |
| `-q, --qualified_quality_phred` | Min quality for a base (default: 15) |
| `-u, --unqualified_percent_limit` | Max % unqualified bases (default: 40) |
| `-e, --average_qual` | Min average quality (default: 0) |
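
The three thresholds interact roughly as in the sketch below, which assumes fastp-style semantics: a base is unqualified when its Phred score is below `-q`, a read fails when the share of unqualified bases exceeds `-u`, and the average-quality check applies only when `-e` is above 0. The function is illustrative, not the fastars implementation.

```rust
/// Returns true if a read passes the quality filter.
/// `qual` holds Phred+33 ASCII quality characters.
fn passes_quality_filter(
    qual: &[u8],
    qualified_phred: u8,            // -q
    unqualified_percent_limit: f64, // -u
    average_qual: f64,              // -e (0 disables the check)
) -> bool {
    if qual.is_empty() {
        return false;
    }
    let phred: Vec<u8> = qual.iter().map(|&q| q.saturating_sub(33)).collect();
    let unqualified = phred.iter().filter(|&&q| q < qualified_phred).count();
    let percent = unqualified as f64 * 100.0 / phred.len() as f64;
    if percent > unqualified_percent_limit {
        return false;
    }
    if average_qual > 0.0 {
        let mean = phred.iter().map(|&q| f64::from(q)).sum::<f64>() / phred.len() as f64;
        if mean < average_qual {
            return false;
        }
    }
    true
}
```

With the defaults (`-q 15 -u 40`), a 10-base read with four Q2 bases sits exactly at the 40% limit and passes, while one with five Q2 bases fails.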

### Length Filtering

| Option | Description |
|--------|-------------|
| `-L, --disable_length_filtering` | Disable length filtering |
| `-l, --length_required` | Minimum length (default: 15) |
| `--length_limit` | Maximum length (0 = no limit) |
| `--max_len1` | Max length for R1 (truncate) |
| `--max_len2` | Max length for R2 (truncate) |

### N Filtering

| Option | Description |
|--------|-------------|
| `-n, --n_base_limit` | Max N bases (default: 5) |
| `--n_percent_limit` | Max N content as % (long mode only) |

### Index Barcode Filtering

| Option | Description |
|--------|-------------|
| `--filter_by_index1` | Filter by index 1 barcode |
| `--filter_by_index2` | Filter by index 2 barcode |
| `--filter_by_index_threshold` | Max mismatches for index filter (default: 0) |

### Complexity Filtering

| Option | Description |
|--------|-------------|
| `-y, --low_complexity_filter` | Enable complexity filter |
| `-Y, --complexity_threshold` | Complexity threshold 0-100 (default: 30) |
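
fastp defines complexity as the percentage of bases that differ from the next base. Assuming fastars follows the same definition (this README does not state it explicitly), the filter could be sketched as:

```rust
/// Percentage (0-100) of bases that differ from the base that follows them.
fn complexity_percent(seq: &[u8]) -> f64 {
    if seq.len() < 2 {
        return 0.0;
    }
    let diff = seq.windows(2).filter(|w| w[0] != w[1]).count();
    diff as f64 * 100.0 / (seq.len() - 1) as f64
}

/// A read passes when its complexity reaches the threshold (-Y).
fn passes_complexity_filter(seq: &[u8], threshold: f64) -> bool {
    complexity_percent(seq) >= threshold
}
```

Under this definition a homopolymer such as `AAAAAAAAAA` scores 0 and `AAAAACCCCC` scores about 11, so both would be rejected at the default threshold of 30.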

### Poly-X Trimming

| Option | Description |
|--------|-------------|
| `-g, --trim_poly_g` | Trim poly-G tails |
| `--poly_g_min_len` | Min poly-G length (default: 10) |
| `-G, --disable_trim_poly_g` | Disable poly-G trimming |
| `-x, --trim_poly_x` | Trim poly-X tails |
| `--poly_x_min_len` | Min poly-X length (default: 10) |
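
As a simplified illustration of poly-G and poly-X tail trimming, the sketch below removes an exact 3' homopolymer run once it reaches the minimum length. Real implementations such as fastp tolerate occasional mismatches inside the run; this version does not, and the function name is hypothetical.

```rust
/// Returns the number of bases to keep after removing a 3' run of `base`.
/// The run is trimmed only if it is at least `min_len` bases long.
/// Simplification: no mismatches are tolerated inside the run.
fn trim_poly_tail(seq: &[u8], base: u8, min_len: usize) -> usize {
    let run = seq.iter().rev().take_while(|&&b| b == base).count();
    if run >= min_len {
        seq.len() - run
    } else {
        seq.len()
    }
}
```

With the default minimum length of 10, a read ending in ten or more `G` bases loses that tail, while a read ending in `GGG` is left untouched.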

### Global Trimming

| Option | Description |
|--------|-------------|
| `-f, --trim_front1` | Trim N bases from front of R1 |
| `-t, --trim_tail1` | Trim N bases from tail of R1 |
| `-F, --trim_front2` | Trim N bases from front of R2 |
| `-T, --trim_tail2` | Trim N bases from tail of R2 |

### Deduplication

| Option | Description |
|--------|-------------|
| `-D, --dedup` | Enable deduplication |
| `--dup_calc_accuracy` | Accuracy level 1-6 (default: 3) |
| `--dont_eval_duplication` | Disable duplication rate evaluation |

### Overrepresentation Analysis

| Option | Description |
|--------|-------------|
| `-p, --overrepresentation_analysis` | Enable analysis (default: on) |
| `-P, --overrepresentation_sampling` | Sampling rate (default: 20) |

### UMI Processing

| Option | Description |
|--------|-------------|
| `-U, --umi` | Enable UMI processing |
| `--umi_loc` | UMI location: read1, read2, index, per_index |
| `--umi_len` | UMI length (required if --umi enabled) |
| `--umi_prefix` | Prefix added before UMI (default: empty) |
| `--umi_skip` | Skip first N bases before UMI (default: 0) |
| `--umi_separator` | Separator between name and UMI (default: ":") |
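
The following sketch shows what UMI extraction from read 1 might look like. It assumes fastp's naming convention (the tag is appended to the first whitespace-delimited token of the read name, and a non-empty prefix is joined to the UMI with an underscore); whether fastars formats names identically is an assumption, and `--umi_skip` is not modeled.

```rust
/// Moves the first `umi_len` bases of a read into its name.
/// Returns (new_name, remaining_seq, remaining_qual), or None if the read
/// is shorter than the UMI. FASTQ fields are assumed to be ASCII.
fn extract_umi<'a>(
    name: &str,
    seq: &'a str,
    qual: &'a str,
    umi_len: usize,
    prefix: &str,
    separator: &str,
) -> Option<(String, &'a str, &'a str)> {
    if seq.len() < umi_len || qual.len() < umi_len {
        return None;
    }
    let (umi, rest_seq) = seq.split_at(umi_len);
    let rest_qual = &qual[umi_len..];
    // Assumption (fastp convention): prefix and UMI are joined by '_'.
    let tag = if prefix.is_empty() {
        umi.to_string()
    } else {
        format!("{}_{}", prefix, umi)
    };
    // Attach the tag to the read ID, keeping any trailing comment intact.
    let new_name = match name.split_once(' ') {
        Some((id, comment)) => format!("{}{}{} {}", id, separator, tag, comment),
        None => format!("{}{}{}", name, separator, tag),
    };
    Some((new_name, rest_seq, rest_qual))
}
```

With `--umi_len 8 --umi_prefix UMI` and the default separator, a read named `@read1 1:N:0` starting with `AACCGGTT` would become `@read1:UMI_AACCGGTT 1:N:0` in this sketch.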

### Paired-end Merging

| Option | Description |
|--------|-------------|
| `-m, --merge` | Enable PE read merging |
| `--merged_out` | Output file for merged reads |
| `--out_unmerged1` | Output for unmerged R1 |
| `--out_unmerged2` | Output for unmerged R2 |
| `--merge_min_overlap` | Min overlap for merging (default: 30) |
| `--merge_max_mismatch_ratio` | Max mismatch ratio (default: 0.1) |
| `--merge_correct_mismatches` | Correct mismatches in overlap (default: true) |
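
Overlap-based merging can be pictured with the sketch below: R2 is reverse-complemented, the longest suffix of R1 that aligns with the start of that sequence within the mismatch ratio is found, and the two are joined. This is a simplified illustration under stated assumptions (no quality-aware mismatch correction, no adapter read-through handling), not the fastars algorithm; both function names are hypothetical.

```rust
/// Reverse-complements a DNA sequence (non-ACGT bytes become N).
fn reverse_complement(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|&b| match b {
            b'A' => b'T',
            b'C' => b'G',
            b'G' => b'C',
            b'T' => b'A',
            _ => b'N',
        })
        .collect()
}

/// Tries to merge R1 with the reverse complement of R2.
/// Returns None when no overlap satisfies both thresholds.
fn merge_pair(
    r1: &[u8],
    r2: &[u8],
    min_overlap: usize,      // --merge_min_overlap
    max_mismatch_ratio: f64, // --merge_max_mismatch_ratio
) -> Option<Vec<u8>> {
    let r2_rc = reverse_complement(r2);
    // Try each overlap start in R1, longest overlap first.
    for offset in 0..r1.len() {
        let overlap = r1.len() - offset;
        if overlap < min_overlap {
            break;
        }
        if overlap > r2_rc.len() {
            continue; // R2 ends inside R1; not handled in this sketch
        }
        let mismatches = r1[offset..]
            .iter()
            .zip(&r2_rc[..overlap])
            .filter(|(a, b)| a != b)
            .count();
        if mismatches as f64 <= max_mismatch_ratio * overlap as f64 {
            // Keep the non-overlapping 5' part of R1, then all of R2's strand.
            let mut merged = r1[..offset].to_vec();
            merged.extend_from_slice(&r2_rc);
            return Some(merged);
        }
    }
    None
}
```

In the overlap region this sketch simply takes the R2-derived bases; a real merger would typically pick the higher-quality base at each mismatch.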

### Base Correction

| Option | Description |
|--------|-------------|
| `-c, --correction` | Enable overlap-based correction |
| `--overlap_len_require` | Min overlap for correction (default: 30) |
| `--overlap_diff_limit` | Max mismatches for correction (default: 5) |
| `--overlap_diff_percent_limit` | Max mismatch % (default: 5.0%) |
| `--allow_gap_overlap_trimming` | Allow gaps in overlap detection |
| `--overlapped_out` | Output only overlapped region |

### Output Splitting

| Option | Description |
|--------|-------------|
| `--split` | Split output into N files |
| `--split_by_lines` | Split by number of lines (4 lines = 1 read) |
| `--split_prefix_digits` | Digits in split suffix (default: 4) |

### Other

| Option | Description |
|--------|-------------|
| `-6, --phred64` | Phred64 quality encoding |
| `-V, --verbose` | Verbose output |
| `--reads_to_process` | Number of reads to process (0 = all) |

## License

MIT License. See [LICENSE](LICENSE) for details.

## Author

Sunju Kim (<n.e.coli.1822@gmail.com>)

## Acknowledgments

Inspired by [fastp](https://github.com/OpenGene/fastp) and [fastplong](https://github.com/OpenGene/fastplong) by Shifu Chen.