fastars 0.1.0

Ultra-fast QC and trimming for short and long reads
# Fastars

**Pure-Rust implementation of QC and trimming for short and long reads.**

Inspired by [fastp](https://github.com/OpenGene/fastp) and [fastplong](https://github.com/OpenGene/fastplong), fastars combines short-read and long-read processing in a single binary. It is designed for high-throughput servers and large-scale parallel processing, with a significantly smaller memory footprint than fastp and comparable performance.

> [!caution]
> This project is AI-aided.

> [!warning]
> It is still under development and has been tested on only a limited number of samples.

## Key Features

- **Unified Tool**: Process both short reads (Illumina) and long reads (PacBio/ONT) with one binary
- **Pure Rust**: No C/C++ dependencies in core logic, safe and portable
- **Memory Efficient**: Uses 40-98% less memory than fastp - ideal for shared servers
- **High Performance**: Matches or exceeds fastp speed at 4+ threads (up to 1.6x faster)
- **fastp/fastplong Compatible**: Familiar CLI interface for easy migration
- **Auto Mode Detection**: Automatically detects short/long reads based on read length

## Performance

Benchmarked against fastp v1.0.1 with SRR21931795 (538K paired-end reads, ~186MB compressed).
The metrics are the average of five runs after a warm-up phase.

| Threads | fastars Time | fastars Mem | fastp Time | fastp Mem | Speedup | Mem Saved |
|---------|--------------|-------------|------------|-----------|---------|-----------|
| 1 | 22.34s | **23MB** **\*** | 16.81s | 1,151MB | 0.75x | **98%** **\*** |
| 4 | 7.50s | 597MB | 7.66s | 1,253MB | **1.02x** | 52% |
| 8 | 4.94s | 250MB | 6.80s | 1,312MB | **1.38x** | 81% |
| **14** | **4.28s** | 215MB | 7.00s | 1,378MB | **1.64x** | 84% |
| 16 | 4.62s | 178MB | 7.05s | 1,411MB | **1.53x** | **87%** |

**\*** Single-thread mode works differently from the multi-threaded modes, which accounts for this figure.


**Summary**:

- **4+ threads**: fastars matches or beats fastp
- **8-16 threads**: 1.4x-1.6x faster with 80%+ less memory
- **Best for**: Multi-core servers and memory-constrained environments
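
The Speedup and Mem Saved columns can be reproduced from the raw timings and memory figures. The sketch below assumes that speedup is fastp time divided by fastars time and that memory saved is the relative reduction against fastp, which is consistent with the table rows; the function names are illustrative and not part of fastars.

```rust
/// Speedup of fastars relative to fastp; values above 1.0 mean fastars is faster.
fn speedup(fastp_secs: f64, fastars_secs: f64) -> f64 {
    fastp_secs / fastars_secs
}

/// Percentage of memory saved relative to fastp.
fn mem_saved_percent(fastp_mb: f64, fastars_mb: f64) -> f64 {
    (1.0 - fastars_mb / fastp_mb) * 100.0
}
```

For the 14-thread row, `speedup(7.00, 4.28)` is about 1.64 and `mem_saved_percent(1378.0, 215.0)` is about 84.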

## fastp Compatibility Verification

fastars v0.7.0 produces **100% identical output sequences** to fastp v1.0.1 when using the same trimming parameters.

### Verification Test

- **Dataset**: SRR29111767 (1.4M paired-end reads)
- **Parameters**: `-3 --cut_mean_quality 20 --disable_adapter_trimming -G`

| Metric | fastars | fastp | Match |
|--------|---------|-------|-------|
| Reads passed | 1,354,558 | 1,354,558 | Yes |
| R1 sequences | 677,279 | 677,279 | **100%** |
| R2 sequences | 677,279 | 677,279 | **100%** |

**Sequence-level verification**: All 677,279 output sequences are byte-for-byte identical between fastars and fastp for both R1 and R2.
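
A sequence-level comparison of this kind can be sketched as follows. This is not the script used for the verification above; it assumes uncompressed, four-line FASTQ records already loaded as text, and the helper names are hypothetical.

```rust
/// Collect the sequence line (second line of every four-line record)
/// from uncompressed FASTQ text.
fn sequences(fastq: &str) -> Vec<&str> {
    fastq.lines().skip(1).step_by(4).collect()
}

/// True when both outputs contain the same sequences in the same order.
/// Read names and quality strings are deliberately ignored.
fn identical_sequences(a: &str, b: &str) -> bool {
    sequences(a) == sequences(b)
}
```

Run it once on the two R1 outputs and once on the two R2 outputs; a `true` result for both corresponds to the 100% rows in the table.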

### Algorithm Compatibility

fastars implements fastp's exact trimming algorithms:
- **Sliding window quality trimming**: Identical window calculation and trim position logic
- **Trailing N removal**: After quality trimming, trailing N bases are removed (fastp behavior)
- **Leading N removal**: After front quality trimming, leading N bases are removed (fastp behavior)

This ensures that fastars can be used as a drop-in replacement for fastp with identical results.
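
To make the steps above concrete, here is a simplified sketch of 3' sliding-window trimming followed by trailing-N removal. It is not the fastars or fastp source: the function name `cut_tail_len`, the one-base window step, the handling of reads shorter than the window, and Phred+33 decoding are all assumptions made for illustration.

```rust
/// Returns how many bases to keep from the 5' end after a simplified
/// 3' sliding-window trim followed by trailing-N removal.
/// Hypothetical helper; assumes Phred+33 quality encoding.
fn cut_tail_len(seq: &[u8], qual: &[u8], window: usize, min_mean: f64) -> usize {
    assert_eq!(seq.len(), qual.len(), "sequence and quality lengths differ");
    let mut keep = seq.len();
    if window == 0 || keep < window {
        return keep; // read shorter than one window: left untouched (assumption)
    }
    // Slide the window from the 3' end toward the 5' end, one base at a time,
    // until a window's mean quality reaches the threshold.
    while keep >= window {
        let sum: u32 = qual[keep - window..keep]
            .iter()
            .map(|&q| u32::from(q.saturating_sub(33)))
            .sum();
        if f64::from(sum) / window as f64 >= min_mean {
            break;
        }
        keep -= 1;
    }
    if keep < window {
        keep = 0; // no window passed the threshold
    }
    // Remove N bases left at the new 3' end.
    while keep > 0 && seq[keep - 1] == b'N' {
        keep -= 1;
    }
    keep
}
```

With a window of 4 and a threshold of 20, the read `ACGTACNNTT` with qualities `IIIIII####` is first cut to 8 bases by the window scan and then to 6 bases once the two trailing `N` bases are removed.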

## Installation

### From crates.io

```bash
# Default build (recommended - uses zlib-ng for fast gzip)
cargo install fastars

# Pure Rust build (slower gzip, but fully portable)
cargo install fastars --no-default-features --features rust_backend
```

### From source

```bash
git clone https://github.com/necoli1822/fastars
cd fastars
cargo build --release
./target/release/fastars --help
```

## Usage

### Auto Mode (Recommended)

```bash
# Automatically detects short or long read mode
fastars -i reads.fq.gz -o filtered.fq.gz
```

### Short-Read Mode (Illumina)

```bash
# Single-end
fastars -i reads.fq.gz -o filtered.fq.gz --mode short

# Paired-end
fastars -i R1.fq.gz -I R2.fq.gz -o out_R1.fq.gz -O out_R2.fq.gz

# With QC reports
fastars -i R1.fq.gz -I R2.fq.gz \
    -o out_R1.fq.gz -O out_R2.fq.gz \
    -j report.json -h report.html
```

### Long-Read Mode (PacBio/ONT)

```bash
# Basic long-read processing
fastars -i long_reads.fq.gz -o filtered.fq.gz --mode long

# With adapter trimming
fastars -i long_reads.fq.gz -o filtered.fq.gz \
    -s "ATCTCTCTCAACAACAACAAC" \
    -E "ATCTCTCTCAACAACAACAAC"

# Quality masking (replace low-quality regions with N)
fastars -i long_reads.fq.gz -o filtered.fq.gz -N

# Read breaking (split at low-quality regions)
fastars -i long_reads.fq.gz -o filtered.fq.gz -b
```

### Quality Trimming

```bash
# Sliding window trimming from both ends
fastars -i reads.fq.gz -o out.fq.gz -5 -3

# Custom quality threshold
fastars -i reads.fq.gz -o out.fq.gz -5 -3 --cut_mean_quality 20
```

### Adapter Trimming

```bash
# Auto-detect adapters for PE reads
fastars -i R1.fq.gz -I R2.fq.gz -o out1.fq.gz -O out2.fq.gz --detect_adapter_for_pe

# Custom adapter sequences
fastars -i R1.fq.gz -I R2.fq.gz -o out1.fq.gz -O out2.fq.gz -a AGATCGGAAGAGC -A AGATCGGAAGAGC
```

### Poly-X Trimming

```bash
# Poly-G trimming (NextSeq/NovaSeq artifacts)
fastars -i reads.fq.gz -o out.fq.gz -g

# Poly-X trimming (any homopolymer)
fastars -i reads.fq.gz -o out.fq.gz -x
```

### UMI Processing Example

```bash
fastars -i reads.fq.gz -o out.fq.gz \
    -U --umi_loc read1 --umi_len 8 --umi_prefix UMI
```

### Paired-End Merging & Correction

```bash
# Merge overlapping PE reads
fastars -i R1.fq.gz -I R2.fq.gz \
    -m --merged_out merged.fq.gz

# Base correction via overlap
fastars -i R1.fq.gz -I R2.fq.gz \
    -o out1.fq.gz -O out2.fq.gz -c
```

### Deduplication

```bash
fastars -i reads.fq.gz -o out.fq.gz -D
```

### Output Splitting Options

```bash
# Split into 4 files
fastars -i reads.fq.gz -o out.fq.gz --split 4
```

## CLI Options (fastp/fastplong Compatible)

### Input/Output

| Option | Description |
|--------|-------------|
| `-i, --in1` | Read 1 input file (required) |
| `-I, --in2` | Read 2 input file (paired-end) |
| `--interleaved_in` | Input is interleaved paired-end data |
| `-o, --out1` | Read 1 output file |
| `-O, --out2` | Read 2 output file |
| `--stdout` | Stream output to stdout |
| `--stdin_format` | Input format for stdin (auto/gzip/plain) |
| `-j, --json` | JSON report output |
| `-h, --html` | HTML report output |
| `-R, --report_title` | Report title (default: "fastars report") |
| `--failed_out` | Failed reads output file |
| `--unpaired1_out` | Unpaired read 1 output file |
| `--unpaired2_out` | Unpaired read 2 output file |
| `--fix_mgi_id` | Fix MGI sequencer IDs to Illumina format |
| `--dont_overwrite` | Do not overwrite existing output files |
| `-w, --thread` | Worker threads (0 = auto) |
| `-z, --compression` | Gzip level 1-9 (default: 4) |

### Mode Selection

| Option | Description |
|--------|-------------|
| `--mode` | Processing mode: auto, short, long (default: auto) |
| `--mode_detect_sample` | Reads to sample for mode detection (default: 100) |
| `--mode_detect_threshold` | Length threshold for mode detection (default: 500bp) |
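
The options above suggest a sample-then-compare procedure. The sketch below is one possible reading, not the fastars implementation: it assumes the mean length of the sampled reads is compared with the threshold, which this README does not state, and `Mode` and `detect_mode` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
enum Mode {
    Short,
    Long,
}

/// Guess the processing mode from the first `sample` read lengths.
/// Assumption: the mean sampled length is compared with `threshold`.
fn detect_mode(read_lengths: &[usize], sample: usize, threshold: usize) -> Mode {
    let sampled = &read_lengths[..read_lengths.len().min(sample)];
    if sampled.is_empty() {
        return Mode::Short; // arbitrary fallback for empty input
    }
    let total: usize = sampled.iter().sum();
    // Equivalent to mean > threshold, without floating point.
    if total > threshold * sampled.len() {
        Mode::Long
    } else {
        Mode::Short
    }
}
```

With the defaults (`sample = 100`, `threshold = 500`), typical 150 bp Illumina reads map to `Mode::Short` and multi-kilobase reads map to `Mode::Long`.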

### Quality Trimming

| Option | Description |
|--------|-------------|
| `-5, --cut_front` | Trim from 5' end |
| `--cut_front_window_size` | Window size for cut_front |
| `--cut_front_mean_quality` | Mean quality for cut_front |
| `-3, --cut_tail` | Trim from 3' end |
| `--cut_tail_window_size` | Window size for cut_tail |
| `--cut_tail_mean_quality` | Mean quality for cut_tail |
| `--cut_right` | Scan from 5' to 3', trim when quality drops |
| `--cut_right_window_size` | Window size for cut_right |
| `--cut_right_mean_quality` | Mean quality for cut_right |
| `--cut_window_size` | Sliding window size (default: 4) |
| `--cut_mean_quality` | Quality threshold (default: 15) |

### Adapter Trimming

| Option | Description |
|--------|-------------|
| `-a, --adapter_sequence` | R1 adapter sequence |
| `-A, --adapter_sequence_r2` | R2 adapter sequence |
| `--adapter_fasta` | FASTA file with adapter sequences |
| `--detect_adapter_for_pe` | Auto-detect adapters |
| `--disable_adapter_trimming` | Disable adapter trimming |

### Long-Read Specific (fastplong compatible)

| Option | Description |
|--------|-------------|
| `-s, --start_adapter` | 5' adapter for long reads |
| `-E, --end_adapter` | 3' adapter for long reads |
| `-d, --distance_threshold` | Adapter distance threshold (default: 0.25) |
| `--trimming_extension` | Extend trimming past adapter (default: 10) |
| `-N, --mask` | Quality masking mode |
| `--mask_window_size` | Window size for masking (default: 50) |
| `--mask_mean_quality` | Mean quality for masking (default: 10) |
| `-b, --break_reads` | Break reads at low-quality regions |
| `--break_window_size` | Window size for breaking (default: 100) |
| `--break_mean_quality` | Mean quality for breaking (default: 10) |

### Quality Filtering

| Option | Description |
|--------|-------------|
| `-Q, --disable_quality_filtering` | Disable quality filtering |
| `-q, --qualified_quality_phred` | Min quality for a base (default: 15) |
| `-u, --unqualified_percent_limit` | Max % unqualified bases (default: 40) |
| `-e, --average_qual` | Min average quality (default: 0) |
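
How these three options interact can be sketched as below. This is an illustration based on the option descriptions, not fastars code: it assumes Phred+33 qualities, a strict "greater than" comparison against the percent limit, and that an average quality of 0 disables the average check.

```rust
/// Decode one Phred+33 quality character.
fn phred33(q: u8) -> u8 {
    q.saturating_sub(33)
}

/// Simplified per-read quality filter (hypothetical helper).
/// `qualified_phred` ~ -q, `unqualified_percent_limit` ~ -u, `average_qual` ~ -e.
fn passes_quality_filter(
    qual: &[u8],
    qualified_phred: u8,
    unqualified_percent_limit: f64,
    average_qual: f64,
) -> bool {
    if qual.is_empty() {
        return false;
    }
    let n = qual.len() as f64;
    // Bases below the per-base threshold count as unqualified.
    let unqualified = qual.iter().filter(|&&q| phred33(q) < qualified_phred).count();
    if 100.0 * unqualified as f64 / n > unqualified_percent_limit {
        return false;
    }
    // Optional mean-quality check; 0 means "disabled" (assumption).
    let mean = qual.iter().map(|&q| f64::from(phred33(q))).sum::<f64>() / n;
    average_qual <= 0.0 || mean >= average_qual
}
```

With the defaults (`-q 15 -u 40 -e 0`), a read in which exactly 40% of bases fall below Q15 still passes in this sketch, while one at 50% is rejected.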

### Length Filtering

| Option | Description |
|--------|-------------|
| `-L, --disable_length_filtering` | Disable length filtering |
| `-l, --length_required` | Minimum length (default: 15) |
| `--length_limit` | Maximum length (0 = no limit) |
| `--max_len1` | Max length for R1 (truncate) |
| `--max_len2` | Max length for R2 (truncate) |

### N Filtering

| Option | Description |
|--------|-------------|
| `-n, --n_base_limit` | Max N bases (default: 5) |
| `--n_percent_limit` | Max N content as % (long mode only) |

### Index Barcode Filtering

| Option | Description |
|--------|-------------|
| `--filter_by_index1` | Filter by index 1 barcode |
| `--filter_by_index2` | Filter by index 2 barcode |
| `--filter_by_index_threshold` | Max mismatches for index filter (default: 0) |

### Complexity Filtering

| Option | Description |
|--------|-------------|
| `-y, --low_complexity_filter` | Enable complexity filter |
| `-Y, --complexity_threshold` | Complexity threshold 0-100 (default: 30) |
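
fastp defines complexity as the percentage of bases that differ from the next base. Assuming fastars follows the same definition, which this README does not spell out, the filter can be sketched as follows; both function names are hypothetical.

```rust
/// Percentage of positions whose base differs from the following base.
fn complexity_percent(seq: &[u8]) -> f64 {
    if seq.len() < 2 {
        return 0.0;
    }
    let changes = seq.windows(2).filter(|pair| pair[0] != pair[1]).count();
    100.0 * changes as f64 / (seq.len() - 1) as f64
}

/// A read passes when its complexity reaches the threshold (0-100).
fn passes_complexity_filter(seq: &[u8], threshold: f64) -> bool {
    complexity_percent(seq) >= threshold
}
```

A homopolymer such as `AAAAAAAAAA` scores 0 and is dropped at the default threshold of 30, while `ACGTACGT` scores 100.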

### Poly-X Trimming

| Option | Description |
|--------|-------------|
| `-g, --trim_poly_g` | Trim poly-G tails |
| `--poly_g_min_len` | Min poly-G length (default: 10) |
| `-G, --disable_trim_poly_g` | Disable poly-G trimming |
| `-x, --trim_poly_x` | Trim poly-X tails |
| `--poly_x_min_len` | Min poly-X length (default: 10) |
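
The minimum-length options can be illustrated with an exact-match tail trimmer. This is a deliberately simplified sketch with a hypothetical function name: real poly-G and poly-X trimmers may tolerate a few mismatches inside the run, which this version does not.

```rust
/// Length of the read after removing a 3' run of `base`,
/// provided the run is at least `min_len` bases long.
fn trim_poly_tail(seq: &[u8], base: u8, min_len: usize) -> usize {
    let run = seq.iter().rev().take_while(|&&b| b == base).count();
    if run >= min_len {
        seq.len() - run
    } else {
        seq.len()
    }
}
```

With `min_len = 10`, a read ending in twelve `G` bases loses that tail, while a read ending in three `G` bases is left unchanged.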

### Global Trimming

| Option | Description |
|--------|-------------|
| `-f, --trim_front1` | Trim N bases from front of R1 |
| `-t, --trim_tail1` | Trim N bases from tail of R1 |
| `-F, --trim_front2` | Trim N bases from front of R2 |
| `-T, --trim_tail2` | Trim N bases from tail of R2 |

### Deduplication

| Option | Description |
|--------|-------------|
| `-D, --dedup` | Enable deduplication |
| `--dup_calc_accuracy` | Accuracy level 1-6 (default: 3) |
| `--dont_eval_duplication` | Disable duplication rate evaluation |

### Overrepresentation Analysis

| Option | Description |
|--------|-------------|
| `-p, --overrepresentation_analysis` | Enable analysis (default: on) |
| `-P, --overrepresentation_sampling` | Sampling rate (default: 20) |

### UMI Processing

| Option | Description |
|--------|-------------|
| `-U, --umi` | Enable UMI processing |
| `--umi_loc` | UMI location: read1, read2, index, per_index |
| `--umi_len` | UMI length (required if --umi enabled) |
| `--umi_prefix` | Prefix added before UMI (default: empty) |
| `--umi_skip` | Skip first N bases before UMI (default: 0) |
| `--umi_separator` | Separator between name and UMI (default: ":") |
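
As an illustration of `--umi_loc read1`, the sketch below moves the first `umi_len` bases of a read into its name. Several details are assumptions borrowed from fastp rather than confirmed fastars behavior: the prefix and UMI are joined with an underscore, and the tag is attached to the first whitespace-delimited token of the name. `--umi_skip` is not modeled, and `extract_umi` is a hypothetical helper.

```rust
/// Returns the new read name and the remaining sequence,
/// or `None` if the read is shorter than the UMI.
fn extract_umi(
    name: &str,
    seq: &str,
    umi_len: usize,
    prefix: &str,
    separator: &str,
) -> Option<(String, String)> {
    if !seq.is_ascii() || seq.len() < umi_len {
        return None;
    }
    let (umi, rest) = seq.split_at(umi_len);
    let tag = if prefix.is_empty() {
        umi.to_string()
    } else {
        format!("{}_{}", prefix, umi) // underscore join is an assumption
    };
    // Attach the tag to the read ID, keeping any trailing comment intact.
    let (id, comment) = match name.find(' ') {
        Some(pos) => name.split_at(pos),
        None => (name, ""),
    };
    Some((format!("{}{}{}{}", id, separator, tag, comment), rest.to_string()))
}
```

In a real pipeline the quality string would have to be shortened by the same `umi_len` bases so that sequence and quality stay aligned.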

### Paired-end Merging

| Option | Description |
|--------|-------------|
| `-m, --merge` | Enable PE read merging |
| `--merged_out` | Output file for merged reads |
| `--out_unmerged1` | Output for unmerged R1 |
| `--out_unmerged2` | Output for unmerged R2 |
| `--merge_min_overlap` | Min overlap for merging (default: 30) |
| `--merge_max_mismatch_ratio` | Max mismatch ratio (default: 0.1) |
| `--merge_correct_mismatches` | Correct mismatches in overlap (default: true) |
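
The roles of `--merge_min_overlap` and `--merge_max_mismatch_ratio` can be shown with a minimal overlap search. This sketch is not the fastars merger: it only handles the case where the 3' end of R1 overlaps the start of the reverse-complemented R2, it takes R1 bases in the overlap without any quality-aware correction, and the function names are hypothetical.

```rust
fn reverse_complement(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|&b| match b {
            b'A' => b'T',
            b'T' => b'A',
            b'C' => b'G',
            b'G' => b'C',
            _ => b'N',
        })
        .collect()
}

/// Merge R1 with the reverse complement of R2 at the longest overlap whose
/// mismatch ratio does not exceed `max_mismatch_ratio`.
fn merge_pair(
    r1: &[u8],
    r2: &[u8],
    min_overlap: usize,
    max_mismatch_ratio: f64,
) -> Option<Vec<u8>> {
    let r2_rc = reverse_complement(r2);
    let max_overlap = r1.len().min(r2_rc.len());
    // Try the longest overlap first: suffix of R1 against prefix of rc(R2).
    for overlap in (min_overlap.max(1)..=max_overlap).rev() {
        let a = &r1[r1.len() - overlap..];
        let b = &r2_rc[..overlap];
        let mismatches = a.iter().zip(b).filter(|(x, y)| x != y).count();
        if mismatches as f64 / overlap as f64 <= max_mismatch_ratio {
            let mut merged = r1.to_vec();
            merged.extend_from_slice(&r2_rc[overlap..]);
            return Some(merged);
        }
    }
    None
}
```

Pairs with no acceptable overlap return `None`; in the CLI such pairs would correspond to the reads written via `--out_unmerged1` and `--out_unmerged2`.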

### Base Correction

| Option | Description |
|--------|-------------|
| `-c, --correction` | Enable overlap-based correction |
| `--overlap_len_require` | Min overlap for correction (default: 30) |
| `--overlap_diff_limit` | Max mismatches for correction (default: 5) |
| `--overlap_diff_percent_limit` | Max mismatch % (default: 5.0%) |
| `--allow_gap_overlap_trimming` | Allow gaps in overlap detection |
| `--overlapped_out` | Output only overlapped region |

### Output Splitting

| Option | Description |
|--------|-------------|
| `--split` | Split output into N files |
| `--split_by_lines` | Split by number of lines (4 lines = 1 read) |
| `--split_prefix_digits` | Digits in split suffix (default: 4) |
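
fastp names split outputs with a zero-padded numeric prefix (for example `0001.out.fq`). Assuming fastars follows the same convention, which this README does not confirm, `--split_prefix_digits` would control the padding as in this sketch; `split_file_name` is a hypothetical helper and expects a bare file name without a directory part.

```rust
/// File name for the `index`-th split part (1-based),
/// with the index zero-padded to `digits` characters.
fn split_file_name(base: &str, index: usize, digits: usize) -> String {
    format!("{:0width$}.{}", index, base, width = digits)
}
```

With the default of 4 digits, the first part of `out.fq.gz` would be named `0001.out.fq.gz`.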

### Other

| Option | Description |
|--------|-------------|
| `-6, --phred64` | Phred64 quality encoding |
| `-V, --verbose` | Verbose output |
| `--reads_to_process` | Number of reads to process (0 = all) |

## License

MIT License. See [LICENSE](LICENSE) for details.

## Author

Sunju Kim (<n.e.coli.1822@gmail.com>)

## Acknowledgments

Inspired by [fastp](https://github.com/OpenGene/fastp) and [fastplong](https://github.com/OpenGene/fastplong) by Shifu Chen.