fastars 0.1.0

Ultra-fast QC and trimming for short and long reads

Fastars

Pure-Rust implementation of QC and trimming for short and long reads.

Inspired by fastp and fastplong, fastars combines short-read and long-read processing in a single binary. It is designed for high-throughput servers and large-scale parallel processing, with a significantly smaller memory footprint than fastp and comparable performance.

[!caution] This project is AI-aided.

[!warning] It is still under development and has only been tested on a limited set of samples.

Key Features

  • Unified Tool: Process both short reads (Illumina) and long reads (PacBio/ONT) with one binary
  • Pure Rust: No C/C++ dependencies in core logic, safe and portable
  • Memory Efficient: Uses 40-98% less memory than fastp - ideal for shared servers
  • High Performance: Matches or exceeds fastp speed at 4+ threads (up to 1.6x faster)
  • fastp/fastplong Compatible: Familiar CLI interface for easy migration
  • Auto Mode Detection: Automatically detects short/long reads based on read length

Performance

Benchmarked against fastp v1.0.1 with SRR21931795 (538K paired-end reads, ~186MB compressed). Each figure is the average of five runs after a warm-up phase.

Threads  fastars Time  fastars Mem  fastp Time  fastp Mem  Speedup  Mem Saved
1        22.34s        23MB *       16.81s      1,151MB    0.75x    98% *
4        7.50s         597MB        7.66s       1,253MB    1.02x    52%
8        4.94s         250MB        6.80s       1,312MB    1.38x    81%
14       4.28s         215MB        7.00s       1,378MB    1.64x    84%
16       4.62s         178MB        7.05s       1,411MB    1.53x    87%

* The single-thread figures differ because single-thread mode works differently from the multi-threaded modes.

Summary:

  • 4+ threads: fastars matches or beats fastp
  • 8-16 threads: 1.4x-1.6x faster with 80%+ less memory
  • Best for: Multi-core servers and memory-constrained environments
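
The Speedup and Mem Saved columns are not defined in the text. The sketch below assumes Speedup is fastp time divided by fastars time and Mem Saved is one minus the ratio of fastars memory to fastp memory; under those assumptions it reproduces every row of the table. The function names are illustrative only.

```python
# Rows copied from the benchmark table:
# threads -> (fastars seconds, fastars MB, fastp seconds, fastp MB)
rows = {
    1: (22.34, 23, 16.81, 1151),
    4: (7.50, 597, 7.66, 1253),
    8: (4.94, 250, 6.80, 1312),
    14: (4.28, 215, 7.00, 1378),
    16: (4.62, 178, 7.05, 1411),
}


def speedup(fastars_s, fastp_s):
    # Assumed definition: values above 1 mean fastars finished sooner.
    return fastp_s / fastars_s


def mem_saved_percent(fastars_mb, fastp_mb):
    # Assumed definition: share of fastp's peak memory that fastars avoids.
    return 100.0 * (1.0 - fastars_mb / fastp_mb)


for threads, (ts, ms, tp, mp) in rows.items():
    print(f"{threads:>2} threads: {speedup(ts, tp):.2f}x, "
          f"{mem_saved_percent(ms, mp):.0f}% less memory")
```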

fastp Compatibility Verification

fastars v0.7.0 produces 100% identical output sequences to fastp v1.0.1 when using the same trimming parameters.

Verification Test

Dataset: SRR29111767 (1.4M paired-end reads)
Parameters: -3 --cut_mean_quality 20 --disable_adapter_trimming -G

Metric        fastars    fastp      Match
Reads passed  1,354,558  1,354,558
R1 sequences  677,279    677,279    100%
R2 sequences  677,279    677,279    100%

Sequence-level verification: All 677,279 output sequences are byte-for-byte identical between fastars and fastp for both R1 and R2.
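
A check of this kind can be repeated on your own data. The sketch below is one possible way to do it, not the script used for the verification above: it assumes four-line FASTQ records and compares only the sequence lines of two output files, in order. The function names and file names are hypothetical.

```python
import gzip
from itertools import zip_longest


def sequences(path):
    """Yield the sequence line (second of every four lines) of a FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index % 4 == 1:
                yield line.rstrip("\n")


def sequences_identical(path_a, path_b):
    """True if both files hold the same sequences in the same order."""
    # zip_longest pads the shorter file with None, so a difference in
    # record count is reported as a mismatch.
    return all(a == b for a, b in zip_longest(sequences(path_a), sequences(path_b)))
```

For example, `sequences_identical("fastars_R1.fq.gz", "fastp_R1.fq.gz")` would compare the R1 outputs of the two tools.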

Algorithm Compatibility

fastars implements fastp's exact trimming algorithms:

  • Sliding window quality trimming: Identical window calculation and trim position logic
  • Trailing N removal: After quality trimming, trailing N bases are removed (fastp behavior)
  • Leading N removal: After front quality trimming, leading N bases are removed (fastp behavior)

This ensures that fastars can be used as a drop-in replacement for fastp with identical results.
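
To make the trimming steps above concrete, here is a simplified sketch of 3' sliding-window trimming followed by trailing-N removal. It is not the code of fastars or fastp: the function names, the one-base step, and the choice to return an empty read when no window reaches the threshold are assumptions made for illustration. Qualities are assumed to be Phred+33.

```python
def mean_quality(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def cut_tail(seq, qual, window=4, min_mean_q=15):
    """Trim the 3' end while the last `window` bases average below `min_mean_q`."""
    end = len(seq)
    # Move the window from the 3' end towards the 5' end, one base at a time.
    while end >= window and mean_quality(qual[end - window:end]) < min_mean_q:
        end -= 1
    if end < window:
        # No full window reached the threshold (assumed handling).
        return "", ""
    # After quality trimming, drop any trailing N bases.
    while end > 0 and seq[end - 1] == "N":
        end -= 1
    return seq[:end], qual[:end]
```

With `window=4` and `min_mean_q=20`, the read `ACGTACNNAC` with qualities `IIIIII####` is first cut to eight bases by the window and then to `ACGTAC` by the trailing-N step.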

Installation

From crates.io

# Default build (recommended - uses zlib-ng for fast gzip)
cargo install fastars

# Pure Rust build (slower gzip, but fully portable)
cargo install fastars --no-default-features --features rust_backend

From source

git clone https://github.com/necoli1822/fastars
cd fastars
cargo build --release
./target/release/fastars --help

Usage

Auto Mode (Recommended)

# Automatically detects short or long read mode
fastars -i reads.fq.gz -o filtered.fq.gz

Short-Read Mode (Illumina)

# Single-end
fastars -i reads.fq.gz -o filtered.fq.gz --mode short

# Paired-end
fastars -i R1.fq.gz -I R2.fq.gz -o out_R1.fq.gz -O out_R2.fq.gz

# With QC reports
fastars -i R1.fq.gz -I R2.fq.gz \
    -o out_R1.fq.gz -O out_R2.fq.gz \
    -j report.json -h report.html

Long-Read Mode (PacBio/ONT)

# Basic long-read processing
fastars -i long_reads.fq.gz -o filtered.fq.gz --mode long

# With adapter trimming
fastars -i long_reads.fq.gz -o filtered.fq.gz \
    -s "ATCTCTCTCAACAACAACAAC" \
    -E "ATCTCTCTCAACAACAACAAC"

# Quality masking (replace low-quality regions with N)
fastars -i long_reads.fq.gz -o filtered.fq.gz -N

# Read breaking (split at low-quality regions)
fastars -i long_reads.fq.gz -o filtered.fq.gz -b

Quality Trimming

# Sliding window trimming from both ends
fastars -i reads.fq.gz -o out.fq.gz -5 -3

# Custom quality threshold
fastars -i reads.fq.gz -o out.fq.gz -5 -3 --cut_mean_quality 20

Adapter Trimming

# Auto-detect adapters for PE reads
fastars -i R1.fq.gz -I R2.fq.gz -o out1.fq.gz -O out2.fq.gz --detect_adapter_for_pe

# Custom adapter sequences
fastars -i R1.fq.gz -I R2.fq.gz -o out1.fq.gz -O out2.fq.gz -a AGATCGGAAGAGC -A AGATCGGAAGAGC

Poly-X Trimming

# Poly-G trimming (NextSeq/NovaSeq artifacts)
fastars -i reads.fq.gz -o out.fq.gz -g

# Poly-X trimming (any homopolymer)
fastars -i reads.fq.gz -o out.fq.gz -x

UMI Processing Example

fastars -i reads.fq.gz -o out.fq.gz \
    -U --umi_loc read1 --umi_len 8 --umi_prefix UMI

Paired-End Merging & Correction

# Merge overlapping PE reads
fastars -i R1.fq.gz -I R2.fq.gz \
    -m --merged_out merged.fq.gz

# Base correction via overlap
fastars -i R1.fq.gz -I R2.fq.gz \
    -o out1.fq.gz -O out2.fq.gz -c

Deduplication

fastars -i reads.fq.gz -o out.fq.gz -D

Output Splitting Options

# Split into 4 files
fastars -i reads.fq.gz -o out.fq.gz --split 4

CLI Options (fastp/fastplong Compatible)

Input/Output

Option Description
-i, --in1 Read 1 input file (required)
-I, --in2 Read 2 input file (paired-end)
--interleaved_in Input is interleaved paired-end data
-o, --out1 Read 1 output file
-O, --out2 Read 2 output file
--stdout Stream output to stdout
--stdin_format Input format for stdin (auto/gzip/plain)
-j, --json JSON report output
-h, --html HTML report output
-R, --report_title Report title (default: "fastars report")
--failed_out Failed reads output file
--unpaired1_out Unpaired read 1 output file
--unpaired2_out Unpaired read 2 output file
--fix_mgi_id Fix MGI sequencer IDs to Illumina format
--dont_overwrite Do not overwrite existing output files
-w, --thread Worker threads (0 = auto)
-z, --compression Gzip level 1-9 (default: 4)

Mode Selection

Option Description
--mode Processing mode: auto, short, long (default: auto)
--mode_detect_sample Reads to sample for mode detection (default: 100)
--mode_detect_threshold Length threshold for mode detection (default: 500bp)
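
The options above suggest that auto mode samples the first reads and compares their length with a threshold. The exact statistic is not documented here, so the sketch below assumes the mean length of the sample is used and that a mean at or above the threshold selects long mode. `detect_mode` is a hypothetical name.

```python
from itertools import islice


def detect_mode(read_lengths, sample=100, threshold=500):
    """Guess 'short' or 'long' from the first `sample` read lengths."""
    sampled = list(islice(read_lengths, sample))
    if not sampled:
        raise ValueError("no reads available for mode detection")
    # Assumption: the mean length of the sample decides the mode.
    mean_length = sum(sampled) / len(sampled)
    return "long" if mean_length >= threshold else "short"
```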

Quality Trimming

Option Description
-5, --cut_front Trim from 5' end
--cut_front_window_size Window size for cut_front
--cut_front_mean_quality Mean quality for cut_front
-3, --cut_tail Trim from 3' end
--cut_tail_window_size Window size for cut_tail
--cut_tail_mean_quality Mean quality for cut_tail
--cut_right Scan from 5' to 3', trim when quality drops
--cut_right_window_size Window size for cut_right
--cut_right_mean_quality Mean quality for cut_right
--cut_window_size Sliding window size (default: 4)
--cut_mean_quality Quality threshold (default: 15)

Adapter Trimming

Option Description
-a, --adapter_sequence R1 adapter sequence
-A, --adapter_sequence_r2 R2 adapter sequence
--adapter_fasta FASTA file with adapter sequences
--detect_adapter_for_pe Auto-detect adapters
--disable_adapter_trimming Disable adapter trimming

Long-Read Specific (fastplong compatible)

Option Description
-s, --start_adapter 5' adapter for long reads
-E, --end_adapter 3' adapter for long reads
-d, --distance_threshold Adapter distance threshold (default: 0.25)
--trimming_extension Extend trimming past adapter (default: 10)
-N, --mask Quality masking mode
--mask_window_size Window size for masking (default: 50)
--mask_mean_quality Mean quality for masking (default: 10)
-b, --break_reads Break reads at low-quality regions
--break_window_size Window size for breaking (default: 100)
--break_mean_quality Mean quality for breaking (default: 10)

Quality Filtering

Option Description
-Q, --disable_quality_filtering Disable quality filtering
-q, --qualified_quality_phred Min quality for a base (default: 15)
-u, --unqualified_percent_limit Max % unqualified bases (default: 40)
-e, --average_qual Min average quality (default: 0)
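
A minimal sketch of how these three thresholds could combine, using the defaults from the table. It assumes Phred+33 input, that a base is "unqualified" when its score is below `-q`, and that `-e 0` disables the average-quality check. The function name and the handling of empty reads are illustrative, not taken from fastars.

```python
def passes_quality_filter(qual, qualified_phred=15, unqualified_percent_limit=40,
                          average_qual=0, offset=33):
    """Return True if a read's quality string passes the filter."""
    scores = [ord(c) - offset for c in qual]
    if not scores:
        return False  # assumed: an empty read never passes
    unqualified = sum(1 for s in scores if s < qualified_phred)
    if 100.0 * unqualified / len(scores) > unqualified_percent_limit:
        return False
    if average_qual > 0 and sum(scores) / len(scores) < average_qual:
        return False
    return True
```

A read with exactly 40% low-quality bases passes under this reading of "max %", while one with 50% fails.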

Length Filtering

Option Description
-L, --disable_length_filtering Disable length filtering
-l, --length_required Minimum length (default: 15)
--length_limit Maximum length (0 = no limit)
--max_len1 Max length for R1 (truncate)
--max_len2 Max length for R2 (truncate)

N Filtering

Option Description
-n, --n_base_limit Max N bases (default: 5)
--n_percent_limit Max N content as % (long mode only)

Index Barcode Filtering

Option Description
--filter_by_index1 Filter by index 1 barcode
--filter_by_index2 Filter by index 2 barcode
--filter_by_index_threshold Max mismatches for index filter (default: 0)

Complexity Filtering

Option Description
-y, --low_complexity_filter Enable complexity filter
-Y, --complexity_threshold Complexity threshold 0-100 (default: 30)
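
fastp defines complexity as the percentage of bases that differ from the next base. Assuming fastars follows the same definition, the score on the 0-100 scale can be sketched as below; `complexity_percent` is a hypothetical name.

```python
def complexity_percent(seq):
    """Percentage of positions where a base differs from the following base."""
    if len(seq) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    return 100.0 * changes / (len(seq) - 1)
```

Under this definition a homopolymer scores 0, and a read would be filtered when its score falls below the threshold.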

Poly-X Trimming

Option Description
-g, --trim_poly_g Trim poly-G tails
--poly_g_min_len Min poly-G length (default: 10)
-G, --disable_trim_poly_g Disable poly-G trimming
-x, --trim_poly_x Trim poly-X tails
--poly_x_min_len Min poly-X length (default: 10)
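
As an illustration of the minimum-length rule, the sketch below removes a run of G at the 3' end only when it is at least `min_len` bases long. It is deliberately simplified: it tolerates no mismatches inside the run, which the real implementation may allow, and the function name is hypothetical.

```python
def trim_poly_g(seq, qual, min_len=10):
    """Drop a trailing run of G (and its qualities) if it is long enough."""
    run = len(seq) - len(seq.rstrip("G"))
    if run >= min_len:
        keep = len(seq) - run
        return seq[:keep], qual[:keep]
    return seq, qual
```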

Global Trimming

Option Description
-f, --trim_front1 Trim N bases from front of R1
-t, --trim_tail1 Trim N bases from tail of R1
-F, --trim_front2 Trim N bases from front of R2
-T, --trim_tail2 Trim N bases from tail of R2

Deduplication

Option Description
-D, --dedup Enable deduplication
--dup_calc_accuracy Accuracy level 1-6 (default: 3)
--dont_eval_duplication Disable duplication rate evaluation

Overrepresentation Analysis

Option Description
-p, --overrepresentation_analysis Enable analysis (default: on)
-P, --overrepresentation_sampling Sampling rate (default: 20)

UMI Processing

Option Description
-U, --umi Enable UMI processing
--umi_loc UMI location: read1, read2, index, per_index
--umi_len UMI length (required if --umi enabled)
--umi_prefix Prefix added before UMI (default: empty)
--umi_skip Skip first N bases before UMI (default: 0)
--umi_separator Separator between name and UMI (default: ":")
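
The sketch below shows what `--umi_loc read1` with `--umi_len 8` might do to a single record. Several details are assumptions borrowed from fastp's documented behaviour rather than confirmed for fastars: the UMI is appended to the first whitespace-delimited part of the read name, and a non-empty prefix is joined to the UMI with an underscore. `--umi_skip` is left out, and `extract_umi` is a hypothetical name.

```python
def extract_umi(name, seq, qual, umi_len=8, prefix="", separator=":"):
    """Move the first `umi_len` bases of a read into its name."""
    umi = seq[:umi_len]
    # Assumed tag format: "<prefix>_<umi>" when a prefix is given.
    tag = f"{prefix}_{umi}" if prefix else umi
    first, space, rest = name.partition(" ")
    new_name = f"{first}{separator}{tag}{space}{rest}"
    return new_name, seq[umi_len:], qual[umi_len:]
```

For a read named `@read1 1:N:0` starting with `ACGTACGT` and a prefix of `UMI`, this yields the name `@read1:UMI_ACGTACGT 1:N:0` and removes those eight bases from the sequence and quality strings.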

Paired-end Merging

Option Description
-m, --merge Enable PE read merging
--merged_out Output file for merged reads
--out_unmerged1 Output for unmerged R1
--out_unmerged2 Output for unmerged R2
--merge_min_overlap Min overlap for merging (default: 30)
--merge_max_mismatch_ratio Max mismatch ratio (default: 0.1)
--merge_correct_mismatches Correct mismatches in overlap (default: true)

Base Correction

Option Description
-c, --correction Enable overlap-based correction
--overlap_len_require Min overlap for correction (default: 30)
--overlap_diff_limit Max mismatches for correction (default: 5)
--overlap_diff_percent_limit Max mismatch % (default: 5.0%)
--allow_gap_overlap_trimming Allow gaps in overlap detection
--overlapped_out Output only overlapped region

Output Splitting

Option Description
--split Split output into N files
--split_by_lines Split by number of lines (4 lines = 1 read)
--split_prefix_digits Digits in split suffix (default: 4)

Other

Option Description
-6, --phred64 Phred64 quality encoding
-V, --verbose Verbose output
--reads_to_process Number of reads to process (0 = all)
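
For context on `-6`: Phred+64 quality strings use an ASCII offset of 64 instead of 33. In fastp the flag marks the input as Phred+64 and the qualities are converted to Phred+33; whether fastars converts the output in the same way is an assumption here. The conversion itself is a fixed shift of 31 per character, as in this sketch with a hypothetical function name.

```python
def phred64_to_phred33(qual):
    """Re-encode a Phred+64 quality string as Phred+33."""
    # 64 - 33 = 31, so every character moves down by 31 code points.
    return "".join(chr(ord(c) - 31) for c in qual)
```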

License

MIT License. See LICENSE for details.

Author

Sunju Kim (n.e.coli.1822@gmail.com)

Acknowledgments

Inspired by fastp and fastplong by Shifu Chen.