selexqc - RNA Capture-SELEX Library Quality Control
High-performance quality control and filtering tool for RNA Capture-SELEX NGS libraries, written in Rust.
Features
- โก Fast: Parallel processing with zero-copy parsing using
needletail - ๐ Format Support: FASTA, FASTQ, and gzipped formats (
.fa,.fq,.fq.gz) - ๐ Case-Insensitive: Constant region matching ignores case
- ๐๏ธ Structure-Aware: Validates upstream/downstream regions as paired structures
- ๐ Flexible Logic: AND (strict) or OR (lenient) validation modes
- ๐ฏ Filtering: Save valid sequences to file (FASTA/FASTQ/FASTQ.gz)
- ๐ MultiQC Integration: Generate MultiQC-compatible reports
- ๐งต Scalable: Multi-threaded processing for large datasets
- ๐พ Memory Efficient: Streaming processing with minimal memory footprint
Installation
From Source (Cargo)
# Binary: target/release/selexqc
Using Nix
# Binary: result/bin/selexqc
Quick Start
Basic Usage
# Check sequences for constant region presence
With Structure Validation (Recommended)
# N40:N10 library (40bp upstream + 10bp constant + 10bp downstream)
Usage
Required Arguments
| Argument | Description |
|---|---|
-i, --input <FILE> |
Input sequence file (FASTA/FASTQ, optionally gzipped) |
-o, --output <PREFIX> |
Output file prefix |
-c, --constant <SEQ> |
Constant region sequence (case-insensitive) |
Validation Options
| Option | Default | Description |
|---|---|---|
--validation-mode <MODE> |
and |
Validation logic: and (strict) or or (lenient) |
--min-length <INT> |
- | Minimum total sequence length |
--max-length <INT> |
- | Maximum total sequence length |
--upstream-length <INT> |
- | Expected upstream length (before constant) |
--upstream-tolerance <INT> |
- | Upstream length tolerance (+/-) |
--downstream-length <INT> |
- | Expected downstream length (after constant) |
--downstream-tolerance <INT> |
- | Downstream length tolerance (+/-) |
-q, --min-quality <FLOAT> |
- | Minimum average quality score (FASTQ only) |
Filtering Options
| Option | Default | Description |
|---|---|---|
--filter |
disabled | Enable filtering (save valid sequences) |
--filter-format <FORMAT> |
fasta |
Output format: fasta, fastq, or fastq.gz |
Performance & Output
| Option | Default | Description |
|---|---|---|
-t, --threads <INT> |
4 |
Number of threads for parallel processing |
-f, --formats <LIST> |
txt,json |
Report formats (comma-separated: txt,csv,json) |
--multiqc |
disabled | Generate MultiQC-compatible report |
Validation Logic
AND Mode (Strict) - Default
ALL criteria must pass for a sequence to be valid:
- โ Constant region present
- โ Total length in range (if specified)
- โ Upstream length correct (if specified)
- โ Downstream length correct (if specified)
- โ Quality sufficient (if specified)
Use case: Strict quality control for homogeneous libraries
OR Mode (Lenient)
ANY criterion can pass for a sequence to be valid:
- โ Constant region present OR
- โ Total length in range OR
- โ Upstream/downstream correct OR
- โ Quality sufficient
Use case: Mixed libraries or exploratory analysis
Structure Pair Validation
When both --upstream-length and --downstream-length are specified:
- Validates the paired structure: upstream AND downstream together
- Counts sequences where BOTH regions are within tolerance
- Reports paired distribution:
(upstream, downstream)
Example:
Expected: 40bp upstream + 10bp constant + 10bp downstream = 60bp total
Tolerance: ยฑ2bp upstream, ยฑ1bp downstream
โ Valid: 40bp - TGGCCACCAT - 10bp (exact match)
โ Valid: 38bp - TGGCCACCAT - 11bp (within tolerance)
โ Invalid: 40bp - TGGCCACCAT - 15bp (downstream too long)
โ Invalid: 35bp - TGGCCACCAT - 10bp (upstream too short)
Examples
Example 1: N40:N10 Library (Strict Validation)
Expected structure: 40bp variable + 10bp constant + 10bp variable = 60bp total
Output files:
n40n10_qc.validation.txt- Human-readable reportn40n10_qc.stats.json- Complete statisticsn40n10_qc.length_dist.csv- Length distributionn40n10_qc.upstream_dist.csv- Upstream distributionn40n10_qc.downstream_dist.csv- Downstream distributionn40n10_qc.structure_pairs.csv- Paired structure distributionn40n10_qc.filtered.fq.gz- Valid sequences onlyn40n10_qc_mqc.json- MultiQC data
Example 2: N25:N25 Library
Example 3: Mixed Library (OR Mode)
Accept sequences with ANY valid characteristic:
Example 4: Quality Filtering
Filter by quality score (FASTQ only):
Example 5: Structure Analysis Without Filtering
Analyze library structure without saving filtered sequences:
# Review structure pairs
Example 6: Complete Pipeline with MultiQC
# Process multiple samples
for; do
done
# Aggregate with MultiQC
# Use filtered outputs
Output Files
Reports
| File | Format | Description |
|---|---|---|
.validation.txt |
Text | Human-readable summary with all statistics |
.stats.json |
JSON | Complete statistics (machine-readable) |
.length_dist.csv |
CSV | Total sequence length distribution |
.upstream_dist.csv |
CSV | Upstream region length distribution |
.downstream_dist.csv |
CSV | Downstream region length distribution |
.structure_pairs.csv |
CSV | Paired (upstream, downstream) distribution |
_mqc.json |
JSON | MultiQC-compatible data |
Filtered Sequences
| File | Format | Description |
|---|---|---|
.filtered.fa |
FASTA | Valid sequences (uncompressed) |
.filtered.fq |
FASTQ | Valid sequences with quality (uncompressed) |
.filtered.fq.gz |
FASTQ.gz | Valid sequences with quality (compressed) |
MultiQC Report Contents
- Total/valid/filtered sequence counts
- Pass/filter rates
- Validation mode (AND/OR)
- Constant region detection rate
- Structure validation statistics
- Failure reason breakdown
- Filter efficiency metrics
Understanding Results
Text Report Example
RNA Capture-SELEX Library Validation Report
======================================================================
Configuration:
Constant region: TGGCCACCAT
Validation mode: AND (strict)
Total length range: 58 - 62 bp
Expected upstream length: 40 bp (+/- 2)
Expected downstream length: 10 bp (+/- 1)
Summary Statistics:
Total sequences: 10471867
Valid sequences: 9525680 (90.96%)
Invalid (filtered) sequences: 946187 (9.04%)
Validation Results:
Constant region present: 10450000 (99.79%)
Correct total length: 10300000 (98.36%)
Correct upstream: 9800000 (93.78% of sequences with constant)
Correct downstream: 10200000 (97.61% of sequences with constant)
Correct structure (paired): 9525680 (91.15% of sequences with constant)
Failure Reasons:
Incorrect structure: 850320 (89.87% of invalid sequences)
Incorrect total length: 171867 (18.16% of invalid sequences)
Missing constant region: 21867 (2.31% of invalid sequences)
Structure Pair Distribution:
(40, 10): 9000000 sequences (86.12%)
(39, 11): 300000 sequences ( 2.87%)
(41, 9): 225680 sequences ( 2.16%)
...
Structure Pair Distribution
Shows how many sequences have each (upstream, downstream) combination:
upstream_length,downstream_length,count,percentage
40,10,9000000,86.12
39,11,300000,2.87
41,9,225680,2.16
38,12,150000,1.43
This helps identify:
- Most common library structure
- Distribution of variants
- Potential issues with library preparation
Performance
Optimizations:
- Zero-copy parsing with
needletail - Streaming processing (10K sequence chunks)
- Lock-free atomic counters
- Fast substring search with Boyer-Moore (via
memchr)
Tips & Best Practices
Determining Optimal Parameters
- Initial exploration (no structure validation):
-
Check distributions in the report to see actual upstream/downstream lengths
-
Set tolerances based on observed distribution:
- 95% within range โ use that as tolerance
- Large variation โ consider OR mode or wider tolerance
Recommended Workflows
For homogeneous libraries (N40:N10, N25:N25):
- Use AND mode (strict)
- Set structure validation
- Enable filtering
- Use filtered output for downstream analysis
For mixed or exploratory libraries:
- Start with OR mode
- Review structure pairs distribution
- Refine parameters based on results
For quality control in pipelines:
- Enable MultiQC output
- Use consistent parameters across samples
- Compare QC metrics between samples
Batch Processing
# Process all samples in parallel (GNU parallel)
|
# Aggregate results
Troubleshooting
No Valid Sequences Found
Check:
- Constant region sequence is correct
- Constant region is in the same format (DNA/RNA, not complement)
- Review failure statistics in
.validation.txt
Debug:
# Verify constant region is present
# Check if case-sensitive issue (shouldn't be, but verify)
|
Low Structure Match Rate
If paired structure validation fails:
- Check
structure_pairs.csvfor actual distribution - If distribution shows near-misses, increase tolerances
- Consider if library actually has mixed structures (use OR mode)
Example:
# Check top structure pairs
# If you see (38,12) and (42,8) frequently, consider:
# - Increasing tolerance, OR
# - Using OR mode, OR
# - Library has multiple intended structures
Memory Issues
For very large files (>100M reads):
# Reduce thread count to lower memory usage
# Or process in chunks (external tool)
for; do
done
Technical Details
Dependencies
needletail- Fast FASTA/FASTQ parsingrayon- Data parallelismmemchr- Fast substring search (Boyer-Moore-Horspool)flate2- Gzip compression/decompressionserde/serde_json- Serializationcsv- CSV writing
Algorithms
Constant Region Search:
- Boyer-Moore-Horspool via
memchr::memmem - O(n) average case, O(nm) worst case
- SIMD-accelerated when available
Parallel Processing:
- Chunk-based processing (10K sequences per chunk)
- Rayon work-stealing thread pool
- Atomic counters for lock-free statistics
- Minimal memory overhead
Quality Calculation (FASTQ):
- Phred quality: Q = -10 * log10(P)
- ASCII conversion: Q = ASCII - 33
- Average across all positions
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
License
MIT License - see LICENSE file for details
Citation
If you use selexqc in your research, please cite:
selexqc: High-performance quality control for RNA Capture-SELEX libraries
https://github.com/mulatta/selexqc
Support
- Issues: https://github.com/mulatta/selexqc/issues
- Documentation: https://github.com/mulatta/selexqc/wiki
- Discussions: https://github.com/mulatta/selexqc/discussions
Changelog
v0.1.0 (Initial Release)
- Fast parallel processing of FASTA/FASTQ/FASTQ.gz
- Case-insensitive constant region detection
- Structure validation (upstream/downstream pairs)
- AND/OR validation modes
- Filtering with multiple output formats
- MultiQC integration
- Comprehensive reporting (TXT/CSV/JSON)