Rype
High-performance genomic sequence classification using minimizer-based k-mer sketching in RY (purine/pyrimidine) space.
Overview
Rype is a Rust library and CLI tool for fast sequence classification. It uses a reduced two-letter alphabet (one bit per base) that collapses purines (A/G) and pyrimidines (T/C), enabling:
- Mutation tolerance: Purine-purine and pyrimidine-pyrimidine mutations don't break k-mer matches
- Compact representation: 64bp k-mers fit in a single `u64`
- High performance: Minimizer sketching with O(n) extraction using monotonic deques
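
To make the mutation-tolerance point concrete, here is a minimal sketch that translates sequences into RY space and compares them. The function name `to_ry` is hypothetical and is not part of Rype's API; it only illustrates the idea.

```rust
/// Translate a DNA sequence into RY space (illustrative helper, not Rype's API):
/// 'R' for purines (A/G), 'Y' for pyrimidines (C/T), 'N' for anything else.
fn to_ry(seq: &str) -> String {
    seq.chars()
        .map(|c| match c.to_ascii_uppercase() {
            'A' | 'G' => 'R',
            'C' | 'T' => 'Y',
            _ => 'N',
        })
        .collect()
}
```

With this helper, `to_ry("ACGT")` and `to_ry("GTAC")` produce the same string because every substitution stays within its class, while `to_ry("CCGT")` differs in the first position because A to C crosses from purine to pyrimidine.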
Installation
From source
Building
# Release build
# Development build
Quick Start
Create an index
# Create index from reference sequences
# Create index with one bucket per sequence
# Build from a TOML configuration file
# Merge two indices
Classify sequences
# Single-end reads
# Paired-end reads
# Best-hit-only classification
# Trim reads to first N bases before classification
# Host depletion with a negative index
# Sample-level aggregated classification
# Log-ratio scoring with two single-bucket indices
Index management
# Show index statistics
CLI Reference
rype index
| Subcommand | Description |
|---|---|
| `create` | Build index from FASTA/FASTQ files |
| `from-config` | Build index from TOML configuration file |
| `merge` | Merge two indices (with optional subtraction) |
| `stats` | Show index statistics |
| `bucket-source-detail` | Show source details for a specific bucket |
| `summarize` | Show detailed minimizer statistics (legacy non-Parquet indices only) |
rype classify
| Subcommand | Description |
|---|---|
| `run` | Per-read classification |
| `aggregate` | Sample-level aggregated classification (alias: `agg`) |
| `log-ratio` | Log-ratio scoring using two single-bucket indices |
Configuration File
Build complex indices using a TOML configuration file:
[]
= "output.ryxdi"
= 50
= 0x5555555555555555 # XOR salt for hashing
# k = 64 # K-mer size (default: 64)
[]
= ["ref_a1.fasta", "ref_a2.fasta"]
[]
= ["ref_b.fasta"]
Library Usage
use parse_fastx_file;
use ;
C API
Rype provides a C API for FFI integration:
int
Build and run:
LD_LIBRARY_PATH=target/release
Thread Safety
- Index loading/freeing: NOT thread-safe
- Classification (`rype_classify`): Thread-safe (multiple threads can share the same index)
- Results: NOT thread-safe (each thread needs its own `RypeResultArray`)
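
The sharing pattern these rules imply can be sketched in Rust. Everything here is a hypothetical stand-in: `Index`, `classify`, and `classify_batches` are not Rype's types or functions, and the "classification" is a trivial membership count. The sketch only shows the shape: one read-only index borrowed by all threads, with each thread producing its own result.

```rust
use std::thread;

/// Hypothetical stand-in for a loaded index; read-only once constructed.
struct Index {
    minimizers: Vec<u64>,
}

/// Hypothetical classification call: it only reads from the shared index.
fn classify(index: &Index, query: &[u64]) -> usize {
    query.iter().filter(|m| index.minimizers.contains(m)).count()
}

/// Classify each batch on its own thread. The index is shared by reference;
/// every thread returns its own result, so no result storage is shared.
fn classify_batches(index: &Index, batches: &[Vec<u64>]) -> Vec<usize> {
    thread::scope(|s| {
        let handles: Vec<_> = batches
            .iter()
            .map(|batch| s.spawn(move || classify(index, batch)))
            .collect();
        // Join in spawn order so results line up with the input batches.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

In this arrangement the index is created before the scope and dropped after it, both on the calling thread, which mirrors the rule that loading and freeing must not race with other work.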
Algorithm Details
RY Encoding
The library uses a reduced two-letter alphabet, encoded with one bit per base:
| Base | Category | Encoding |
|---|---|---|
| A, G | Purine | 1 |
| T, C | Pyrimidine | 0 |
| N, etc. | Invalid | Resets k-mer |
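
The table above can be turned into a rolling k-mer encoder. The following is a sketch under assumptions, not Rype's actual implementation: the names `ry_bit` and `ry_kmers` are hypothetical, and the bit order (newest base in the lowest bit) is an arbitrary choice made for the example.

```rust
/// RY class of a base per the table above (hypothetical helper):
/// 1 for purines, 0 for pyrimidines, None for invalid bases such as N.
fn ry_bit(base: u8) -> Option<u64> {
    match base.to_ascii_uppercase() {
        b'A' | b'G' => Some(1),
        b'C' | b'T' => Some(0),
        _ => None,
    }
}

/// Return (start position, packed value) for every valid k-mer in `seq`.
/// An invalid base resets the rolling state, so no k-mer spans it.
fn ry_kmers(seq: &[u8], k: usize) -> Vec<(usize, u64)> {
    assert!(k >= 1 && k <= 64, "k must fit in a u64 at one bit per base");
    // Keep only the lowest k bits; k = 64 uses the full word.
    let mask = if k == 64 { u64::MAX } else { (1u64 << k) - 1 };
    let mut out = Vec::new();
    let mut kmer = 0u64;
    let mut valid = 0usize; // consecutive valid bases seen so far
    for (i, &b) in seq.iter().enumerate() {
        match ry_bit(b) {
            Some(bit) => {
                kmer = ((kmer << 1) | bit) & mask;
                valid += 1;
                if valid >= k {
                    out.push((i + 1 - k, kmer));
                }
            }
            None => {
                kmer = 0;
                valid = 0;
            }
        }
    }
    out
}
```

For example, with `k = 2` the sequence `AGNCT` yields one k-mer before the `N` (`AG`, at position 0) and one after it (`CT`, at position 3); the two windows that would overlap the `N` are skipped.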
Minimizer Extraction
- Slide a window of size `w` over k-mers
- Select the minimum hash value within each window
- Deduplicate consecutive identical minimizers
The implementation uses a monotonic deque for O(n) time complexity.
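
The steps above can be sketched as a sliding-window minimum over precomputed k-mer hashes. This is a minimal illustration, not Rype's code: the function name `minimizers` is hypothetical, the input is assumed to be one hash per k-mer in sequence order, and how those hashes are derived (for example, with the salt) is outside the sketch.

```rust
use std::collections::VecDeque;

/// Minimum of every window of `w` consecutive hashes, with consecutive
/// duplicate minimizers removed. Each index is pushed and popped at most
/// once, so the whole pass is O(n).
fn minimizers(hashes: &[u64], w: usize) -> Vec<u64> {
    assert!(w >= 1, "window size must be at least 1");
    // Indices into `hashes`; their hash values increase from front to back.
    let mut deque: VecDeque<usize> = VecDeque::new();
    let mut out: Vec<u64> = Vec::new();
    for (i, &h) in hashes.iter().enumerate() {
        // Candidates at the back that are >= h can never be a minimum again.
        while let Some(&back) = deque.back() {
            if hashes[back] >= h {
                deque.pop_back();
            } else {
                break;
            }
        }
        deque.push_back(i);
        // Drop the front once it has slid out of the window ending at i.
        if let Some(&front) = deque.front() {
            if front + w <= i {
                deque.pop_front();
            }
        }
        // Emit once the first full window is available.
        if i + 1 >= w {
            let m = hashes[*deque.front().unwrap()];
            if out.last() != Some(&m) {
                out.push(m);
            }
        }
    }
    out
}
```

Note that this sketch deduplicates by hash value only; an implementation that also tracks positions could distinguish two adjacent windows whose minima are equal but come from different k-mers.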
Supported K-mer Sizes
- K = 16, 32, or 64 (stored as `u64`)
Index Format
Rype uses a Parquet-based inverted index format stored as a directory:
index.ryxdi/
├── manifest.toml # Index metadata (k, w, salt, bucket info)
├── buckets.parquet # Bucket metadata (id, name, sources)
└── inverted/
├── shard.0.parquet # Inverted index shard (minimizer -> bucket_id pairs)
└── ... # Additional shards for large indices
This format provides:
- Efficient columnar storage with compression
- Memory-efficient streaming classification
- Support for indices larger than available RAM
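
To illustrate what the minimizer -> bucket_id pairs are used for, here is an in-memory sketch of an inverted-index lookup. It is an assumption-heavy stand-in: the names `InvertedIndex`, `build_index`, and `score_query` are hypothetical, a `HashMap` replaces the Parquet shards, and the score (fraction of query minimizers found in a bucket) is chosen for illustration and is not necessarily the score Rype reports.

```rust
use std::collections::{BTreeMap, HashMap};

/// In-memory stand-in for the inverted index: minimizer -> bucket ids.
type InvertedIndex = HashMap<u64, Vec<u32>>;

/// Build the index from (minimizer, bucket_id) pairs, ignoring repeats.
fn build_index(pairs: &[(u64, u32)]) -> InvertedIndex {
    let mut index = InvertedIndex::new();
    for &(minimizer, bucket_id) in pairs {
        let buckets = index.entry(minimizer).or_default();
        if !buckets.contains(&bucket_id) {
            buckets.push(bucket_id);
        }
    }
    index
}

/// Hypothetical scoring: for each bucket with at least one hit, the
/// fraction of query minimizers that appear in that bucket.
fn score_query(index: &InvertedIndex, query: &[u64]) -> BTreeMap<u32, f64> {
    let mut hits: BTreeMap<u32, usize> = BTreeMap::new();
    for m in query {
        if let Some(buckets) = index.get(m) {
            for &b in buckets {
                *hits.entry(b).or_insert(0) += 1;
            }
        }
    }
    let total = query.len() as f64;
    hits.into_iter().map(|(b, n)| (b, n as f64 / total)).collect()
}
```

Because each lookup touches only the postings for one minimizer, the same access pattern can be served shard by shard from disk, which is what makes streaming over an index larger than RAM plausible.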
Performance Tips
- Use `--release` builds for production
- Batch processing amortizes thread pool overhead
- Parquet format enables memory-efficient classification for large indices
Testing
# Run all tests
# Run tests with output
# Run specific test
Development Setup
# Enable pre-commit hooks (runs fmt, clippy, tests, doc checks)
# Manually run checks
RUSTDOCFLAGS="-D warnings"
The test suite includes automated verification that README examples compile and run correctly (`readme_example_test.rs` and `cli_integration_tests.rs::test_readme_bash_examples`).
License
Licensed under the BSD 3-Clause License.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.