piscem-rs
A Rust implementation of piscem, a tool for k-mer-based read mapping against a compacted de Bruijn graph index built on the SSHash data structure.
piscem-rs produces semantically equivalent output to the C++ piscem, with 100% record-level parity across all tested modes and datasets.
Features
- Index construction from cuttlefish output (.cf_seg, .cf_seq, .json)
- Bulk RNA-seq mapping (single-end and paired-end) to RAD format
- Single-cell RNA-seq mapping with barcode-aware protocols (10x Chromium V2/V3/V4, custom geometries)
- Single-cell ATAC-seq mapping with Tn5 shift correction, mate overlap detection, and genome binning
- Poison k-mer filtering via decoy-aware index construction
- Permissive and strict k-mer skipping strategies with contig-walking acceleration
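The Tn5 shift correction and genome binning steps listed above can be illustrated with a minimal sketch. It assumes the conventional ATAC-seq offsets (+4 bp on the forward strand, -5 bp on the reverse strand) and fixed-width bins. The names `Strand`, `tn5_shift`, and `bin_index`, as well as the offsets themselves, are illustrative assumptions and not piscem-rs's actual API or defaults.

```rust
/// Strand of a mapped fragment end.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Strand {
    Forward,
    Reverse,
}

// Conventional Tn5 offsets (assumed here, not taken from piscem-rs).
const FWD_SHIFT: i64 = 4;
const REV_SHIFT: i64 = -5;

/// Shift a 0-based reference position toward the estimated Tn5 cut site,
/// clamping the result to the range [0, ref_len - 1].
pub fn tn5_shift(pos: u64, strand: Strand, ref_len: u64) -> u64 {
    let delta = match strand {
        Strand::Forward => FWD_SHIFT,
        Strand::Reverse => REV_SHIFT,
    };
    let shifted = pos as i64 + delta;
    shifted.clamp(0, ref_len.saturating_sub(1) as i64) as u64
}

/// Map a reference position to a fixed-width genome bin.
pub fn bin_index(pos: u64, bin_size: u64) -> u64 {
    assert!(bin_size > 0, "bin_size must be positive");
    pos / bin_size
}
```

For example, `tn5_shift(100, Strand::Forward, 1000)` yields 104, and `bin_index(104, 50)` places that position in bin 2.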
Quick start
Building an index
# From cuttlefish output
# With equivalence class table
Mapping reads
# Bulk paired-end
# Single-cell RNA (10x Chromium V3)
# Single-cell ATAC
Building from source
Requires Rust 1.85+.
After a release build (cargo build --release), the binary will be at target/release/piscem-rs.
Parity with C++ piscem
piscem-rs is validated against C++ piscem using record-level RAD output comparison:
| Mode | Dataset | Mapping Rate | Record Parity |
|---|---|---|---|
| Bulk PE | gencode v44 (1M reads) | 96.46% | 100% |
| Bulk PE + poison | gencode v44 (1M reads) | 96.15% | 100% |
| Bulk PE strict | gencode v44 (1M reads) | 96.46% | 100% |
| scRNA | SRR12623882 (Chromium V3) | — | 100% |
| scATAC | 5M ATAC reads (hg38 k25) | 98.33% | 100% |
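One way to think about record-level parity is as a multiset comparison of mapping records from the two tools. The sketch below is only an illustration of that idea: the `Record` fields and the `record_parity` function are hypothetical, real RAD records carry more information, and the README does not state whether the actual validation is order-sensitive.

```rust
use std::collections::HashMap;

/// A simplified mapping record (hypothetical; not the real RAD layout).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct Record {
    pub read_id: u64,
    pub ref_id: u32,
    pub pos: u32,
    pub is_forward: bool,
}

/// Fraction of records shared by both outputs, counted as multisets and
/// measured against the larger output. Returns 1.0 when both are empty.
pub fn record_parity(expected: &[Record], observed: &[Record]) -> f64 {
    let denom = expected.len().max(observed.len());
    if denom == 0 {
        return 1.0;
    }
    // Count how many times each record occurs in the reference output.
    let mut counts: HashMap<&Record, usize> = HashMap::new();
    for r in expected {
        *counts.entry(r).or_insert(0) += 1;
    }
    // Consume one occurrence for every matching observed record.
    let mut matched = 0usize;
    for r in observed {
        if let Some(c) = counts.get_mut(r) {
            if *c > 0 {
                *c -= 1;
                matched += 1;
            }
        }
    }
    matched as f64 / denom as f64
}
```

Under this definition, 100% parity means every record in one output has an identical counterpart in the other, with no extras on either side.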
Performance
Mapping performance on 1M paired-end reads (gencode v44, Apple Silicon M2 Max):
| Threads | C++ | Rust | Ratio (Rust/C++) |
|---|---|---|---|
| 1 | 13.7s | 12.5s | 0.91x |
| 4 | 3.9s | 3.2s | 0.83x |
| 8 | 3.3s | 3.1s | 0.92x |
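The Ratio column appears to be the Rust time divided by the C++ time, so values below 1 mean the Rust implementation is faster. A small sketch of that arithmetic, plus thread-scaling speedup, follows. The function names are illustrative, and because the inputs are the rounded timings shown in the table, results may differ from the published ratios in the last digit.

```rust
/// Runtime ratio: values below 1.0 mean the first run is the faster one.
pub fn runtime_ratio(rust_secs: f64, cpp_secs: f64) -> f64 {
    rust_secs / cpp_secs
}

/// Speedup of a multi-threaded run relative to the single-threaded run.
pub fn speedup(single_thread_secs: f64, multi_thread_secs: f64) -> f64 {
    single_thread_secs / multi_thread_secs
}
```

With the single-thread timings, `runtime_ratio(12.5, 13.7)` is roughly 0.91, and `speedup(12.5, 3.2)` gives a little over 3.9x for the Rust implementation at 4 threads.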
Architecture
piscem-rs uses a modular architecture:
- sshash-rs — Rust port of the SSHash compressed k-mer dictionary, with streaming query support and PHast minimal perfect hash functions
- Index layer — ContigTable (Elias-Fano + packed entries), RefInfo, EqClassMap, PoisonTable
- Mapping engine — Sketch-based hit collection with permissive/strict k-mer skipping, paired-end merge, poison filtering
- Protocol layer — Pluggable protocol trait for bulk, scRNA, and scATAC workflows
- I/O layer — RAD binary output, chunked FASTQ reading via paraseq, crossbeam thread pool
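To make the idea of a pluggable protocol layer concrete, here is a minimal sketch of how a protocol trait could separate barcode, UMI, and biological sequence. Every name in it (`Protocol`, `Parsed`, `BulkPaired`, `BarcodeUmi`) is hypothetical, and piscem-rs's real trait may look quite different. The Chromium geometries assume the usual layout of a 16 bp barcode followed by a 10 bp (V2) or 12 bp (V3) UMI on read 1.

```rust
/// Pieces a protocol extracts from a read pair (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct Parsed<'a> {
    pub barcode: Option<&'a [u8]>,
    pub umi: Option<&'a [u8]>,
    /// Reads that should be mapped against the index.
    pub bio: Vec<&'a [u8]>,
}

/// Hypothetical protocol trait; not the actual piscem-rs definition.
pub trait Protocol {
    fn name(&self) -> &'static str;
    /// Split a read pair; returns `None` if read 1 is too short for the geometry.
    fn parse<'a>(&self, r1: &'a [u8], r2: &'a [u8]) -> Option<Parsed<'a>>;
}

/// Bulk paired-end: no barcode or UMI, both reads are biological.
pub struct BulkPaired;

impl Protocol for BulkPaired {
    fn name(&self) -> &'static str {
        "bulk-pe"
    }
    fn parse<'a>(&self, r1: &'a [u8], r2: &'a [u8]) -> Option<Parsed<'a>> {
        Some(Parsed { barcode: None, umi: None, bio: vec![r1, r2] })
    }
}

/// Barcode followed by UMI on read 1; read 2 is the biological read.
pub struct BarcodeUmi {
    pub name: &'static str,
    pub bc_len: usize,
    pub umi_len: usize,
}

impl BarcodeUmi {
    pub const CHROMIUM_V2: Self = Self { name: "chromium_v2", bc_len: 16, umi_len: 10 };
    pub const CHROMIUM_V3: Self = Self { name: "chromium_v3", bc_len: 16, umi_len: 12 };
}

impl Protocol for BarcodeUmi {
    fn name(&self) -> &'static str {
        self.name
    }
    fn parse<'a>(&self, r1: &'a [u8], r2: &'a [u8]) -> Option<Parsed<'a>> {
        let end = self.bc_len + self.umi_len;
        if r1.len() < end {
            return None;
        }
        Some(Parsed {
            barcode: Some(&r1[..self.bc_len]),
            umi: Some(&r1[self.bc_len..end]),
            bio: vec![r2],
        })
    }
}
```

Because the trait is object-safe, a mapping loop could hold a `Box<dyn Protocol>` and stay unaware of which workflow is active. A custom geometry would then only need another `BarcodeUmi` value or a new trait implementation.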
License
BSD 3-Clause