snp-index
A lightweight, high-performance Rust crate for SNP-aware read matching and sparse matrix generation.
Overview
snp-index provides the core building blocks to:
- Load SNPs from a VCF file into a fast lookup index
- Convert BAM records into a normalized AlignedRead
- Optionally refine alignments using a reference genome
- Match reads against SNP loci (REF vs ALT)
- Aggregate results into cell × SNP sparse matrices
- Export results in 10x-compatible format
Key Features
- Fast SNP lookup via binned index
- BAM → AlignedRead conversion built-in
- Sequence-aware refinement (genome-based)
- UMI-aware deduplication via scdata
- Spliced read support (N / ref-skip)
- 10x-compatible output
Architecture
BAM → AlignedRead::from_record()
FASTA → Genome (optional, for refinement)
VCF → SnpIndex
AlignedRead
↓ refine_against_genome()
SnpIndex.match_read()
↓
SnpReadMatch
↓
Scdata (cell × SNP × UMI)
↓
write_sparse()
Example Workflow
use ;
use GeneUmiHash;
// Load inputs
let genome = from_fasta?;
let snp_index = from_vcf_path?;
// Iterate BAM (pseudo-code)
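for rec in bam.records {
    // 1. AlignedRead::from_record(): convert the BAM record
    // 2. refine_against_genome(): optional, requires the genome
    // 3. SnpIndex.match_read(): yields a SnpReadMatch
    // 4. add the match to the REF or ALT Scdata under its cell and UMI
}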
// export 10x-style matrices
scdata_ref.write_sparse?;
scdata_alt.write_sparse?;
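The 10x-style export can be pictured as a Matrix Market coordinate file with one row per feature (here, per SNP) and one column per cell barcode, using 1-based indices. The sketch below is only an illustration of that file layout under those assumptions. `write_matrix_market` and its parameters are hypothetical names, not the signature of `write_sparse`, and the barcode and feature tables that normally accompany the matrix are left out.

```rust
use std::collections::BTreeMap;
use std::io::{self, Write};

/// Write sparse counts as a Matrix Market coordinate file.
/// `counts` maps (feature_row, barcode_col), both 0-based, to a count.
/// Hypothetical helper for illustration; not part of snp-index.
pub fn write_matrix_market<W: Write>(
    mut out: W,
    n_features: usize,
    n_barcodes: usize,
    counts: &BTreeMap<(usize, usize), u32>,
) -> io::Result<()> {
    // Header line, then "rows cols non-zero-entries".
    writeln!(out, "%%MatrixMarket matrix coordinate integer general")?;
    writeln!(out, "{} {} {}", n_features, n_barcodes, counts.len())?;
    // One line per non-zero entry; Matrix Market indices are 1-based.
    for (&(row, col), &value) in counts {
        writeln!(out, "{} {} {}", row + 1, col + 1, value)?;
    }
    Ok(())
}
```

Writing into a `Vec<u8>` instead of a file makes the output easy to check in a test. A REF matrix and an ALT matrix would each be written this way with the same row and column order.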
Core Data Structures
AlignedRead
A normalized alignment representation:
- sequence + qualities
- CIGAR-like operations
- explicit reference coordinates
- independent of BAM after construction
let read = new;
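To show why explicit reference coordinates plus CIGAR-like operations are enough for SNP matching, here is a minimal sketch that maps a reference position to an offset in the read, skipping over introns (N). The `Op` enum and `read_offset_at` are hypothetical stand-ins and do not reflect the actual fields or methods of AlignedRead.

```rust
/// CIGAR-like operations (hypothetical names, not the crate's own types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    Match(u32),    // M, = or X: consumes read and reference
    Ins(u32),      // I: consumes read only
    Del(u32),      // D: consumes reference only
    RefSkip(u32),  // N: intron, consumes reference only
    SoftClip(u32), // S: consumes read only
}

/// Return the read offset aligned to `ref_pos` (0-based), or `None` if the
/// position falls in a deletion, an intron, or outside the alignment.
pub fn read_offset_at(ref_start: u64, ops: &[Op], ref_pos: u64) -> Option<usize> {
    let mut r = ref_start; // current reference coordinate
    let mut q = 0usize; // current read offset
    for op in ops {
        match *op {
            Op::Match(n) => {
                let n = n as u64;
                if ref_pos >= r && ref_pos < r + n {
                    return Some(q + (ref_pos - r) as usize);
                }
                r += n;
                q += n as usize;
            }
            Op::Ins(n) | Op::SoftClip(n) => q += n as usize,
            Op::Del(n) | Op::RefSkip(n) => r += n as u64,
        }
    }
    None
}
```

For a read starting at reference position 1000 with operations 10M100N16M, position 1110 is the first base after the intron and maps to read offset 10, while any position inside the intron yields `None`.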
SnpIndex
- stores SNP loci
- fast lookup via genomic bins
- implements FeatureIndex for scdata
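Lookup via genomic bins can be sketched as follows: each SNP is filed under (chromosome, position >> shift), so a query only inspects the bins that overlap the read. This is a generic illustration of the technique, not the layout SnpIndex actually uses. `BinnedIndex`, `Snp`, and the 16 kb bin size are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Bin width is 2^14 = 16384 bases; an arbitrary choice for this sketch.
const BIN_SHIFT: u32 = 14;

/// Hypothetical SNP record (biallelic, single base).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Snp {
    pub chrom: String,
    pub pos: u64, // 0-based reference position
    pub reference: u8,
    pub alt: u8,
}

/// Hypothetical binned index: SNP ids grouped by (chromosome, bin).
#[derive(Default)]
pub struct BinnedIndex {
    snps: Vec<Snp>,
    bins: HashMap<(String, u64), Vec<usize>>,
}

impl BinnedIndex {
    /// Store a SNP and return its id (its insertion order).
    pub fn insert(&mut self, snp: Snp) -> usize {
        let id = self.snps.len();
        self.bins
            .entry((snp.chrom.clone(), snp.pos >> BIN_SHIFT))
            .or_default()
            .push(id);
        self.snps.push(snp);
        id
    }

    /// Ids of all SNPs on `chrom` inside the half-open interval [start, end).
    pub fn query(&self, chrom: &str, start: u64, end: u64) -> Vec<usize> {
        let mut hits = Vec::new();
        if start >= end {
            return hits;
        }
        // Visit only the bins the interval overlaps, then filter exactly.
        for bin in (start >> BIN_SHIFT)..=((end - 1) >> BIN_SHIFT) {
            if let Some(ids) = self.bins.get(&(chrom.to_string(), bin)) {
                hits.extend(ids.iter().copied().filter(|&i| {
                    let p = self.snps[i].pos;
                    p >= start && p < end
                }));
            }
        }
        hits
    }
}
```

A spliced read that spans a long intron touches many empty bins in this naive form; querying each aligned block separately would avoid that.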
SnpReadMatch
Result of matching a read:
Refinement (Important)
Refinement fixes splice-edge artifacts such as:
10M1X100N15M → 10M100N16M
The rewrite is applied only if the read base matches the reference genome at the corrected position.
This check keeps real mutations from being erased.
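A minimal sketch of this rule, assuming a simplified CIGAR with only M, X, and N: a single mismatch directly before an intron is moved to the far side of the intron when the read base equals the genome base there. In 10M1X100N15M the mismatched base sits at read offset 10; after the shift it aligns 100 bases further along the reference. `Op` and `refine_splice_edge` are hypothetical names and are not the signature of refine_against_genome().

```rust
/// Simplified CIGAR operations for this sketch only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    M(u32), // match: consumes read and reference
    X(u32), // mismatch: consumes read and reference
    N(u32), // ref-skip (intron): consumes reference only
}

/// Rewrite `1X nN bM` as `nN (b+1)M` when the mismatched read base equals
/// the genome base at its position after the shift. `genome` is the sequence
/// of the contig the read maps to; `ref_start` is 0-based.
pub fn refine_splice_edge(ops: &[Op], ref_start: usize, read: &[u8], genome: &[u8]) -> Vec<Op> {
    let mut out = Vec::with_capacity(ops.len());
    let mut r = ref_start; // reference coordinate of the current op
    let mut q = 0usize; // read offset of the current op
    let mut i = 0;
    while i < ops.len() {
        if let [Op::X(1), Op::N(n), Op::M(b), ..] = &ops[i..] {
            // Where the base would align once it is moved past the intron.
            let moved = r + *n as usize;
            let matches_reference = match (read.get(q), genome.get(moved)) {
                (Some(base), Some(reference)) => base == reference,
                _ => false,
            };
            if matches_reference {
                out.push(Op::N(*n));
                out.push(Op::M(*b + 1));
                r += 1 + *n as usize + *b as usize;
                q += 1 + *b as usize;
                i += 3;
                continue;
            }
        }
        // No rewrite: copy the op and advance the coordinates.
        match ops[i] {
            Op::M(k) | Op::X(k) => {
                r += k as usize;
                q += k as usize;
            }
            Op::N(k) => r += k as usize,
        }
        out.push(ops[i]);
        i += 1;
    }
    out
}
```

If the read base does not match the genome at the shifted position, the operations are returned unchanged, so a real variant next to a splice junction survives.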
Known Limitations
Allele conflicts per UMI
If the same (cell, SNP, UMI) supports both REF and ALT:
- current behavior: effectively first-observation / duplicate suppression
- no explicit conflict resolution
Possible future strategies:
- majority vote per UMI
- discard conflicting UMIs
- separate “ambiguous” matrix
- quality-weighted decisions
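To make the difference concrete, the sketch below contrasts first-observation behavior with a majority vote per UMI that reports ties as ambiguous. It is an illustration only: the `Key` layout, `Allele`, and both functions are hypothetical and do not describe how scdata stores or deduplicates observations.

```rust
use std::cmp::Ordering;
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Allele {
    Ref,
    Alt,
}

/// (cell id, SNP id, UMI); a hypothetical key layout for this sketch.
pub type Key = (u32, u32, u64);

/// First observation wins; later reads for the same key are ignored.
pub fn first_observation(obs: &[(Key, Allele)]) -> HashMap<Key, Allele> {
    let mut calls = HashMap::new();
    for &(key, allele) in obs {
        calls.entry(key).or_insert(allele);
    }
    calls
}

/// Majority vote per key; a tie yields `None` (ambiguous).
pub fn majority_vote(obs: &[(Key, Allele)]) -> HashMap<Key, Option<Allele>> {
    // Count REF and ALT observations per (cell, SNP, UMI).
    let mut counts: HashMap<Key, (u32, u32)> = HashMap::new();
    for &(key, allele) in obs {
        let c = counts.entry(key).or_insert((0, 0));
        match allele {
            Allele::Ref => c.0 += 1,
            Allele::Alt => c.1 += 1,
        }
    }
    counts
        .into_iter()
        .map(|(key, (n_ref, n_alt))| {
            let call = match n_ref.cmp(&n_alt) {
                Ordering::Greater => Some(Allele::Ref),
                Ordering::Less => Some(Allele::Alt),
                Ordering::Equal => None,
            };
            (key, call)
        })
        .collect()
}
```

With one REF read followed by two ALT reads for the same key, first observation reports REF while the vote reports ALT. Keys that come back as `None` could be dropped or routed to a separate ambiguous matrix.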
Design Philosophy
- Keep SNP matching fast and pure
- Decouple logic from file formats
- Make everything testable with synthetic data
- Push statistical decisions downstream
Future Extensions
- multi-allelic SNP tracking
- UMI-level consensus refinement
- conflict-aware allele assignment
- SNP coverage / depth metrics
- performance optimizations (SIMD)
Status
Experimental but fully functional. Includes full integration tests (VCF + genome + reads → sparse output).
Author
Stefan Lang, Lund University – Bioinformatics Core Facility