gtf_splice_index
A flexible, streaming GTF/GFF3 parser plus a splice-aware transcript index for fast matching of spliced reads against transcript models.
This crate is designed to sit between:
- Alignment layer (e.g. BAM → spliced blocks)
- Annotation layer (GTF/GFF transcripts)
and provide fast, deterministic transcript matching.
Coordinates are 0-based, half-open: [start, end).
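GTF and GFF3 files store coordinates as 1-based, inclusive intervals, so they have to be shifted to reach the 0-based, half-open convention above. The following is a minimal sketch of that conversion; the function names are hypothetical and are not part of this crate's API.

```rust
/// Convert a 1-based, inclusive GTF/GFF interval into 0-based, half-open form.
/// Returns `None` if the input is not a valid GTF interval
/// (start of 0, or end before start).
fn gtf_to_half_open(start_1based: u64, end_1based: u64) -> Option<(u64, u64)> {
    if start_1based == 0 || end_1based < start_1based {
        return None;
    }
    // Only the start shifts: the inclusive end equals the exclusive end
    // once the coordinate system moves down by one.
    Some((start_1based - 1, end_1based))
}

/// Length of a half-open interval [start, end).
fn interval_len(start: u64, end: u64) -> u64 {
    end - start
}
```

With half-open intervals the length is simply `end - start`, and two adjacent blocks share a boundary value without overlapping.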
Installation
You can add gtf_splice_index directly from GitHub.
Using cargo add (recommended)
cargo add --git https://github.com/stela2502/gtf_splice_index
This will add an entry like this to your Cargo.toml:
[dependencies]
gtf_splice_index = { git = "https://github.com/stela2502/gtf_splice_index" }
Pin to a specific commit or branch (optional)
For reproducibility, you may want to pin a commit or branch.
Specific commit:
[dependencies]
gtf_splice_index = { git = "https://github.com/stela2502/gtf_splice_index", rev = "COMMIT_HASH" }
Specific branch:
[dependencies]
gtf_splice_index = { git = "https://github.com/stela2502/gtf_splice_index", branch = "main" }
Local development (path dependency)
If you’re developing both crates together:
[dependencies]
gtf_splice_index = { path = "../gtf_splice_index" }
Main workflow
- Build a SpliceIndex from a GTF/GFF file.
- Convert an aligned read into a SplicedRead.
- Call match_transcripts() to obtain best transcript matches.
This is the intended public entry point of the crate.
Core exported types
From the crate root:
pub use ;
SplicedRead
A SplicedRead is simply a list of aligned reference blocks.
Typically these come from a BAM CIGAR string.
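To illustrate how aligned reference blocks fall out of a CIGAR string, here is a minimal, self-contained sketch. The `CigarOp` enum and `blocks_from_cigar` function are hypothetical stand-ins and not this crate's API; in practice the operations would come from a BAM library. The sketch assumes deletions stay inside a block and that only reference skips (`N`) start a new block.

```rust
/// Simplified CIGAR operations (hypothetical stand-in for a BAM library type).
#[derive(Debug, Clone, Copy)]
enum CigarOp {
    Match(u64),     // M, = or X: consumes read and reference
    Insertion(u64), // I: consumes read only
    Deletion(u64),  // D: consumes reference only, kept inside the block
    RefSkip(u64),   // N: intron, closes the current block
    SoftClip(u64),  // S: consumes read only
}

/// Turn an alignment start (0-based) plus CIGAR operations into
/// 0-based, half-open reference blocks.
fn blocks_from_cigar(start: u64, ops: &[CigarOp]) -> Vec<(u64, u64)> {
    let mut blocks = Vec::new();
    let mut block_start = start;
    let mut pos = start;
    for op in ops {
        match *op {
            CigarOp::Match(n) | CigarOp::Deletion(n) => pos += n,
            CigarOp::RefSkip(n) => {
                // Close the block that ends at the intron, then jump over it.
                if pos > block_start {
                    blocks.push((block_start, pos));
                }
                pos += n;
                block_start = pos;
            }
            // These operations do not advance the reference position.
            CigarOp::Insertion(_) | CigarOp::SoftClip(_) => {}
        }
    }
    if pos > block_start {
        blocks.push((block_start, pos));
    }
    blocks
}
```

A read aligned at position 100 with CIGAR `5S50M1000N20M2I30M` would yield two blocks under these assumptions: one covering the first 50 reference bases and one covering the 50 reference bases after the intron.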
Building the index from a GTF/GFF
High-level API:
use gtf_splice_index::AnnotationBuilder;
// 1 Mb bins: good default for mammalian genomes
let index = AnnotationBuilder::new(1_000_000)
    .build_from_path("path/to/annotation.gtf") // placeholder path
    .unwrap();
Bin size parameter
The value passed to AnnotationBuilder::new() controls the genomic bin size
used for transcript bucketing.
Recommended values:
| Genome type | Bin size |
|---|---|
| Human / mouse | 1,000,000 |
| Fly / yeast | 100,000 |
| Bacteria | 10,000–50,000 |
For most users: 1,000,000 is a safe default.
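To make the effect of the bin size concrete, the sketch below shows one common way a binned interval index can work: each transcript is registered in every bin its span touches, and a query only inspects the bins it overlaps. This is an illustration under stated assumptions, not the crate's actual implementation; `BinnedIndex` and its methods are hypothetical, and a single chromosome is assumed for brevity.

```rust
use std::collections::HashMap;

/// Bins touched by the half-open interval [start, end).
fn bins_for_interval(start: u64, end: u64, bin_size: u64) -> std::ops::RangeInclusive<u64> {
    assert!(bin_size > 0 && end > start);
    (start / bin_size)..=((end - 1) / bin_size)
}

/// Hypothetical single-chromosome bin index mapping bin number to transcript ids.
struct BinnedIndex {
    bin_size: u64,
    bins: HashMap<u64, Vec<usize>>,
}

impl BinnedIndex {
    fn new(bin_size: u64) -> Self {
        BinnedIndex { bin_size, bins: HashMap::new() }
    }

    /// Register a transcript id in every bin its span overlaps.
    fn insert(&mut self, id: usize, start: u64, end: u64) {
        for bin in bins_for_interval(start, end, self.bin_size) {
            self.bins.entry(bin).or_default().push(id);
        }
    }

    /// Candidate transcript ids for a query span, sorted and de-duplicated.
    fn candidates(&self, start: u64, end: u64) -> Vec<usize> {
        let mut ids: Vec<usize> = bins_for_interval(start, end, self.bin_size)
            .filter_map(|bin| self.bins.get(&bin))
            .flatten()
            .copied()
            .collect();
        ids.sort_unstable();
        ids.dedup();
        ids
    }
}
```

In a scheme like this, smaller bins return fewer candidates per lookup but duplicate long transcripts across more bins, which is the trade-off the table above reflects.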
Streaming parser (advanced usage)
If you need low-level access to annotation records:
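use std::fs::File;
use std::io::BufReader;
use gtf_splice_index::AnnotationReader;
let file = File::open("path/to/annotation.gtf").unwrap(); // placeholder path
let reader = BufReader::new(file);
let rdr = AnnotationReader::new(reader);
for rec in rdr.records() {
    // handle each annotation record here
}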
Most users should use AnnotationBuilder instead.
MatchOptions: controlling what “match” means
Matching behavior is configured using MatchOptions.
Semantics
- Overhangs are always measured and returned.
- Options only control whether a read is considered valid.
- Small internal gaps can be tolerated to avoid false junction mismatches.
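As an illustration of the last point, one way to tolerate small internal gaps is to merge neighbouring blocks whose separation is at or below a threshold before junctions are derived. The sketch below is an assumption about how such tolerance could work, not the crate's code; `merge_small_gaps` and `max_gap` are hypothetical names, and blocks are assumed to be sorted by start.

```rust
/// Merge adjacent half-open blocks separated by at most `max_gap` bases.
/// Blocks must be sorted by start coordinate.
fn merge_small_gaps(blocks: &[(u64, u64)], max_gap: u64) -> Vec<(u64, u64)> {
    let mut merged: Vec<(u64, u64)> = Vec::new();
    for &(start, end) in blocks {
        if let Some(last) = merged.last_mut() {
            // saturating_sub treats overlapping blocks as a gap of zero.
            if start.saturating_sub(last.1) <= max_gap {
                last.1 = last.1.max(end);
                continue;
            }
        }
        merged.push((start, end));
    }
    merged
}
```

With a threshold of 5, a 2-base gap (for example from a short deletion reported as a skip) disappears, while a real intron of hundreds of bases still produces a junction.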
Matching a read against transcripts
The main API:
let hits = index.match_transcripts(...);
Complete example
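use gtf_splice_index::{AnnotationBuilder, MatchOptions, SplicedRead};
// 1) Build index
let index = AnnotationBuilder::new(1_000_000)
    .build_from_path("path/to/annotation.gtf") // placeholder path
    .unwrap();
// 2) Build spliced read
let mut read = SplicedRead::new(...);
read.finalize();
// 3) Configure matching
let opts = MatchOptions { ... };
// 4) Match
let hits = index.match_transcripts(...); // pass the read and options built above
for hit in hits {
    ...
}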
Match classes (conceptual overview)
Typical classifications:
- ExactJunctionChain: Read junction chain equals transcript junction chain.
- Compatible: Read junctions are a subset of transcript junctions.
- JunctionMismatch: Read contains junction(s) not present in transcript.
- Intronic: Read overlaps transcript span but includes intronic sequence.
- OverhangTooLarge: End overhang exceeds configured threshold.
- StrandMismatch: Strand incompatible (if required).
- NoOverlap: No genomic overlap with transcript.
The returned MatchHit always includes measured 5′ and 3′ overhangs.
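The first three classes can be illustrated with a small, self-contained sketch that derives junctions from half-open blocks and compares the read's junction chain with a transcript's. The `ChainClass` enum and both functions are hypothetical stand-ins that mirror the class names above; they are not the crate's types, and the remaining classes (intronic, overhang, strand, overlap) are not modelled here.

```rust
type Block = (u64, u64);
/// A junction is (end of upstream block, start of downstream block).
type Junction = (u64, u64);

/// Derive the junction chain from sorted, half-open blocks.
fn junctions(blocks: &[Block]) -> Vec<Junction> {
    blocks.windows(2).map(|w| (w[0].1, w[1].0)).collect()
}

#[derive(Debug, PartialEq, Eq)]
enum ChainClass {
    ExactJunctionChain,
    Compatible,
    JunctionMismatch,
}

/// Compare a read's junction chain with a transcript's junction chain.
fn classify_chain(read: &[Junction], transcript: &[Junction]) -> ChainClass {
    if read == transcript {
        ChainClass::ExactJunctionChain
    } else if read.iter().all(|j| transcript.contains(j)) {
        // Every read junction is known to the transcript.
        ChainClass::Compatible
    } else {
        ChainClass::JunctionMismatch
    }
}
```

Note that the subset test above also accepts a read that skips an intermediate transcript junction only if the skipping junction itself exists in the transcript; a stricter variant would require the read junctions to form a contiguous run of the transcript chain.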
Typical integration pattern
- Parse BAM record → build SplicedRead
- Call index.match_transcripts()
- Select best hit(s) by match class and overhangs
- Assign transcript or gene label
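The selection step above can be sketched as a ranking by match class first and total overhang second. The `Hit` struct, its field names, and the class ordering below are assumptions made for illustration; the crate's `MatchHit` may expose different fields, and the right priority order depends on the application.

```rust
/// Hypothetical class ranking: earlier variants are preferred.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Class {
    ExactJunctionChain,
    Compatible,
    OverhangTooLarge,
    JunctionMismatch,
}

/// Hypothetical stand-in for a match result.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Hit {
    transcript_id: String,
    class: Class,
    overhang_5p: u32,
    overhang_3p: u32,
}

/// Pick the hit with the best class, breaking ties by smallest total overhang.
/// If several hits tie on both criteria, the first one in input order wins.
fn best_hit(hits: &[Hit]) -> Option<&Hit> {
    hits.iter()
        .min_by_key(|h| (h.class, h.overhang_5p + h.overhang_3p))
}
```

Because ties resolve by input order, feeding hits in a stable order keeps the assignment deterministic.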
Performance characteristics
- Streaming annotation parser
- Binned transcript index for fast candidate lookup
- Junction-based transcript filtering
- Deterministic matching results
License
Add your license here (e.g. MIT / Apache-2.0).