gtf_splice_index

A flexible, streaming GTF/GFF3 parser plus a splice-aware transcript index for fast matching of spliced reads against transcript models.

This crate is designed to sit between:

- Alignment layer (e.g. BAM → spliced blocks)
- Annotation layer (GTF/GFF transcripts)

and provide fast, deterministic transcript matching.

Coordinates are 0-based, half-open: [start, end).
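
GTF and GFF3 files store feature coordinates as 1-based, inclusive on both ends, so a conversion is needed to reach the 0-based, half-open convention used here. The following is a minimal sketch of that conversion; `HalfOpen` and `gtf_to_half_open` are hypothetical names for illustration and are not taken from the crate's API.

```rust
/// A 0-based, half-open interval [start0, end0).
/// Hypothetical helper type for this sketch, not part of the crate's API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct HalfOpen {
    pub start0: u64,
    pub end0: u64,
}

impl HalfOpen {
    /// Number of bases covered by the interval.
    pub fn len(&self) -> u64 {
        self.end0 - self.start0
    }
}

/// Convert a GTF/GFF3 feature (1-based, inclusive) to 0-based, half-open.
/// Returns None for invalid input (a start of 0, or an end before the start).
pub fn gtf_to_half_open(start1: u64, end1: u64) -> Option<HalfOpen> {
    if start1 == 0 || end1 < start1 {
        return None;
    }
    // Only the start shifts: the inclusive 1-based end equals the exclusive 0-based end.
    Some(HalfOpen { start0: start1 - 1, end0: end1 })
}
```

With this convention a GTF exon at 101..150 becomes [100, 150), and its length is simply `end0 - start0`.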

Installation

You can add gtf_splice_index directly from GitHub.

Using `cargo add` (recommended)

cargo add gtf_splice_index --git https://github.com/stela2502/gtf_splice_index

This will add an entry like this to your Cargo.toml:

[dependencies]
gtf_splice_index = { git = "https://github.com/stela2502/gtf_splice_index" }

Pin to a specific commit or branch (optional)

For reproducibility, you may want to pin a commit or branch.

Specific commit:

[dependencies]
gtf_splice_index = { git = "https://github.com/stela2502/gtf_splice_index", rev = "COMMIT_HASH" }

Specific branch:

[dependencies]
gtf_splice_index = { git = "https://github.com/stela2502/gtf_splice_index", branch = "main" }

Local development (path dependency)

If you are developing gtf_splice_index alongside a crate that depends on it:

[dependencies]
gtf_splice_index = { path = "../gtf_splice_index" }

Main workflow

1. Build a SpliceIndex from a GTF/GFF file.
2. Convert an aligned read into a SplicedRead.
3. Call match_transcripts() to obtain best transcript matches.

This is the intended public entry point of the crate.

Core exported types

From the crate root:

pub use types::{RefBlock, SplicedRead, Strand};

`SplicedRead`

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SplicedRead {
    pub chr_id: usize,
    pub strand: Strand,
    pub blocks: Vec<RefBlock>,
    finalized: bool,
}

A SplicedRead is a chromosome ID, a strand, and a list of aligned reference blocks.

The blocks typically come from a BAM CIGAR string, split at each spliced gap (the N operation).
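
As an illustration of how reference blocks can be derived from a CIGAR string, here is a self-contained sketch that works on the text form of the CIGAR. `Block` and `cigar_to_blocks` are hypothetical stand-ins and are not part of this crate. The sketch assumes that deletions (D) extend the current block and that only N starts a new one; a real integration may prefer to emit deletions as small gaps instead.

```rust
/// Reference block, 0-based half-open. Local stand-in for the crate's RefBlock.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Block {
    pub start: u32,
    pub end: u32,
}

/// Turn a textual CIGAR into reference blocks, starting at the 0-based
/// alignment position `pos0`. Hypothetical helper for illustration only.
pub fn cigar_to_blocks(pos0: u32, cigar: &str) -> Result<Vec<Block>, String> {
    let mut blocks = Vec::new();
    let mut block_start = pos0; // start of the block currently being extended
    let mut ref_pos = pos0; // current position on the reference
    let mut num = String::new(); // digits of the pending operation length

    for ch in cigar.chars() {
        if ch.is_ascii_digit() {
            num.push(ch);
            continue;
        }
        let len: u32 = num
            .parse()
            .map_err(|_| format!("bad length before '{ch}'"))?;
        num.clear();
        match ch {
            // Operations that consume reference and stay inside the block
            // (assumption: deletions do not split a block).
            'M' | '=' | 'X' | 'D' => ref_pos += len,
            // Spliced gap: close the current block, skip the intron.
            'N' => {
                if ref_pos > block_start {
                    blocks.push(Block { start: block_start, end: ref_pos });
                }
                ref_pos += len;
                block_start = ref_pos;
            }
            // Operations that do not consume reference.
            'I' | 'S' | 'H' | 'P' => {}
            other => return Err(format!("unknown CIGAR op '{other}'")),
        }
    }
    if !num.is_empty() {
        return Err("trailing length without operation".to_string());
    }
    if ref_pos > block_start {
        blocks.push(Block { start: block_start, end: ref_pos });
    }
    Ok(blocks)
}
```

For a read aligned at 0-based position 100 with CIGAR `50M50N50M`, this yields the blocks [100, 150) and [200, 250).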

Building the index from a GTF/GFF

High-level API:

use gtf_splice_index::AnnotationBuilder;

// 1 Mb bins: good default for mammalian genomes
let index = AnnotationBuilder::new(1_000_000)
    .build_from_path("genes.gtf")
    .unwrap();

Bin size parameter

The value passed to AnnotationBuilder::new() controls the genomic bin size used for transcript bucketing.

Recommended values:

| Genome type | Bin size (bp) |
| --- | --- |
| Human / mouse | 1,000,000 |
| Fly / yeast | 100,000 |
| Bacteria | 10,000–50,000 |

For most users: 1,000,000 is a safe default.
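
To make the effect of the bin size concrete, the sketch below shows a common fixed-width binning scheme: a transcript is registered in every bin its span touches, and a read lookup only inspects the bins the read touches. This is an assumption about how such an index typically works, not a description of the crate's internals, and `bins_for_span` is a hypothetical name.

```rust
use std::ops::RangeInclusive;

/// Bin numbers touched by the 0-based, half-open span [start, end).
/// Returns None for an empty span or a bin size of zero.
/// Hypothetical helper for illustration; not part of the crate's API.
pub fn bins_for_span(start: u64, end: u64, bin_size: u64) -> Option<RangeInclusive<u64>> {
    if bin_size == 0 || end <= start {
        return None;
    }
    // The last covered base is end - 1 because the end is exclusive.
    Some((start / bin_size)..=((end - 1) / bin_size))
}
```

Under this scheme, smaller bins mean fewer candidate transcripts per lookup but more bin entries per long transcript, which is one plausible reason to scale the bin size with typical gene length.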

Streaming parser (advanced usage)

If you need low-level access to annotation records:

use std::fs::File;
use std::io::BufReader;
use gtf_splice_index::annotation::io::AnnotationReader;

let file = File::open("genes.gtf").unwrap();
let reader = BufReader::new(file);

let rdr = AnnotationReader::new(reader);
for rec in rdr.records() {
    let rec = rec.unwrap();
    println!("{} {}-{}", rec.seqname, rec.start0, rec.end0);
}

Most users should use AnnotationBuilder instead.

MatchOptions: controlling what “match” means

Matching behavior is configured using MatchOptions.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MatchOptions {
    /// If true, require read blocks to be on a compatible strand.
    pub require_strand: bool,

    /// If true, require the read to have the exact same splice junction chain as the transcript.
    pub require_exact_junction_chain: bool,

    /// Maximum allowed 5′ overhang (bp). If exceeded -> OverhangTooLarge.
    pub max_5p_overhang_bp: u32,

    /// Maximum allowed 3′ overhang (bp). If exceeded -> OverhangTooLarge.
    pub max_3p_overhang_bp: u32,

    /// Allowed sequencing micro-gap inside exons.
    /// Gaps ≤ this size are ignored when computing junctions.
    /// Larger gaps become real junctions and may lead to JunctionMismatch.
    pub allowed_intronic_gap_size: u32,
}

Semantics

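- Overhangs are always measured and returned, whatever the options say.
- The overhang limits only decide whether a hit is classified as OverhangTooLarge; they do not change the measured values.
- Gaps between adjacent read blocks of at most allowed_intronic_gap_size bp are ignored, which avoids false junction mismatches.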
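
The gap tolerance can be pictured as a filter applied when junctions are derived from the read blocks. The following is a rough sketch under the assumption that blocks are sorted and non-overlapping; `junctions_with_gap_tolerance` is a hypothetical function, and the crate's actual implementation may differ.

```rust
/// Derive splice junctions from sorted, non-overlapping read blocks.
/// Each block and each junction is a 0-based, half-open (start, end) pair.
/// Gaps of at most `allowed_gap` bp are ignored; larger gaps become junctions.
/// Hypothetical helper for illustration; not part of the crate's API.
pub fn junctions_with_gap_tolerance(
    blocks: &[(u32, u32)],
    allowed_gap: u32,
) -> Vec<(u32, u32)> {
    let mut junctions = Vec::new();
    for pair in blocks.windows(2) {
        let (_, left_end) = pair[0];
        let (right_start, _) = pair[1];
        // Distance between the end of one block and the start of the next.
        let gap = right_start.saturating_sub(left_end);
        if gap > allowed_gap {
            junctions.push((left_end, right_start));
        }
    }
    junctions
}
```

With blocks [100, 150), [153, 200), [300, 350) and a tolerance of 5, only the junction (200, 300) remains; with a tolerance of 0, the 3 bp gap is also reported as a junction.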

Matching a read against transcripts

The main API:

let hits = index.match_transcripts(&read, opts);

Complete example

use gtf_splice_index::{
    AnnotationBuilder,
    RefBlock,
    SplicedRead,
    Strand,
    MatchOptions,
};

// 1) Build index
let index = AnnotationBuilder::new(1_000_000)
    .build_from_path("genes.gtf")
    .unwrap();

// 2) Build spliced read
let mut read = SplicedRead::new(
    1,
    Strand::Plus,
    vec![
        RefBlock::new(100, 150),
        RefBlock::new(200, 250),
    ],
);
read.finalize();

// 3) Configure matching
let opts = MatchOptions {
    require_strand: true,
    require_exact_junction_chain: false,
    max_5p_overhang_bp: 10,
    max_3p_overhang_bp: 10,
    allowed_intronic_gap_size: 5,
};

// 4) Match
let hits = index.match_transcripts(&read, opts);

for hit in hits {
    println!(
        "transcript={} class={:?} over5={} over3={}",
        hit.transcript.primary_name().unwrap_or("unknown"),
        hit.hit.class,
        hit.hit.overhang_5p_bp,
        hit.hit.overhang_3p_bp,
    );
}

Match classes (conceptual overview)

Typical classifications:

- ExactJunctionChain: Read junction chain equals transcript junction chain.
- Compatible: Read junctions are a subset of transcript junctions.
- JunctionMismatch: Read contains junction(s) not present in transcript.
- Intronic: Read overlaps transcript span but includes intronic sequence.
- OverhangTooLarge: End overhang exceeds configured threshold.
- StrandMismatch: Strand incompatible (if required).
- NoOverlap: No genomic overlap with transcript.
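
The three junction-based classes can be sketched as a comparison between two junction lists. This is a simplified illustration that follows the wording above literally (subset membership); `ChainClass` and `classify_chain` are hypothetical names, and the crate's real classifier also has to handle the Intronic, overhang, strand, and overlap cases.

```rust
/// Junction-based outcome only. Hypothetical enum for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChainClass {
    ExactJunctionChain,
    Compatible,
    JunctionMismatch,
}

/// A junction as (donor_end, acceptor_start), 0-based half-open intron bounds.
pub type Junction = (u32, u32);

/// Compare the read's junction chain with the transcript's junction chain.
pub fn classify_chain(read: &[Junction], transcript: &[Junction]) -> ChainClass {
    if read == transcript {
        return ChainClass::ExactJunctionChain;
    }
    // Every read junction must also be a transcript junction.
    if read.iter().all(|j| transcript.contains(j)) {
        ChainClass::Compatible
    } else {
        ChainClass::JunctionMismatch
    }
}
```

Note that in this sketch an unspliced read (no junctions) is Compatible with any multi-exon transcript, because an empty list is a subset of every list.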

The returned MatchHit always includes measured 5′ and 3′ overhangs.
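
One way to picture the overhang measurement is shown below. It assumes that an overhang is the number of read bases extending beyond the transcript span, and that the 5′ end is the left end on the plus strand and the right end on the minus strand. Both points are assumptions of this sketch; `SketchStrand` and `overhangs` are hypothetical names.

```rust
/// Local stand-in for a strand type; not the crate's Strand.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SketchStrand {
    Plus,
    Minus,
}

/// Measure how far a read span extends past a transcript span.
/// Both spans are 0-based, half-open (start, end) pairs.
/// Returns (overhang_5p_bp, overhang_3p_bp) relative to the transcript strand.
pub fn overhangs(read: (u32, u32), tx: (u32, u32), strand: SketchStrand) -> (u32, u32) {
    // Bases of the read lying to the left / right of the transcript span.
    let left = tx.0.saturating_sub(read.0);
    let right = read.1.saturating_sub(tx.1);
    match strand {
        SketchStrand::Plus => (left, right),
        // On the minus strand the transcript's 5' end is its rightmost base.
        SketchStrand::Minus => (right, left),
    }
}
```

A read spanning [95, 260) against a transcript spanning [100, 250) gives 5 bp on the left and 10 bp on the right, reported as (5, 10) on the plus strand and (10, 5) on the minus strand.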

Typical integration pattern

1. Parse BAM record → build SplicedRead
2. Call index.match_transcripts()
3. Select best hit(s) by match class and overhangs
4. Assign transcript or gene label
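
The selection step is left to the caller. The sketch below shows one possible policy: accept only ExactJunctionChain and Compatible hits, prefer the better class, then the smaller total overhang, then the transcript name as a deterministic tie-break. `Class`, `Hit`, and `best_hit` are hypothetical stand-ins that only mirror the class names and overhang fields described in this README; the ranking itself is an assumption, not the crate's behavior.

```rust
/// Stand-in for the match classes listed in this README.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Class {
    ExactJunctionChain,
    Compatible,
    JunctionMismatch,
    Intronic,
    OverhangTooLarge,
    StrandMismatch,
    NoOverlap,
}

/// Flattened stand-in for one returned hit.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Hit {
    pub transcript: String,
    pub class: Class,
    pub overhang_5p_bp: u32,
    pub overhang_3p_bp: u32,
}

/// Rank of an acceptable class (lower is better); None means "reject".
fn rank(class: Class) -> Option<u8> {
    match class {
        Class::ExactJunctionChain => Some(0),
        Class::Compatible => Some(1),
        _ => None,
    }
}

/// Pick the best acceptable hit, or None if no hit is acceptable.
pub fn best_hit(hits: &[Hit]) -> Option<&Hit> {
    hits.iter()
        .filter_map(|h| rank(h.class).map(|r| (r, h)))
        // Order by class rank, then total overhang, then name for determinism.
        .min_by_key(|(r, h)| (*r, h.overhang_5p_bp + h.overhang_3p_bp, h.transcript.clone()))
        .map(|(_, h)| h)
}
```

A pipeline that needs gene-level labels could apply the same ranking and then collapse the selected transcripts to their parent genes.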

Performance characteristics

- Streaming annotation parser
- Binned transcript index for fast candidate lookup
- Junction-based transcript filtering
- Deterministic matching results

License

Add your license here (e.g. MIT / Apache-2.0).

gtf_splice_index 0.4.0