//! # GBZ-base and GAF-base: pangenome file formats using SQLite databases.
//!
//! # GBZ-base
//!
//! This is a prototype for storing a GBZ graph in a SQLite database.
//! It is intended for interactive applications that need immediate access to the graph.
//! In such applications, the overhead from loading the GBZ graph into memory can be significant (e.g. 20 seconds for a human graph).
//! As long as the application needs only a fraction of the entire graph (e.g. 1 Mbp context in a human graph), using the database is faster than loading the graph.
//! This assumes that the database is stored on a local SSD.
//!
//! The prototype builds on the [`gbz`] crate.
//!
//! See [`GBZBase`], [`GraphInterface`], and [`Subgraph`] for the database interface.
//! See [`GBZPath`] and [`GBZRecord`] for the related structures.
//!
//! ### Basic concepts
//!
//! Nodes are accessed by handles, which are [`gbz::GBWT`] node identifiers.
//! A handle encodes both the identifier of the node in the underlying graph and its orientation.
//! Each node record corresponds to a row in table `Nodes`, with the handle as its primary key.
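//!
//! The handle encoding described above can be sketched as follows.
//! This is an illustration only: it assumes the usual GBWT convention where the lowest bit of a handle stores the orientation.
//! The names `Orientation`, `encode_handle`, `decode_handle`, and `flip_handle` are hypothetical and are not part of this crate's API.
//!
//! ```rust
//! /// Orientation of a node visit (hypothetical type for this sketch).
//! #[derive(Clone, Copy, Debug, PartialEq, Eq)]
//! pub enum Orientation {
//!     Forward = 0,
//!     Reverse = 1,
//! }
//!
//! /// Packs a node identifier and an orientation into a single handle.
//! /// Assumption: the lowest bit stores the orientation.
//! pub fn encode_handle(node_id: usize, orientation: Orientation) -> usize {
//!     2 * node_id + orientation as usize
//! }
//!
//! /// Splits a handle back into the node identifier and the orientation.
//! pub fn decode_handle(handle: usize) -> (usize, Orientation) {
//!     let orientation = if handle & 1 == 0 {
//!         Orientation::Forward
//!     } else {
//!         Orientation::Reverse
//!     };
//!     (handle / 2, orientation)
//! }
//!
//! /// Returns the handle for the same node in the other orientation.
//! pub fn flip_handle(handle: usize) -> usize {
//!     handle ^ 1
//! }
//! ```
//!
//! Under this convention, node 21 in reverse orientation has handle 43, and flipping handle 42 gives 43.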
//!
//! Paths are accessed by handles, which are path identifiers in the original graph.
//! Each path record corresponds to a row in table `Paths`, with the handle as its primary key.
//! The record contains information for both orientations of the path.
//!
//! Paths can be indexed for random access, which is useful, for example, for finding a graph region by its reference coordinates.
//! Indexing is based on storing the sequence offset and the GBWT position at the start of a node once every ~1000 bp.
//! The indexed positions are stored in table `ReferenceIndex`.
//! By default, only generic paths (sample name `_gbwt_ref`) and reference paths (sample name listed in GBWT tag `reference_samples`) are indexed.
//! The database can become excessively large if all paths are indexed.
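//!
//! The sampling idea behind the path index can be sketched as follows.
//! This is a simplified illustration, not the crate's implementation: the type `Sample` and the functions `sample_path` and `last_sample_at_or_before` are hypothetical.
//! The step number along the path stands in for the GBWT position, and sampling the first node is an assumption.
//!
//! ```rust
//! /// An indexed position: a sequence offset at the start of a node and
//! /// the step along the path (a stand-in for the GBWT position).
//! #[derive(Clone, Copy, Debug, PartialEq, Eq)]
//! pub struct Sample {
//!     pub seq_offset: usize,
//!     pub step: usize,
//! }
//!
//! /// Samples node starts so that consecutive samples are at least
//! /// `interval` bp apart. `node_lengths[i]` is the length of the node
//! /// visited at step `i` of the path.
//! pub fn sample_path(node_lengths: &[usize], interval: usize) -> Vec<Sample> {
//!     let mut samples = Vec::new();
//!     let mut offset = 0;
//!     let mut next_sample = 0;
//!     for (step, &len) in node_lengths.iter().enumerate() {
//!         if offset >= next_sample {
//!             samples.push(Sample { seq_offset: offset, step });
//!             next_sample = offset + interval;
//!         }
//!         offset += len;
//!     }
//!     samples
//! }
//!
//! /// Finds the last sample at or before the query offset by binary search.
//! /// A caller would then walk the path forward from that sample.
//! pub fn last_sample_at_or_before(samples: &[Sample], query: usize) -> Option<Sample> {
//!     let count = samples.partition_point(|s| s.seq_offset <= query);
//!     if count == 0 { None } else { Some(samples[count - 1]) }
//! }
//! ```
//!
//! Because samples are only taken at node starts, the spacing is approximate: a query never has to walk much more than one interval plus one node from the nearest preceding sample.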
//!
//! # GAF-base
//!
//! This is a prototype for a SQLite-based file format for sequence alignments to a pangenome graph.
//! It is mostly compatible with the GAF format.
//! Target paths are stored as a GBWT index in table `Nodes`, which is similar to the table in GBZ-base.
//!
//! The default GAF-base is reference-based and requires the corresponding GBZ graph or GBZ-base for decoding the alignments.
//! It is also possible to build a reference-free GAF-base that stores the node sequences in table `Nodes`.
//!
//! Alignment metadata is stored in table `Alignments`.
//! Each row in the table corresponds to a block of alignments, which are assumed to be close in the graph.
//! The metadata is stored space-efficiently using column-based compression.
//!
//! Table `Alignments` is indexed by (minimum handle, maximum handle) in the target paths.
//! Given a query subgraph, we want to find the blocks whose handle interval overlaps with the subgraph.
//! These blocks must then be decompressed, as they may contain alignments to the subgraph.
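//!
//! The block selection described above can be sketched in memory as follows.
//! This is an illustration under stated assumptions, not the query the database actually runs: `BlockInterval` and `candidate_blocks` are hypothetical names, and the subgraph is represented as a sorted set of handles.
//!
//! ```rust
//! use std::collections::BTreeSet;
//!
//! /// Handle interval of one alignment block (hypothetical type).
//! pub struct BlockInterval {
//!     pub id: usize,
//!     pub min_handle: usize,
//!     pub max_handle: usize,
//! }
//!
//! /// Returns the identifiers of blocks whose handle interval contains
//! /// at least one handle of the query subgraph. Such a block is only a
//! /// candidate: it still has to be decompressed to see whether any of
//! /// its alignments touch the subgraph.
//! pub fn candidate_blocks(blocks: &[BlockInterval], subgraph: &BTreeSet<usize>) -> Vec<usize> {
//!     blocks
//!         .iter()
//!         .filter(|b| {
//!             // Guard against malformed intervals; `range` panics if start > end.
//!             b.min_handle <= b.max_handle
//!                 && subgraph.range(b.min_handle..=b.max_handle).next().is_some()
//!         })
//!         .map(|b| b.id)
//!         .collect()
//! }
//! ```
//!
//! A block with interval (18, 35) is selected for a subgraph containing handle 22 even if none of its alignments visit that node, which is why the selected blocks are only candidates.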
//!
//! See [`GAFBase`] and [`ReadSet`] for the database interface.
//! See [`alignment`], [`Alignment`], and [`AlignmentBlock`] for more details.

pub mod alignment;
pub mod db;
pub mod formats;
pub mod gaf_sort;
pub mod path_index;
pub mod read_set;
pub mod subgraph;
pub mod utils;

// Shared utility functions for tests.
#[cfg(test)]
pub(crate) mod internal;

pub use alignment::{Alignment, AlignmentBlock};
pub use alignment::mapping::{Difference, Mapping};
pub use db::{GBZBase, GBZPath, GBZRecord, GraphInterface, GraphReference};
pub use db::{GAFBase, GAFBaseParams};
pub use path_index::PathIndex;
pub use read_set::{ReadSet, AlignmentOutput};
pub use subgraph::Subgraph;
pub use subgraph::query::{SubgraphQuery, HaplotypeOutput};