Expand description
§GRanges: Generic Genomic Range and Data Containers
GRanges is a Rust library for working with genomic ranges and their associated data. GRanges also implements a bedtools-like command line tool as a test and benchmark of the library. In benchmarks, this command line tool is reliably 30-40% faster than bedtools. Internally, GRanges uses the very fast coitrees library written by Daniel C. Jones for overlap operations.
The GRanges library aims to simplify the creation of powerful, performant genomics tools in Rust. GRanges is inspired by the design and ease of use of Bioconductor’s GenomicRanges and plyranges packages, but tools using GRanges are optimized at compile time and thus are typically much faster.
§Table of Contents
- GRanges: Generic Genomic Range and Data Containers
- The Overlaps–Map-Combine Pattern
- Example GRanges Workflow
- Loading Data into GRanges
- Sequence types
- Manipulating GRanges objects
- Future Development
- Documentation Guide
§GRanges Design
Nearly all common data processing and analytics tasks in science can be thought of as taking some raw input data and refining it down some pipeline, until eventually outputting some sort of statistical result. In bioinformatics, genomics data processing workflows are often composed from command line tools and Unix pipes. Analogously, tidyverse unifies data tidying, summarizing, and modeling by passing and manipulating a datafame down a similar pipeline. Unix pipes and tidyverse are powerful because they allow the scientist to build highly specialized analytic tools just by remixing the same core set of data operations.
GRanges implements many of the same core data joining and manipulation operations as the plyranges library and R’s tidyverse.
§The GRanges
container
Central to the GRanges library is the generic GRanges<R, T>
data container. This is generic
over the range container type R
, because range data structures have inherent performance
tradeoffs for different tasks. For example, when parsing input data into memory, the total size
is often not known in advanced. Consequently, the most efficient range container data structure
is a dynamically-growing Vec<U>
. However, when overlaps need to be computed between ranges
across two containers, the interval tree data structures are needed. GRanges
supports
conversion between these range container types.
GRanges<R, T>
is also generic over its data container, T
. This allows it to hold any
type of data, in any data structure: a Vec<f64>
of numeric stores, an
ndarray::Array2
matrix, a
polars dataframe, custom struct
s, etc. Each range in the ranges container
indexes an element in the data container. There is also a special GRangesEmpty
type for
data-less GRanges
objects.
The GRanges
type implements a small but powerful set of methods that can be used to process
and analyze genomic data. GRanges relies heavily on method-chaining operations that pass
genomic range data down a pipeline, similar to tidyverse and Unix pipes.
§The Overlaps–Map-Combine Pattern
A very common data analysis strategy is Split-Apply-Combine pattern (Wickham, 2011). When working with genomic overlaps and their data, GRanges library relies on its own data analysis pattern known as Overlaps–Map-Combine pattern, which is better suited for common genomic operations and analyses.
§Overlap Joins and the Left Grouped Join
In this pattern, the left genomic ranges are “joined” by their overlaps with another right genomic ranges. In other words, the right ranges that overlap a single left range can be thought of “joined” to this left range.
With genomic data, downstream processing is greatly simplified if the results of this join are grouped by what left range they overlap (see illustration below). This is a special kind of join known as a left grouped join.
Left range: ███████████████████████████████████
Right ranges: ██ ██████ ██████ ██████
In GRanges, a left grouped join (GRanges::left_overlaps()
) returns a GRanges
object
containing the exact same ranges as the input left ranges. In other words, left grouped
joins are endomorphic in the range container. Computationally, this is also extremely
efficient, because the left ranges can be passed through to the results directly (a
zero-overheard operation).
The GRanges
object returned by GRanges::left_overlaps()
contains a JoinData
(or related)
type as its data container. The JoinData
contains information about each left range’s overlaps (e.g. number of
overlapping bases, fraction of overlaps, etc) in a Vec<LeftGroupedJoin>
. This information about
overlapping ranges may then be used by downstream calculations.
Additionally JoinData
stores the left ranges’ data container, and has a reference to
the right ranges’ data container. In both cases, the data is unchanged by the join.
Downstream processing can also easily access the left ranges’s data and all overlapping right ranges’
data for calculations.
§Map-Combine over Joins
After a left grouped join, each left range can have zero or more overlapping right ranges.
A map-combine operation (optionally) maps a function over all right ranges’ data and
overlaps information, and then combines that into a single data entry per left range. These
combined data entries make up a new GRanges<R, Vec<V>>
data container, returned by GRanges::map_joins()
.
Note that like GRanges::left_overlaps()
, the GRanges::map_joins()
is endomorphic
over its range container. This means it can be passed through without modification, which
is computationally efficient. This results from a Map-Combine operation then can be overlap joined with
other genomic ranges, filtered, have its data arbitrarily manipulated by GRanges::map_data()
, etc.
§Example GRanges Workflow
To illustrate these ideas, lets look at an example of how we might use GRanges to do a
a commonly-encountered genomic calculation. Suppose you had a set of exon genomic ranges (the left
ranges) and a multicolumn TSV (right ranges) of various genomic features (e.g. assay
results, recombination hotspots, variants, etc). Imagine you wanted to form some statistic per
each exon, based on data in the overlapping right ranges. This operation is very
similar to bedtools map
, except that suppose the statistic you want to calculate is not one of the
few offered by bedtools. GRanges makes it simple to compose your own, very fast genomic tools to answer these
questions.
To see how GRanges would make it simple to compose a specialized fast tool to solve this
problem, let’s fist see how few lines of code it would take to implement bedtools map
,
since our problem is akin to using bedtools map
with our own function to calculate our statistic.
Let’s start by getting the mean score by seeing how to get the mean BED5 score across all
overlapping right ranges for each left range (i.e. bedtools map -a <left> -b <right> -c 5 mean
).
Here is the Rust code to do this using GRanges:
// Mock sequence lengths (e.g. "genome" file)
let genome = seqlens!("chr1" => 100, "chr2" => 100);
// Create parsing iterators to the left and right BED files.
let left_iter = Bed3Iterator::new("tests_data/bedtools/map_a.txt")?;
let right_iter = Bed5Iterator::new("tests_data/bedtools/map_b.txt")?;
// Filter out any ranges from chromosomes not in our genome file.
let left_gr = GRangesEmpty::from_iter(left_iter, &genome)?;
let right_gr = GRanges::from_iter(right_iter, &genome)?;
// Create the "right" GRanges object, convert the ranges to an
// interval trees, and tidy it by selecting out a f64 score.
let right_gr = right_gr
// Convert to interval trees.
.into_coitrees()?
// Extract out just the score from the additional BED5 columns.
.map_data(|bed5_cols| {
bed5_cols.score
})?;
// Find the overlaps by doing a *left grouped join*.
let left_join_gr = left_gr.left_overlaps(&right_gr)?;
// Process all the overlaps.
let results_gr = left_join_gr.map_joins(|join_data| {
// Get the "right data" -- the BED5 scores.
let overlap_scores: Vec<f64> = join_data.right_data.into_iter()
// filter out missing values ('.' in BED)
.filter_map(|x| x).collect();
// calculate the mean
let score_sum: f64 = overlap_scores.iter().sum();
score_sum / (overlap_scores.len() as f64)
})?;
// Write to a TSV file, using the BED TSV format standards
// for missing values, etc.
let tempfile = tempfile::NamedTempFile::new().unwrap();
results_gr.write_to_tsv(Some(tempfile.path()), &BED_TSV)?;
In relatively few lines of code, we can implement core functionality of bedtools. As mentioned earlier, GRanges code typically runs 30-40% faster than bedtools.
§Loading Data into GRanges
Nearly all genomic range data enters GRanges through parsing iterators (see the parsers
module). Parsing iterators read input data from some bioinformatics file format (e.g. .bed
,
.gtf.gz
, etc) and it then goes down one of two paths:
-
Streaming mode: the entire data set is never read into memory all at once, but rather is processed entry by entry.
-
In-memory mode: the entire data set is read into memory all at once, usually as a
GRanges<R, T>
object. TheGRanges<R, T>
object is composed of two parts: a ranges container (ranges) and a data container (data).GRanges
objects then provide high-level in-memory methods for working with this genomic data.
In the example above, we loaded the items in the parsing iterators directly into
GRanges
objects, since we had to do overlap operations (in the future, GRanges will
support streaming overlap joins for position-sorted input data).
Both processing modes eventually output something, e.g. a TSV of some statistic calculated on the genomic data, the output of 10 million block bootstraps, or the coefficients of linear regressions run every kilobase on sequence data.
The GRanges workflow idea is illustrated in this (super cool) ASCII diagram:
+-------+ +-------------------+ +----------------------+
| Input | -----> | Parsing Iterator | -----> | GRanges object |
+------ + | (streaming mode) | | (in-memory mode) |
+-------------------+ +----------------------+
| |
| |
v v
+-----------------+ +-----------------+
| Data Processing | | Data Processing |
+-----------------+ +-----------------+
| |
v v
+--------+ +--------+
| Output | | Output |
+--------+ +--------+
§Sequence types
Genomic processing also involves working with sequence data types, which abstractly are per-basepair data that cover the entire genome (and in the case of the reference sequence, define the genome coordinates). Nucleotide sequences are just a special case of this, where the per-basepair data are nucleotides (which under the hood are just an unsigned byte-length integer). There are many other per-basepair data types, e.g. conservation scores.
Sequence data types are like a GRanges
where the range container
is implied as per-basepair, and the data container contains the per-basepair data.
GRanges allows arbitrary data to be stored and manipulated (see the sequences
module) in sequence types: numeric data, nucleotides, and arrays per basepair.
Since loading an entire genome’s per-basepair data into memory would be inefficient,
sequence types implement lazy-loading.
§Manipulating GRanges objects
-
Creation:
GRanges::new_vec()
,GRanges::from_iter()
,GRangesEmpty::from_windows()
. -
Range-modifying functions:
GRanges::into_coitrees()
,GRanges::adjust_ranges()
,GRanges::sort()
,GRanges::flanking_ranges()
,GRanges::filter_overlaps()
. -
Data-modifying functions:
GRanges::left_overlaps()
,GRanges::map_joins()
,GRanges::map_data()
. -
Range and Data modifying functions:
GRanges::push_range()
.
Note that not all range and data container types support these operations, e.g.
GRanges::push_range()
is not available for ranges stored in an interval tree.
However, by design, this is known at compile time, due Rust’s typing system. Thus
there is lower risk of runtime panics, since more potential issues are caught at
compile time.
§Future Development
When I was first learning Rust as a developer, one thing that struck me about the language is that it feels minimal but never constraining. This is quite a contrast compared to languages like C++. As Rust developers know, the minimalness was by design; the cultural norm of slow growth produced a better final product in the end.
The GRanges library attempts to follow this design too. Currently, the alpha release is about implementating a subset of essential core functionality to benchmark against alternative software. Since the design and ergonomics of the API are under active development, please, please, create a GitHub issue if:
-
You want a particular feature.
-
You don’t know how you’d implement a feature yourself with GRanges.
-
The “ergonomics” don’t feel right.
§Current Limitations
The biggest current limitations are:
-
No native support for GTF/GFF/VCF. It is relatively easy to write custom parsers for these types, or use the noodles libary’s parsers.
-
No out-of-memory overlap operations. This is relatively easy to add, and will be added in future.
-
No BAM (BCF, or other binary) file support. Again, these formats can be easily loaded by wrapping the parsers in the noodles libary. Future version of GRanges will likely have a
--features=nooodles
option, which will provide convenience for this. -
Parallelized operations are not in the
main
branch yet. I have written parallel iterators that can be used for operations, but these need to be ported over to the latest design. -
Lack of TSV column type inference methods. This would make loading and working with arbitrary TSV files with heterogeneous column types easier.
⚠️:Note that the GRanges API, as well as type names, may change.
§Contributing
If you are interested in contributing to GRanges, please contact Vince Buffalo (@vsbuffalo on Twitter and GitHub), starting a new discussion, or creating a GitHub issue.
§Documentation Guide
- ⚙️ : technical details that might only be useful to developers handling tricky cases.
- ⚠️: indicates a method or type that is unstable (it may change in future implementations), or is “sharp” (developer error could cause a panic).
- 🚀: optimization tip.
- 🐌: indicates slow code in need of future optimization.
Re-exports§
pub use crate::error::GRangesError;
pub use indexmap;
Modules§
- commands
- Command functions that implement each of the
granges
subcommands. - data
- Data container implementations.
- error
- The
GRangesError
enum
definition and error messages. - granges
- The
GRanges<R, T>
andGRangesEmpty
types, and associated functionality. - io
- Types and methods for reading and parsing input and writing output.
- iterators
- Iterators over genomic ranges.
- join
LeftGroupedJoin
,JoinData
, andJoinDataIterator
types for overlaps.- merging_
iterators - Merging iterators
- prelude
- The main exports of the GRanges library.
- ranges
- Range types and Range Containers.
- reporting
- Types for standardized reports to the user about range operations.
- sequences
- Functionality for working with per-basepair data.
- test_
utilities - Test cases and test utility functions.
- traits
- Traits used by the GRanges library.
- unique_
id - The
UniqueIdentifier
type, which stores a mapping between some key type and ausize
index.
Macros§
- create_
granges_ with_ seqlens - Create a new
GRanges<R, T>
with sequence length information (used primarily for small examples) - ensure_
eq - a version of assert_eq! that is more user-friendly.
- seqlens
- Create an
IndexMap
of sequence names and their lengths.
Constants§
Type Aliases§
- Position
- The main position type in GRanges.
- Position
Offset - The main signed position type in GRanges, to represent offsets (e.g. for adjust range coordinates, etc).