omics_coordinate/lib.rs
1//! Coordinates upon a molecule.
2//!
3//! A **coordinate** is the fundamental unit for describing a location within a
4//! genome. Coordinates point to a single location within a contiguous molecule
5//! (typically a nucleic acid molecule, such as DNA or RNA, or a protein) and
6//! are specified at the _nucleotide_ level of abstraction.
7//!
//! Coordinates are composed of three components:
9//!
10//! * The name of the molecule upon which the coordinate sits is known as the
11//! [**contig**](crate::Contig).
12//! * Each molecule is made of a contiguous series of elements. The offset of
13//! the selected element with respect to the starting element of the molecule
14//! is known as the [**position**](crate::Position).
15//! * Optionally, if the molecule is stranded, the strand upon which the
16//! coordinate sits is known as the [**strand**](crate::Strand).
17//!
18//! Coordinates, via their positions, can fall within the _interbase_ coordinate
19//! system (which is closely related to the 0-based, half-open coordinate
20//! system) or the _in-base_ coordinate system (closely related to the 1-based,
//! fully-closed coordinate system). In this crate, the interbase coordinate
22//! system is denoted using the `interbase`/`Interbase` identifiers, and the
23//! in-base coordinate system is denoted using the `base`/`Base` identifiers (we
24//! didn't like the way `in_base`/`InBase` looked).
25//!
26//! If you want to learn more about the supported coordinate systems, or if you
27//! want to learn why this crate uses the terms that it does (e.g., "in-base"
28//! instead of "1-based"), please jump to [this section](crate#positions) of the
29//! docs.
30//!
31//! ### Scope
32//!
33//! At present, `omics-coordinate` is focused almost exclusively on nucleic acid
34//! molecules. In the future, however, we expect to expand this to cover
35//! proteins as well.
36//!
37//! ### Quickstart
38//!
39//! To get started, you'll need to decide if you want to use interbase or
40//! in-base coordinates. This decision largely depends on your use case, the
41//! consumers of the data, and the context of both (a) where input data is
42//! coming from and (b) where output data will be shared. Note that, if you're
43//! working with a common bioinformatics file format, the coordinate system is
44//! often dictated by the format itself. If you need help deciding which
45//! coordinate system to use, you should start by reading [the positions
46//! section](#positions) of the docs.
47//!
48//! Once you've decided on which coordinate system you'd like to use, you can
49//! create coordinates like so:
50//!
51//! ```
52//! use omics_coordinate::Coordinate;
53//! use omics_coordinate::system::Base;
54//! use omics_coordinate::system::Interbase;
55//!
56//! // An interbase coordinate.
57//! let coordinate = Coordinate::<Interbase>::try_new("seq0", "+", 0)?;
58//! println!("{:#}", coordinate);
59//!
//! // An in-base coordinate.
61//! let coordinate = Coordinate::<Base>::try_new("seq0", "+", 1)?;
62//! println!("{:#}", coordinate);
63//!
64//! # Ok::<(), Box<dyn std::error::Error>>(())
65//! ```
66//!
67//! For convenience, the crate also provides type aliases for the interbase and
68//! in-base variants of the relevant concepts. For example, you can use a
69//! [`Position<Interbase>`] by instead simply importing a
70//! [`zero::Position`](crate::position::zero::Position).
71//!
72//! ```
73//! use omics_coordinate::interbase::Coordinate;
74//!
75//! let coordinate = Coordinate::try_new("seq0", "+", 0)?;
76//! println!("{:#}", coordinate);
77//!
78//! # Ok::<(), Box<dyn std::error::Error>>(())
79//! ```
80//!
81//! # Background
82//!
//! Comprehensive, authoritative material on coordinate systems can be
//! surprisingly hard to find and, as a result, these systems have a reputation
//! for being confusing to newcomers to the field. To address this lack of
//! material and to describe
86//! how terms are used within this crate, the authors lay out their
87//! understanding of the history behind the terminology used in the community
88//! and then cover their perspective on what terms are most appropriate to be
89//! used within different contexts. Notably, this may not match the worldview of
90//! other popular resources or papers out there. In these cases, departures from
91//! convention are noted alongside carefully reasoned opinions on why the
92//! departure was made.
93//!
94//! ## Biology Primer
95//!
96//! Before diving into the coordinate system-specific details, we must first lay
97//! some groundwork for terms used within genomics in general. These definitions
98//! serve as a quick overview to orient you to the discussion around coordinate
99//! systems—if you're interested in more detailed information, you can learn
100//! more at [https://learngenomics.dev](https://learngenomics.dev).
101//!
//! * A **genome** is the complete set of genetic code stored within a cell
//!   ([learn more](https://www.genome.gov/genetics-glossary/Genome)).
//! * **Deoxyribonucleic acid**, or **DNA**, is a molecule that warehouses
//!   the aforementioned genetic code. In eukaryotic cells, DNA resides in the
//!   nucleus of a cell.
//!   * DNA is stored as a sequence of **nucleotides** (i.e., `A`, `C`, `G`,
//!     and `T`).
//!   * DNA is double-stranded, meaning there are two complementary sequences
//!     of nucleotides that run antiparallel to one another.
//! * **Ribonucleic acid**, or **RNA**, is a molecule that is _transcribed_ from
//!   a particular stretch of DNA.
//!   * RNA is _also_ stored as a sequence of nucleotides (though, in this case,
//!     the nucleotides are `A`, `C`, `G`, and `U`).
//!   * RNA is single-stranded, meaning that it represents the transcription
//!     of only one of the strands of DNA.
//!   * RNA generally either (a) serves as a template for the production of a
//!     protein or (b) has some functional role in and of itself.
//! * **Proteins** are macromolecules that are assembled by _translating_ the
//!   nucleotide sequence stored within an RNA molecule into a chain of amino
//!   acids. Proteins play a wide variety of roles in the function of a cell.
122//!
123//! Though there are exceptions to this rule, the core idea is this: through a
124//! series of steps described within [the central dogma of molecular
125//! biology](https://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology),
126//! genetic code stored within DNA is commonly transcribed to RNA and either (a)
127//! the RNA is used as a template to assemble a functional protein through the
128//! process of translation [in the case of _coding_ RNA], or (b) that RNA plays
129//! some functional role in and of itself [in the case of _non-coding_ RNA].
130//!
131//! This crate attempts to provide facilities to effectively describe
132//! coordinates within the context of DNA molecules and RNA molecules in the
133//! various notations used within the community. We'll start with the most
134//! granular concepts (e.g., contigs, positions, and strands) and work our way
//! up to the broadest concepts (e.g., intervals and coordinate
136//! systems).
137//!
138//! ## Contigs
139//!
140//! Typically, genetic information that constitutes a genome is not stored as a
141//! single, contiguous molecule. Instead, genomes are commonly broken up into
142//! multiple, contiguous molecules of DNA known as **chromosomes**. Beyond the
143//! chromosomes, other sequences, such as the [Epstein–Barr virus][chrEBV], the
144//! [mitochondrial genome][chrMT], or decoy sequences are inserted as contigs
145//! within a reference genome to serve various purposes. This broader category
//! of contiguous nucleotide sequences is colloquially referred to as
147//! "contigs".
148//!
//! As we learn more about the human genome, new versions, called **genome
//! builds**, are released that describe the known genetic sequence therein.
//! Each contig contained within a particular genome build is assigned a unique
152//! identifier within that build (e.g., `chr1` within the `hg38` genome build).
153//! Specifying the contiguous molecule upon which a coordinate is located is the
154//! first step in anchoring the coordinate within a genome.
155//!
156//! For example, the [most recent release][t2t-genome] ([ref][t2t-publication])
157//! of the human genome at the time of writing has _exactly_ 24 contigs—these
158//! represent the 22 autosomes and the X/Y sex chromosomes present in the human
159//! genome. Interestingly, earlier versions of the human genome, such as
160//! [GRCh37][grch37-genome] and [GRCh38][grch38-genome], contain more contigs
//! that represent phenomena such as unplaced sequences (i.e., sequences that
162//! we know are located _somewhere_ in the human genome, but we didn't know
163//! exactly where when the reference genome was released) and unlocalized
164//! sequences (i.e., sequences where we know the chromosome upon which the
165//! sequence was located but not the exact position).
166//!
167//! #### Design Considerations
168//!
169//! There are no current or planned restrictions on what a contig can be named,
170//! as the crate needs to remain able to support all possible use cases. That
171//! said, the authors may introduce (optional) convenience methods based on
172//! common naming conventions in the future, such as the detection of `chr`
173//! prefixes, which is a convention for the naming of chromosomes specifically.
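//!
//! As a minimal sketch of what such a convenience method might look like, the
//! helpers below detect and strip a `chr` prefix using only the standard
//! library. Note that `has_chr_prefix` and `strip_chr_prefix` are hypothetical
//! names chosen for illustration; they are not part of this crate's API.
//!
//! ```rust
//! // Hypothetical helper (not provided by this crate): reports whether a
//! // contig name follows the `chr`-prefixed naming convention.
//! fn has_chr_prefix(name: &str) -> bool {
//!     name.starts_with("chr")
//! }
//!
//! // Hypothetical helper (not provided by this crate): returns the contig
//! // name without its `chr` prefix, or the name unchanged if there is none.
//! fn strip_chr_prefix(name: &str) -> &str {
//!     name.strip_prefix("chr").unwrap_or(name)
//! }
//! ```
//!
//! Under these assumptions, `strip_chr_prefix("chr1")` yields `"1"`, while a
//! name without the prefix (e.g., `"MT"`) is returned untouched.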
174//!
175//! ## Positions
176//!
177//! This section lays out a detailed, conceptual model within which we can
178//! compare and contrast the two kinds of positions used within genomic
179//! coordinate systems: namely, _in-base_ positions and _interbase_ positions.
180//! We then cover how these terms relate to commonly used terms in the community
181//! (including a "0-based, half-open coordinate system" and a "1-based,
182//! fully-closed coordinate system") and how you can use this crate to flexibly
183//! represent a spectrum of locations within a genome.
184//!
185//! Before we begin, a word of caution—many materials attempt to make the
186//! differences between in-base and interbase positions (or the closely related
187//! 0-based, half-open and 1-based, fully closed coordinate systems) appear
188//! small and unremarkable (e.g., by providing seemingly straightforward
189//! formulas to convert between the two). In fact, after a quick scan of these
190//! materials, you may even be tempted to view the two systems as simply a
191//! difference in accounting and off-by-one hoopla!
192//!
193//! In the authors' opinion, not only is this not true, it also doesn't serve
194//! you well to think of the coordinate systems as anything less than entirely
195//! different universes that must be explicitly and responsibly traversed
196//! between. To be clear, we're not suggesting that the existing materials are
197//! _wrong_—often, you can follow the conventions laid out, and, as long as the
198//! baked-in assumptions are consistently true for your use case, everything
//! will be well. That said, we endeavour to go further within this crate: to
200//! explore the very fabric of these coordinate systems, point out the
201//! assumptions made in each coordinate system, and enable you to understand and
202//! write code that works across the spectrum of possible position
203//! representations.
204//!
205//! #### In-base and Interbase Positions
206//!
207//! Positions within a genomic coordinate system can be represented as either
208//! _in-base_ positions or _interbase_ positions:
209//!
210//! * **In-base** positions point directly to and fully encapsulate a
211//! nucleotide. These types of positions are generally considered to be
212//! intuitive from a biological reasoning standpoint and are often used in
213//! contexts where data is reported back to a biological audience (e.g.,
214//! genome browsers and public variant databases). Though we use the term
215//! "in-base" exclusively in this document, these types of positions are also
216//! sometimes referred to as simply "base" positions in the broader community.
217//! * **Interbase** positions point to the spaces _between_ nucleotides. These
218//! positions are generally considered to be easier to work with
219//! computationally for a variety of reasons that will become apparent in the
220//! text that follows. It is also possible to unambiguously represent certain
221//! types of variation, such as insertions and structural variant breakpoints,
222//! using interbase positions. As such, interbase positions are commonly used
223//! as the internal representation of positions within bioinformatics tools as
224//! well as in situations where the output is meant to be consumed
225//! computationally (e.g., APIs).
226//!
227//! For example, SAM files, which are intended to be human-readable, use in-base
228//! positions to make themselves more easily interpretable and compatible with
229//! genomic databases. Their non-human-readable, binary counterparts, known as
//! BAM files, use interbase positions for the reasons described above. The
//! decision on which coordinate system to use was largely based on the
//! difference in how the two file types were meant to be consumed (to learn
233//! more about what the author of SAM/BAM said about the decision, read the end
234//! of [this StackExchange
235//! answer](https://bioinformatics.stackexchange.com/a/17757)).
236//!
237//! #### Conceptual Model
238//!
239//! Here, we introduce a conceptual model that is useful for comparing and
240//! contrasting the two coordinate systems. Under this model, nucleotides and
241//! the spaces between them are pulled apart and considered to coexist as
242//! independent entities laid out along a discrete axis. Both nucleotides and
243//! spaces represent a "slot", and the kind of slot may be distinguished by
244//! designating it as a "nucleotide slot" and a "space slot" respectively.
245//! Numbered positions are assigned equidistantly at every other slot within
246//! either system, but the type of slot where positions are assigned is mutually
247//! exclusive between the two systems:
248//!
249//! * Numbered positions are assigned to each of the nucleotide slots within the
250//! in-base coordinate system.
251//! * Numbered positions are assigned to each of the space slots within the
252//! interbase coordinate system.
253//!
254//! Importantly, in both systems, **only slots with an assigned position can be
255//! specified using a position**. This has incredibly important implications on
256//! what locations can and cannot be expressed within the two coordinate
257//! systems.
258//!
259//! The diagram below depicts the model applied over a short sequence of seven
260//! nucleotides. Each slot has a series of double pipe characters (`║`) that
261//! links a slot with its assigned, numbered position (if it exists) within the
262//! in-base and interbase coordinate systems. Note that, though the two
//! position systems are displayed in parallel in the diagram below, that is
264//! only so that they can be compared/contrasted more easily. More specifically,
265//! **they do not interact with each other in any way**.
266//!
267//! ```text
//! ========================== seq0 =========================
//! •   G   •   A   •   T   •   A   •   T   •   G   •   A   •
//! ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║
//! ║[--1--]║[--2--]║[--3--]║[--4--]║[--5--]║[--6--]║[--7--]║ In-base Positions
//! 0       1       2       3       4       5       6       7 Interbase Positions
273//! ```
274//!
275//! As was alluded to above, reasoning about the in-base coordinate system under
276//! this model is relatively straightforward—if one wants to create a position
277//! representing the location of the first nucleotide (`G`), it can be done by
//! simply denoting the numbered position assigned to the same slot as the `G`
279//! nucleotide, which is position `1`.
280//!
281//! Creating a position that represents the same nucleotide using the interbase
282//! coordinate system is more complicated. Recall that (a) no numbered positions
283//! are assigned to nucleotide slots within the interbase coordinate system and
284//! (b) only numbered slots may be referenced as a position. As such, referring
285//! to the first nucleotide using a single, numbered position is impossible.
286//! Indeed, in a strict sense, a _range_ of numbered positions must be used to
287//! encapsulate even this single nucleotide (🤯)—namely, the range `[0-1]` (note
288//! that the range of interbase positions is generally considered _exclusive_,
289//! but that does not apply here when the space slots and nucleotide slots are
290//! split).
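//!
//! The relationship described above can be sketched in a few lines of Rust.
//! The function below is a hypothetical, standalone illustration (it does not
//! use this crate's types) and assumes the usual conventions that in-base
//! positions start at `1` and interbase positions start at `0`.
//!
//! ```rust
//! // Hypothetical helper (not provided by this crate): returns the pair of
//! // interbase positions that flank the nucleotide at an in-base position.
//! //
//! // Assumes in-base positions start at 1, so an in-base position of 0 has
//! // no flanking pair and yields `None`.
//! fn flanking_interbase(in_base: u64) -> Option<(u64, u64)> {
//!     in_base.checked_sub(1).map(|before| (before, in_base))
//! }
//! ```
//!
//! For the first nucleotide in the diagram (`G`, in-base position `1`), this
//! yields the interbase pair `(0, 1)`, which matches the range given above.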
291//!
292//! #### Starting Position
293//!
294//! By convention within the community, interbase positions almost always start
295//! at position zero (`0`) and in-base positions almost always start at position
296//! one (`1`). As far as the authors can tell, this is for three main reasons
297//! (please contribute to the docs if you disagree with any of these assertions
298//! or know of other reasons):
299//!
300//! * **History.** Biological coordinate systems and databases have historically
301//! started with the first entity of a sequence at position `1`. Thus, in-base
302//! coordinates (which, again, are generally considered to be more suitable
303//! for a broader biological audience) tend to follow these same conventions.
304//! Because interbase positions effectively capture the space _around_ these
305//! entities, a number before one is needed to represent the space before the
306//! first entity.
307//! * **Intention.** This interplay works out well, as interbase coordinates
308//! depart from a biologically intuitive model in favor of a more
309//! computationally intuitive model. To that end, interbase positions
310//! typically mirror programming languages in that counting starts at `0`.
311//! This suggests that, many times, interbase coordinates are a more natural
312//! fit for existing data structures and algorithms.
313//! * **Convention.** Beyond the reasons above (and, further, not strictly
314//! imposed by the definitions of interbase and in-base coordinate systems),
315//! the community has evolved to use the starting position of `0` or `1` to
316//! allude to the use of interbase and in-base positions, respectively.
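//!
//! To illustrate the point about existing data structures, the sketch below
//! slices a sequence using a pair of interbase positions and Rust's built-in,
//! half-open ranges. The `slice_interbase` function is a hypothetical helper
//! written for this example (it is not part of this crate), and the sequence
//! is assumed to be ASCII so that one byte corresponds to one nucleotide.
//!
//! ```rust
//! // Hypothetical helper (not provided by this crate): returns the
//! // nucleotides that sit between two interbase positions, or `None` if the
//! // positions are reversed or fall outside of the sequence.
//! fn slice_interbase(sequence: &str, start: usize, end: usize) -> Option<&str> {
//!     // Interbase positions map directly onto a half-open range.
//!     sequence.get(start..end)
//! }
//! ```
//!
//! With the sequence `"GATATGA"`, the positions `(0, 1)` select `"G"`, and a
//! pair of equal positions, such as `(2, 2)`, selects the empty space between
//! the second and third nucleotides.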
317//!
318//! ## Strand
319//!
//! DNA is a double-stranded molecule that stores genetic code. This means that
//! two sequences of complementary nucleotides run antiparallel to one another.
//! By convention, each strand is read from [5' to
//! 3'](https://en.wikipedia.org/wiki/Directionality_%28molecular_biology%29),
//! a direction that refers to connections within the underlying chemical
//! structure. For example, below is a fictional double-stranded molecule with
//! the name `seq0`.
326//!
327//! ```text
//! ---------------- Read this direction --------------->
//!
//! 5'                                                 3'
//! ===================== seq0 (+) ======================
//! G   A   T   A   T   G   A   A   T   A   T   G   A   G
//! |   |   |   |   |   |   |   |   |   |   |   |   |   |
//! C   T   A   T   A   C   T   T   A   T   A   C   T   C
//! ===================== seq0 (-) ======================
//! 3'                                                 5'
//!
//! <--------------- Read this direction ----------------
339//! ```
340//!
//! In a real-world, biological context, both strands contain genetic
//! information that is important to the function of the cell. Even so, _some_
//! system of labelling must be introduced to distinguish which of the two
//! strands a genomic coordinate is located on.
346//!
347//! To address this, a reference genome selects one of the strands as the
348//! **positive** strand (also called the "sense" strand, the "reference" strand,
349//! or the `+` strand) for each contiguous molecule. This implies that the
350//! opposite, complementary strand is the **negative** strand (also called the
351//! "antisense" strand, the "complementary" strand, or the `-` strand). Notably,
352//! reference genomes only specify the nucleotide sequence for the _positive_
353//! strand, as the negative strand's nucleotide sequence may be computed as the
354//! reverse complement of the positive strand.
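//!
//! As a minimal sketch of that computation, the function below derives the
//! negative strand's sequence (read 5' to 3') from the positive strand's
//! sequence. The `reverse_complement` function is a hypothetical helper for
//! illustration (it is not part of this crate) and assumes the input contains
//! only the uppercase DNA nucleotides `A`, `C`, `G`, and `T`.
//!
//! ```rust
//! // Hypothetical helper (not provided by this crate): computes the reverse
//! // complement of a DNA sequence. Returns `None` if any character is not
//! // one of the uppercase nucleotides `A`, `C`, `G`, or `T`.
//! fn reverse_complement(sequence: &str) -> Option<String> {
//!     sequence
//!         .chars()
//!         // Walk the sequence backwards...
//!         .rev()
//!         // ...and swap each nucleotide for its complement.
//!         .map(|nucleotide| match nucleotide {
//!             'A' => Some('T'),
//!             'C' => Some('G'),
//!             'G' => Some('C'),
//!             'T' => Some('A'),
//!             _ => None,
//!         })
//!         .collect()
//! }
//! ```
//!
//! For example, the reverse complement of `GATATGA` under this sketch is
//! `TCATATC`.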
355//!
//! The concept of strandedness is useful when describing the location of a
//! coordinate on a molecule with two strands. Some nucleic acid molecules, such
//! as RNA, are single-stranded: RNA is _derived_ from a particular strand of
//! DNA, but the RNA molecule itself is not considered to be stranded.
360//!
361//! Within this crate, a [`Strand`] always refers to the strand of the
362//! coordinate upon a molecule (if the molecule is stranded). If the molecule
363//! upon which the nucleotide(s) sit is _not_ stranded, then no strand should be
364//! specified.
365//!
366//! This means that,
367//!
368//! * Coordinates that lie upon a DNA molecule must always have a strand. The
369//! [`Strand::Positive`] and [`Strand::Negative`] variants are used to
370//! distinguish which strand a coordinate sits upon relative to the strand
371//! specified in the reference genome.
372//! * Coordinates that lie upon an RNA molecule have no strand. In particular,
//! the original strand of DNA from which a position on RNA is derived is
374//! lost during any conversion from one to the other. If it is of interest,
375//! you may keep track of this kind of thing on your own at conversion time.
376//!
377//! ## Intervals
378//!
379//! Intervals describe a range of positions upon a contiguous molecule.
380//! Generally speaking, you can think of an interval as simply a start
381//! coordinate and end coordinate within one of the coordinate systems.
382//! Intervals are always closed _with respect to their comprising coordinates_.
383//!
384//! The following figure illustrates this concept using the notation described
385//! in [the position section of the docs](#positions).
386//!
387//! ```text
//! ========================== seq0 ===========================
//! •   G   •   A   •   T   •   A   •   T   •   G   •   A   •
//! ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║
//! ║   1   ║   2   ║   3   ║   4   ║   5   ║   6   ║   7   ║ In-base Positions
//! 0       1       2       3       4       5       6       7 Interbase Positions
//! ===========================================================
//! ┃   ┃                                               ┃   ┃
//! ┃   ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛   ┃ seq0:+:1-7 (In-base interval)
//! ┃                Both contain "GATATGA"                 ┃
//! ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛ seq0:+:0-7 (Interbase interval)
398//! ```
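//!
//! The correspondence shown in the figure can be sketched as follows. This is
//! an illustration only: the two functions are hypothetical helpers that
//! operate on bare integers rather than on this crate's [`Interval`] type, and
//! they assume a positive-strand interval, in-base positions that start at
//! `1`, and interbase positions that start at `0`. Outside of those
//! assumptions, the arithmetic below does not apply.
//!
//! ```rust
//! // Hypothetical helper (not provided by this crate): converts a closed,
//! // in-base interval into the interbase interval that encloses the same
//! // nucleotides. Returns `None` for an invalid in-base interval.
//! fn in_base_to_interbase(start: u64, end: u64) -> Option<(u64, u64)> {
//!     if start == 0 || end < start {
//!         return None;
//!     }
//!
//!     Some((start - 1, end))
//! }
//!
//! // Hypothetical helper (not provided by this crate): converts an interbase
//! // interval into the in-base interval covering the same nucleotides.
//! // Returns `None` when the interval encloses no nucleotides at all, as
//! // there is no in-base interval that can express an empty span.
//! fn interbase_to_in_base(start: u64, end: u64) -> Option<(u64, u64)> {
//!     if end <= start {
//!         return None;
//!     }
//!
//!     Some((start + 1, end))
//! }
//! ```
//!
//! Under these assumptions, the in-base interval `1-7` maps to the interbase
//! interval `0-7` and back again, whereas an empty interbase interval such as
//! `3-3` has no in-base counterpart.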
399//!
400//! # Crate Design
401//!
402//! Throughout the crate, you will see references to interbase and in-base
403//! variants of the concepts above. For example, there is a core [`Position`]
404//! struct that is defined like so:
405//!
406//! ```ignore
407//! pub struct Position<S>
408//! where
409//! S: System, {
410//! // private fields
411//! }
412//! ```
413
414// TODO: this is a false positive missing doc link, remove this when it gets fixed.
415#![allow(rustdoc::broken_intra_doc_links)]
416//! The struct takes a single, generic parameter that is a [`System`]. In this
417//! design, functionality that is fundamental to both interbase and in-base
//! position types is implemented in the core [`Position`] struct.
419//! Functionality that is different between the two coordinate systems is
420//! implemented through traits (in the case of positions, [the `Position`
421//! trait](crate::position::r#trait::Position<S>)) and exposed through
422//! trait-constrained methods (e.g., [`Position::checked_add`]).
423
//! Note that some concepts, such as [`Contig`] and [`Strand`], are
//! coordinate-system invariant. As such, they don't take a [`System`] generic
//! type parameter.
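//!
//! To make the pattern concrete, here is a heavily simplified, standalone
//! sketch of a position type that is generic over a coordinate system. The
//! names mirror those used in this crate, but everything below (including the
//! `FIRST` constant and the `try_new`/`get` methods) is an assumption made for
//! illustration and should not be read as the crate's actual definitions.
//!
//! ```rust
//! use std::marker::PhantomData;
//!
//! // A simplified stand-in for a coordinate system.
//! trait System {
//!     // The smallest position value that is valid within the system
//!     // (an assumption for this sketch).
//!     const FIRST: u64;
//! }
//!
//! // Marker types that select a coordinate system at compile time.
//! struct Interbase;
//! struct Base;
//!
//! impl System for Interbase {
//!     const FIRST: u64 = 0;
//! }
//!
//! impl System for Base {
//!     const FIRST: u64 = 1;
//! }
//!
//! // A position that is tagged with the system it belongs to.
//! struct Position<S: System> {
//!     value: u64,
//!     system: PhantomData<S>,
//! }
//!
//! impl<S: System> Position<S> {
//!     // Functionality shared by both systems lives on the generic struct;
//!     // the system-specific detail is supplied through the trait.
//!     fn try_new(value: u64) -> Option<Self> {
//!         (value >= S::FIRST).then(|| Position {
//!             value,
//!             system: PhantomData,
//!         })
//!     }
//!
//!     fn get(&self) -> u64 {
//!         self.value
//!     }
//! }
//! ```
//!
//! In this sketch, `Position::<Interbase>::try_new(0)` succeeds while
//! `Position::<Base>::try_new(0)` returns `None`, and the compiler rejects any
//! attempt to pass a `Position<Base>` where a `Position<Interbase>` is
//! expected.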
427//!
428//! ## Learning More
429//!
430//! In the original writing of these docs, it was difficult to find a single,
431//! authoritative source regarding all of the conventions and assumptions that
432//! go into coordinate systems. Here are a few links that the authors consulted
433//! when writing this crate.
434//!
435//! * [This blog post](https://genome-blog.gi.ucsc.edu/blog/2016/12/12/the-ucsc-genome-browser-coordinate-counting-systems/)
436//! from the UCSC genome browser team does a pretty good job explaining the
437//! basics of 0-based versus 1-based coordinate systems and why they are used
438//! in different contexts.
//!   * Note that this crate does not follow the conventions UCSC uses for
//!     formatting the two coordinate systems differently (e.g., `seq0 0 1` for
//!     0-based coordinates and `seq0:1-1` for 1-based coordinates). Instead,
//!     the two coordinate systems are distinguished by the Rust type system
//!     and are serialized similarly (e.g., `seq0:+:0-1` for 0-based
//!     coordinates and `seq0:+:1-1` for 1-based coordinates).
445//! * [This blog post](https://tidyomics.com/blog/2018/12/09/2018-12-09-the-devil-0-and-1-coordinate-system-in-genomics/)
446//! also presents the two coordinate systems and gives some details about
//! concrete file formats where each is used.
448//! * [This cheat sheet](https://www.biostars.org/p/84686/) is a popular
449//! community resource (though, you should be sure to read the comments!).
450//!
451//! [chrEBV]: https://en.wikipedia.org/wiki/Epstein%E2%80%93Barr_virus
452//! [grch37-genome]: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13/
453//! [grch38-genome]: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/
454//! [chrMT]: https://en.wikipedia.org/wiki/Mitochondrial_DNA
455//! [t2t-genome]: https://www.ncbi.nlm.nih.gov/assembly/GCF_009914755.1/
456//! [t2t-publication]: https://www.science.org/doi/10.1126/science.abj6987
457
458pub mod contig;
459pub mod coordinate;
460pub mod interval;
461pub mod math;
462pub mod position;
463pub mod strand;
464pub mod system;
465
466pub use contig::Contig;
467pub use coordinate::Coordinate;
468pub use coordinate::base;
469pub use coordinate::interbase;
470pub use interval::Interval;
471pub use position::Position;
472pub use strand::Strand;
473pub use system::System;