//! Coordinates upon a molecule.
//!
//! A **coordinate** is the fundamental unit for describing a location within a
//! genome. Coordinates point to a single location within a contiguous molecule
//! (typically a nucleic acid molecule, such as DNA or RNA, or a protein) and
//! are specified at the _nucleotide_ level of abstraction.
//!
//! Coordinates are composed of three components:
//!
//! * The name of the molecule upon which the coordinate sits is known as the
//! [**contig**](crate::Contig).
//! * Each molecule is made of a contiguous series of elements. The offset of
//! the selected element with respect to the starting element of the molecule
//! is known as the [**position**](crate::Position).
//! * Optionally, if the molecule is stranded, the strand upon which the
//! coordinate sits is known as the [**strand**](crate::Strand).
//!
//! Coordinates, via their positions, can fall within the _interbase_ coordinate
//! system (which is closely related to the 0-based, half-open coordinate
//! system) or the _in-base_ coordinate system (closely related to the 1-based,
//! fully-closed coordinate system). In this crate, the interbase coordinate
//! system is denoted using the `interbase`/`Interbase` identifiers, and the
//! in-base coordinate system is denoted using the `base`/`Base` identifiers (we
//! didn't like the way `in_base`/`InBase` looked).
//!
//! If you want to learn more about the supported coordinate systems, or if you
//! want to learn why this crate uses the terms that it does (e.g., "in-base"
//! instead of "1-based"), please jump to [this section](crate#positions) of the
//! docs.
//!
//! ### Scope
//!
//! At present, `omics-coordinate` is focused almost exclusively on nucleic acid
//! molecules. In the future, however, we expect to expand this to cover
//! proteins as well.
//!
//! ### Quickstart
//!
//! To get started, you'll need to decide if you want to use interbase or
//! in-base coordinates. This decision largely depends on your use case, the
//! consumers of the data, and the context of both (a) where input data is
//! coming from and (b) where output data will be shared. Note that, if you're
//! working with a common bioinformatics file format, the coordinate system is
//! often dictated by the format itself. If you need help deciding which
//! coordinate system to use, you should start by reading [the positions
//! section](#positions) of the docs.
//!
//! Once you've decided on which coordinate system you'd like to use, you can
//! create coordinates like so:
//!
//! ```
//! use omics_coordinate::Coordinate;
//! use omics_coordinate::system::Base;
//! use omics_coordinate::system::Interbase;
//!
//! // An interbase coordinate.
//! let coordinate = Coordinate::<Interbase>::try_new("seq0", "+", 0)?;
//! println!("{:#}", coordinate);
//!
//! // An in-base coordinate.
//! let coordinate = Coordinate::<Base>::try_new("seq0", "+", 1)?;
//! println!("{:#}", coordinate);
//!
//! # Ok::<(), Box<dyn std::error::Error>>(())
//! ```
//!
//! For convenience, the crate also provides type aliases for the interbase and
//! in-base variants of the relevant concepts. For example, instead of writing
//! out [`Position<Interbase>`], you can simply import
//! [`interbase::Position`](crate::position::interbase::Position).
//!
//! ```
//! use omics_coordinate::interbase::Coordinate;
//!
//! let coordinate = Coordinate::try_new("seq0", "+", 0)?;
//! println!("{:#}", coordinate);
//!
//! # Ok::<(), Box<dyn std::error::Error>>(())
//! ```
//!
//! # Background
//!
//! Comprehensive, authoritative material on coordinate systems can be
//! surprisingly hard to find and, thus, the topic has a reputation for being
//! confusing to newcomers to the field. To address this lack of material and
//! to describe
//! how terms are used within this crate, the authors lay out their
//! understanding of the history behind the terminology used in the community
//! and then cover their perspective on what terms are most appropriate to be
//! used within different contexts. Notably, this may not match the worldview of
//! other popular resources or papers out there. In these cases, departures from
//! convention are noted alongside carefully reasoned opinions on why the
//! departure was made.
//!
//! ## Biology Primer
//!
//! Before diving into the coordinate system-specific details, we must first lay
//! some groundwork for terms used within genomics in general. These definitions
//! serve as a quick overview to orient you to the discussion around coordinate
//! systems—if you're interested in more detailed information, you can learn
//! more at [https://learngenomics.dev](https://learngenomics.dev).
//!
//! * A **genome** is the complete set of genetic code stored within a cell
//!   ([learn more](https://www.genome.gov/genetics-glossary/Genome)).
//! * **Deoxyribonucleic acid**, or **DNA**, is a molecule that warehouses
//!   the aforementioned genetic code. In eukaryotic cells, DNA resides
//!   primarily in the nucleus of a cell.
//!   * DNA is stored as a sequence of **nucleotides** (i.e., `A`, `C`, `G`,
//!     and `T`).
//!   * DNA is double-stranded, meaning there are two complementary sequences
//!     of nucleotides that run antiparallel to each other.
//! * **Ribonucleic acid**, or **RNA**, is a molecule that is _transcribed_ from
//!   a particular stretch of DNA.
//!   * RNA is _also_ stored as a sequence of nucleotides (though, in this
//!     case, the nucleotides are `A`, `C`, `G`, and `U`).
//!   * RNA is single-stranded, meaning that it represents the transcription
//!     of only one of the strands of DNA.
//!   * RNA generally either (a) serves as a template for the production of a
//!     protein or (b) has some functional role in and of itself.
//! * **Proteins** are macromolecules that are assembled by _translating_ the
//!   nucleotide sequence stored within an RNA molecule into a chain of amino
//!   acids. Proteins play a wide variety of roles in the function of a cell.
//!
//! Though there are exceptions to this rule, the core idea is this: through a
//! series of steps described within [the central dogma of molecular
//! biology](https://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology),
//! genetic code stored within DNA is commonly transcribed to RNA and either (a)
//! the RNA is used as a template to assemble a functional protein through the
//! process of translation [in the case of _coding_ RNA], or (b) that RNA plays
//! some functional role in and of itself [in the case of _non-coding_ RNA].
//!
//! This crate attempts to provide facilities to effectively describe
//! coordinates within the context of DNA molecules and RNA molecules in the
//! various notations used within the community. We'll start with the most
//! granular concepts (e.g., contigs, positions, and strands) and work our way
//! up to the most broad reaching concepts (e.g., intervals and coordinate
//! systems).
//!
//! ## Contigs
//!
//! Typically, genetic information that constitutes a genome is not stored as a
//! single, contiguous molecule. Instead, genomes are commonly broken up into
//! multiple, contiguous molecules of DNA known as **chromosomes**. Beyond the
//! chromosomes, other sequences, such as the [Epstein–Barr virus][chrEBV], the
//! [mitochondrial genome][chrMT], or decoy sequences are inserted as contigs
//! within a reference genome to serve various purposes. This broader category
//! of contiguous nucleotide sequences is colloquially referred to as
//! "contigs".
//!
//! As we learn more about the human genome, new versions, called **genome
//! builds**, are released that describe the known genetic sequence therein.
//! Each contig contained within a particular genome build is assigned a unique
//! identifier within that build (e.g., `chr1` within the `hg38` genome build).
//! Specifying the contiguous molecule upon which a coordinate is located is the
//! first step in anchoring the coordinate within a genome.
//!
//! For example, the [most recent release][t2t-genome] ([ref][t2t-publication])
//! of the human genome at the time of writing has _exactly_ 24 contigs—these
//! represent the 22 autosomes and the X/Y sex chromosomes present in the human
//! genome. Interestingly, earlier versions of the human genome, such as
//! [GRCh37][grch37-genome] and [GRCh38][grch38-genome], contain more contigs
//! that represent phenomena such as unplaced sequences (i.e., sequences that
//! we know are located _somewhere_ in the human genome, but we didn't know
//! exactly where when the reference genome was released) and unlocalized
//! sequences (i.e., sequences where we know the chromosome upon which the
//! sequence was located but not the exact position).
//!
//! #### Design Considerations
//!
//! There are no current or planned restrictions on what a contig can be named,
//! as the crate needs to remain able to support all possible use cases. That
//! said, the authors may introduce (optional) convenience methods based on
//! common naming conventions in the future, such as the detection of `chr`
//! prefixes, which is a convention for the naming of chromosomes specifically.
//!
//! ## Positions
//!
//! This section lays out a detailed, conceptual model within which we can
//! compare and contrast the two kinds of positions used within genomic
//! coordinate systems: namely, _in-base_ positions and _interbase_ positions.
//! We then cover how these terms relate to commonly used terms in the community
//! (including a "0-based, half-open coordinate system" and a "1-based,
//! fully-closed coordinate system") and how you can use this crate to flexibly
//! represent a spectrum of locations within a genome.
//!
//! Before we begin, a word of caution—many materials attempt to make the
//! differences between in-base and interbase positions (or the closely related
//! 0-based, half-open and 1-based, fully-closed coordinate systems) appear
//! small and unremarkable (e.g., by providing seemingly straightforward
//! formulas to convert between the two). In fact, after a quick scan of these
//! materials, you may even be tempted to view the two systems as simply a
//! difference in accounting and off-by-one hoopla!
//!
//! In the authors' opinion, not only is this not true, it also doesn't serve
//! you well to think of the coordinate systems as anything less than entirely
//! different universes that must be explicitly and responsibly traversed
//! between. To be clear, we're not suggesting that the existing materials are
//! _wrong_—often, you can follow the conventions laid out, and, as long as the
//! baked-in assumptions are consistently true for your use case, everything
//! will be well. That said, we endeavour to go further within this crate: to
//! explore the very fabric of these coordinate systems, point out the
//! assumptions made in each coordinate system, and enable you to understand and
//! write code that works across the spectrum of possible position
//! representations.
//!
//! #### In-base and Interbase Positions
//!
//! Positions within a genomic coordinate system can be represented as either
//! _in-base_ positions or _interbase_ positions:
//!
//! * **In-base** positions point directly to and fully encapsulate a
//! nucleotide. These types of positions are generally considered to be
//! intuitive from a biological reasoning standpoint and are often used in
//! contexts where data is reported back to a biological audience (e.g.,
//! genome browsers and public variant databases). Though we use the term
//! "in-base" exclusively in this document, these types of positions are also
//! sometimes referred to as simply "base" positions in the broader community.
//! * **Interbase** positions point to the spaces _between_ nucleotides. These
//! positions are generally considered to be easier to work with
//! computationally for a variety of reasons that will become apparent in the
//! text that follows. It is also possible to unambiguously represent certain
//! types of variation, such as insertions and structural variant breakpoints,
//! using interbase positions. As such, interbase positions are commonly used
//! as the internal representation of positions within bioinformatics tools as
//! well as in situations where the output is meant to be consumed
//! computationally (e.g., APIs).
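//!
//! As a minimal sketch of why interbase positions suit insertions, the
//! example below describes an insertion by the single space slot where the
//! new nucleotides land. It uses only the standard library, not this crate's
//! API: the `insert_at` helper is hypothetical, and the sequence is assumed to
//! be ASCII.
//!
//! ```rust
//! /// Inserts `inserted` at the space slot `interbase` (hypothetical helper).
//! ///
//! /// Assumes an ASCII sequence, so byte offsets equal nucleotide offsets.
//! fn insert_at(sequence: &str, interbase: usize, inserted: &str) -> Option<String> {
//!     // Space slots run from 0 (before the first nucleotide) to the
//!     // sequence length (after the last nucleotide).
//!     if interbase > sequence.len() {
//!         return None;
//!     }
//!
//!     let mut result = String::with_capacity(sequence.len() + inserted.len());
//!     result.push_str(&sequence[..interbase]);
//!     result.push_str(inserted);
//!     result.push_str(&sequence[interbase..]);
//!     Some(result)
//! }
//!
//! fn main() {
//!     // Insert `CC` into the space between the third and fourth nucleotides.
//!     assert_eq!(insert_at("GATATGA", 3, "CC").as_deref(), Some("GATCCATGA"));
//!     // Slot 0 sits before the first nucleotide.
//!     assert_eq!(insert_at("GATATGA", 0, "CC").as_deref(), Some("CCGATATGA"));
//!     // Slot 8 does not exist on a sequence of seven nucleotides.
//!     assert_eq!(insert_at("GATATGA", 8, "CC"), None);
//! }
//! ```
//!
//! With in-base positions, the same insertion would need an extra convention
//! (for example, "insert after nucleotide 3"), which is where ambiguity tends
//! to creep in.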
//!
//! For example, SAM files, which are intended to be human-readable, use in-base
//! positions to make themselves more easily interpretable and compatible with
//! genomic databases. Their non-human-readable, binary counterparts, known as
//! BAM files, use interbase positions for the reasons described above. The
//! decision on which coordinate system to use was largely based on the
//! distinction on how the two file types were meant to be consumed (to learn
//! more about what the author of SAM/BAM said about the decision, read the end
//! of [this StackExchange
//! answer](https://bioinformatics.stackexchange.com/a/17757)).
//!
//! #### Conceptual Model
//!
//! Here, we introduce a conceptual model that is useful for comparing and
//! contrasting the two coordinate systems. Under this model, nucleotides and
//! the spaces between them are pulled apart and considered to coexist as
//! independent entities laid out along a discrete axis. Both nucleotides and
//! spaces represent a "slot", and the kind of slot may be distinguished by
//! designating it as a "nucleotide slot" and a "space slot" respectively.
//! Numbered positions are assigned equidistantly at every other slot within
//! either system, but the type of slot where positions are assigned is mutually
//! exclusive between the two systems:
//!
//! * Numbered positions are assigned to each of the nucleotide slots within the
//! in-base coordinate system.
//! * Numbered positions are assigned to each of the space slots within the
//! interbase coordinate system.
//!
//! Importantly, in both systems, **only slots with an assigned position can be
//! specified using a position**. This has incredibly important implications on
//! what locations can and cannot be expressed within the two coordinate
//! systems.
//!
//! The diagram below depicts the model applied over a short sequence of seven
//! nucleotides. Each slot has a series of double pipe characters (`║`) that
//! links a slot with its assigned, numbered position (if it exists) within the
//! in-base and interbase coordinate systems. Note that, though the two
//! position systems are displayed in parallel in the diagram below, that is
//! only so that they can be compared/contrasted more easily. More specifically,
//! **they do not interact with each other in any way**.
//!
//! ```text
//! ========================== seq0 =========================
//! •   G   •   A   •   T   •   A   •   T   •   G   •   A   •
//! ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║
//! ║[--1--]║[--2--]║[--3--]║[--4--]║[--5--]║[--6--]║[--7--]║ In-base Positions
//! 0       1       2       3       4       5       6       7 Interbase Positions
//! ```
//!
//! As was alluded to above, reasoning about the in-base coordinate system under
//! this model is relatively straightforward—if one wants to create a position
//! representing the location of the first nucleotide (`G`), it can be done by
//! simply denoting the numbered position assigned to the same slot as the `G`
//! nucleotide, which is position `1`.
//!
//! Creating a position that represents the same nucleotide using the interbase
//! coordinate system is more complicated. Recall that (a) no numbered positions
//! are assigned to nucleotide slots within the interbase coordinate system and
//! (b) only numbered slots may be referenced as a position. As such, referring
//! to the first nucleotide using a single, numbered position is impossible.
//! Indeed, in a strict sense, a _range_ of numbered positions must be used to
//! encapsulate even this single nucleotide (🤯)—namely, the range `[0-1]` (note
//! that the range of interbase positions is generally considered _exclusive_,
//! but that does not apply here when the space slots and nucleotide slots are
//! split).
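//!
//! The relationship described above can be sketched in a few lines of plain
//! Rust. This is an illustration only: `flanking_interbase` is a hypothetical
//! helper rather than part of this crate, and it assumes the conventional
//! starting positions (`0` for interbase, `1` for in-base).
//!
//! ```rust
//! /// Returns the pair of interbase positions (space slots) that flank the
//! /// nucleotide at the in-base position `base` (hypothetical helper).
//! fn flanking_interbase(base: u64) -> Option<(u64, u64)> {
//!     // In-base positions start at 1, so 0 does not name a nucleotide slot.
//!     if base == 0 {
//!         return None;
//!     }
//!
//!     Some((base - 1, base))
//! }
//!
//! fn main() {
//!     // The first nucleotide (`G`) sits between space slots 0 and 1.
//!     assert_eq!(flanking_interbase(1), Some((0, 1)));
//!     // The last nucleotide (`A`) sits between space slots 6 and 7.
//!     assert_eq!(flanking_interbase(7), Some((6, 7)));
//!     // There is no nucleotide at in-base position 0.
//!     assert_eq!(flanking_interbase(0), None);
//! }
//! ```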
//!
//! #### Starting Position
//!
//! By convention within the community, interbase positions almost always start
//! at position zero (`0`) and in-base positions almost always start at position
//! one (`1`). As far as the authors can tell, this is for three main reasons
//! (please contribute to the docs if you disagree with any of these assertions
//! or know of other reasons):
//!
//! * **History.** Biological coordinate systems and databases have historically
//! started with the first entity of a sequence at position `1`. Thus, in-base
//! coordinates (which, again, are generally considered to be more suitable
//! for a broader biological audience) tend to follow these same conventions.
//! Because interbase positions effectively capture the space _around_ these
//! entities, a number before one is needed to represent the space before the
//! first entity.
//! * **Intention.** This interplay works out well, as interbase coordinates
//! depart from a biologically intuitive model in favor of a more
//! computationally intuitive model. To that end, interbase positions
//! typically mirror programming languages in that counting starts at `0`.
//! This suggests that, many times, interbase coordinates are a more natural
//! fit for existing data structures and algorithms.
//! * **Convention.** Beyond the reasons above (and, further, not strictly
//! imposed by the definitions of interbase and in-base coordinate systems),
//! the community has evolved to use the starting position of `0` or `1` to
//! allude to the use of interbase and in-base positions, respectively.
//!
//! ## Strand
//!
//! DNA is a double-stranded molecule that stores genetic code. This means that
//! two sequences of complementary nucleotides run in antiparallel. This is
//! often referred to as being read from [5' to
//! 3'](https://en.wikipedia.org/wiki/Directionality_%28molecular_biology%29),
//! referring to connections within the underlying chemical structure. For
//! example, below is a fictional double-stranded molecule with the name `seq0`.
//!
//! ```text
//! ---------------- Read this direction --------------->
//!
//! 5'                                                 3'
//! ===================== seq0 (+) ======================
//! G   A   T   A   T   G   A   A   T   A   T   G   A   G
//! |   |   |   |   |   |   |   |   |   |   |   |   |   |
//! C   T   A   T   A   C   T   T   A   T   A   C   T   C
//! ===================== seq0 (-) ======================
//! 3'                                                 5'
//!
//! <--------------- Read this direction ----------------
//! ```
//!
//! In a real-world, biological context, both strands contain genetic
//! information that is important to the function of the cell—though both
//! strands are biologically important, _some_ system of labelling must be
//! introduced to distinguish which of the two strands a genomic coordinate is
//! located on.
//!
//! To address this, a reference genome selects one of the strands as the
//! **positive** strand (also called the "sense" strand, the "reference" strand,
//! or the `+` strand) for each contiguous molecule. This implies that the
//! opposite, complementary strand is the **negative** strand (also called the
//! "antisense" strand, the "complementary" strand, or the `-` strand). Notably,
//! reference genomes only specify the nucleotide sequence for the _positive_
//! strand, as the negative strand's nucleotide sequence may be computed as the
//! reverse complement of the positive strand.
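//!
//! As a small illustration of that last point, here is one way the negative
//! strand of the fictional `seq0` molecule could be derived from its positive
//! strand. The `reverse_complement` function is a hypothetical helper written
//! against the standard library only; it is not provided by this crate, and it
//! handles only the unambiguous `A`, `C`, `G`, and `T` nucleotides.
//!
//! ```rust
//! /// Computes the reverse complement of a DNA sequence (hypothetical helper).
//! ///
//! /// Returns `None` if a character other than `A`, `C`, `G`, or `T` is found.
//! fn reverse_complement(sequence: &str) -> Option<String> {
//!     sequence
//!         .chars()
//!         // Reverse first, so the result reads 5' to 3' on the other strand.
//!         .rev()
//!         // Then swap each nucleotide for its complement.
//!         .map(|nucleotide| match nucleotide {
//!             'A' => Some('T'),
//!             'C' => Some('G'),
//!             'G' => Some('C'),
//!             'T' => Some('A'),
//!             _ => None,
//!         })
//!         .collect()
//! }
//!
//! fn main() {
//!     let positive = "GATATGAATATGAG";
//!     let negative = reverse_complement(positive);
//!     assert_eq!(negative.as_deref(), Some("CTCATATTCATATC"));
//! }
//! ```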
//!
//! The concept of strandedness is useful when describing the location of a
//! coordinate on a molecule with two strands. Some nucleic acid molecules, such
//! as RNA, are single-stranded: RNA is _derived_ from a particular strand of
//! DNA, but the RNA molecule itself is not considered to be stranded.
//!
//! Within this crate, a [`Strand`] always refers to the strand of the
//! coordinate upon a molecule (if the molecule is stranded). If the molecule
//! upon which the nucleotide(s) sit is _not_ stranded, then no strand should be
//! specified.
//!
//! This means that:
//!
//! * Coordinates that lie upon a DNA molecule must always have a strand. The
//! [`Strand::Positive`] and [`Strand::Negative`] variants are used to
//! distinguish which strand a coordinate sits upon relative to the strand
//! specified in the reference genome.
//! * Coordinates that lie upon an RNA molecule have no strand. In particular,
//! the original strand of DNA from which a position on RNA is derived is
//! lost during any conversion from one to the other. If it is of interest,
//! you may keep track of this kind of thing on your own at conversion time.
//!
//! ## Intervals
//!
//! Intervals describe a range of positions upon a contiguous molecule.
//! Generally speaking, you can think of an interval as simply a start
//! coordinate and end coordinate within one of the coordinate systems.
//! Intervals are always closed _with respect to their comprising coordinates_.
//!
//! The following figure illustrates this concept using the notation described
//! in [the position section of the docs](#positions).
//!
//! ```text
//! ========================== seq0 ===========================
//! •   G   •   A   •   T   •   A   •   T   •   G   •   A   •
//! ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║   ║
//! ║   1   ║   2   ║   3   ║   4   ║   5   ║   6   ║   7   ║ In-base Positions
//! 0       1       2       3       4       5       6       7 Interbase Positions
//! ===========================================================
//! ┃   ┃                                               ┃   ┃
//! ┃   ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛   ┃ seq0:+:1-7 (In-base interval)
//! ┃                Both contain "GATATGA"                 ┃
//! ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛ seq0:+:0-7 (Interbase interval)
//! ```
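//!
//! The figure above can be mirrored with a short, standard-library-only
//! sketch. The `slice_in_base` and `slice_interbase` helpers are hypothetical
//! (they are not part of this crate), they only consider the positive strand,
//! and they assume an ASCII sequence.
//!
//! ```rust
//! /// Extracts the nucleotides covered by an in-base interval, where both the
//! /// start and end nucleotides are included (hypothetical helper).
//! fn slice_in_base(sequence: &str, start: usize, end: usize) -> Option<&str> {
//!     // In-base positions start at 1 and the interval must not be reversed.
//!     if start == 0 || start > end {
//!         return None;
//!     }
//!
//!     sequence.get(start - 1..end)
//! }
//!
//! /// Extracts the nucleotides enclosed between two interbase positions
//! /// (hypothetical helper).
//! fn slice_interbase(sequence: &str, start: usize, end: usize) -> Option<&str> {
//!     if start > end {
//!         return None;
//!     }
//!
//!     sequence.get(start..end)
//! }
//!
//! fn main() {
//!     let seq0 = "GATATGA";
//!
//!     // `seq0:+:1-7` (in-base) and `seq0:+:0-7` (interbase) cover the same
//!     // nucleotides.
//!     assert_eq!(slice_in_base(seq0, 1, 7), Some("GATATGA"));
//!     assert_eq!(slice_interbase(seq0, 0, 7), Some("GATATGA"));
//!
//!     // A single nucleotide is `1-1` in-base but `0-1` interbase.
//!     assert_eq!(slice_in_base(seq0, 1, 1), Some("G"));
//!     assert_eq!(slice_interbase(seq0, 0, 1), Some("G"));
//!
//!     // An interbase interval with equal ends encloses no nucleotides.
//!     assert_eq!(slice_interbase(seq0, 3, 3), Some(""));
//! }
//! ```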
//!
//! # Crate Design
//!
//! Throughout the crate, you will see references to interbase and in-base
//! variants of the concepts above. For example, there is a core [`Position`]
//! struct that is defined like so:
//!
//! ```ignore
//! pub struct Position<S>
//! where
//!     S: System,
//! {
//!     // private fields
//! }
//! ```
//!
//! The struct takes a single, generic parameter that is a [`System`]. In this
//! design, functionality that is fundamental to both interbase and in-base
//! position types is implemented in the core [`Position`] struct.
//! Functionality that is different between the two coordinate systems is
//! implemented through traits (in the case of positions, the [`Position`
//! trait](crate::position::trait::Position)) and exposed through
//! trait-constrained methods (e.g., [`Position::checked_add`]).
//! Note that some concepts, such as [`Contig`] and [`Strand`], are coordinate
//! system invariant. As such, they don't take a [`System`] generic type
//! parameter.
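//!
//! To make the pattern more concrete, here is a deliberately simplified,
//! standalone sketch of a struct that is generic over a coordinate system. It
//! reuses names from this crate for readability, but every item below
//! (including the `FIRST` constant and the `get` method) is a hypothetical
//! stand-in defined in the example itself; the crate's real definitions may
//! look quite different.
//!
//! ```rust
//! use std::marker::PhantomData;
//!
//! /// A stand-in for a coordinate system marker trait.
//! trait System {
//!     /// The smallest position value the system allows (an assumption of
//!     /// this sketch, following the conventional starting positions).
//!     const FIRST: u64;
//! }
//!
//! /// Marker type for the interbase coordinate system.
//! struct Interbase;
//!
//! /// Marker type for the in-base coordinate system.
//! struct Base;
//!
//! impl System for Interbase {
//!     const FIRST: u64 = 0;
//! }
//!
//! impl System for Base {
//!     const FIRST: u64 = 1;
//! }
//!
//! /// A position tagged at compile time with its coordinate system.
//! struct Position<S: System> {
//!     value: u64,
//!     system: PhantomData<S>,
//! }
//!
//! impl<S: System> Position<S> {
//!     /// Shared constructor: the system-specific rule comes from the trait.
//!     fn try_new(value: u64) -> Option<Self> {
//!         if value < S::FIRST {
//!             return None;
//!         }
//!
//!         Some(Self { value, system: PhantomData })
//!     }
//!
//!     /// Returns the underlying numeric value.
//!     fn get(&self) -> u64 {
//!         self.value
//!     }
//! }
//!
//! fn main() {
//!     // Zero is a valid interbase position but not a valid in-base position.
//!     assert!(Position::<Interbase>::try_new(0).is_some());
//!     assert!(Position::<Base>::try_new(0).is_none());
//!     assert_eq!(Position::<Base>::try_new(1).map(|p| p.get()), Some(1));
//! }
//! ```
//!
//! Because the system is part of the type, a function that accepts a
//! `Position<Interbase>` cannot be handed a `Position<Base>` by accident; the
//! mix-up is rejected at compile time rather than surfacing as an off-by-one
//! error at runtime.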
//!
//! ## Learning More
//!
//! In the original writing of these docs, it was difficult to find a single,
//! authoritative source regarding all of the conventions and assumptions that
//! go into coordinate systems. Here are a few links that the authors consulted
//! when writing this crate.
//!
//! * [This blog post](https://genome-blog.gi.ucsc.edu/blog/2016/12/12/the-ucsc-genome-browser-coordinate-counting-systems/)
//!   from the UCSC genome browser team does a pretty good job explaining the
//!   basics of 0-based versus 1-based coordinate systems and why they are used
//!   in different contexts.
//!   * Note that this crate does not follow the conventions UCSC uses for
//!     formatting the two coordinate systems differently (e.g., `seq0 0 1` for
//!     0-based coordinates and `seq0:1-1` for 1-based coordinates). Instead,
//!     the two coordinate systems are distinguished by the Rust type system
//!     and are serialized similarly (e.g., `seq0:+:0-1` for 0-based
//!     coordinates and `seq0:+:1-1` for 1-based coordinates).
//! * [This blog post](https://tidyomics.com/blog/2018/12/09/2018-12-09-the-devil-0-and-1-coordinate-system-in-genomics/)
//!   also presents the two coordinate systems and gives some details about
//!   concrete file formats where each is used.
//! * [This cheat sheet](https://www.biostars.org/p/84686/) is a popular
//!   community resource (though you should be sure to read the comments!).
//!
//! [chrEBV]: https://en.wikipedia.org/wiki/Epstein%E2%80%93Barr_virus
//! [grch37-genome]: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13/
//! [grch38-genome]: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/
//! [chrMT]: https://en.wikipedia.org/wiki/Mitochondrial_DNA
//! [t2t-genome]: https://www.ncbi.nlm.nih.gov/assembly/GCF_009914755.1/
//! [t2t-publication]: https://www.science.org/doi/10.1126/science.abj6987
pub use Contig;
pub use Coordinate;
pub use base;
pub use interbase;
pub use Interval;
pub use Position;
pub use Strand;
pub use System;