//! This crate provides parsers for sequences in FASTA and FASTQ format.
//!
//! # Library design
//! In bioinformatics, sequences are usually stored in files in FASTA or FASTQ format,
//! which are often compressed with gzip. This makes a total of four formats: FASTA with
//! and without gzip, and FASTQ with and without gzip. The purpose of the crate is to provide
//! a parser that can automatically detect the format of the file and parse it without
//! the user having to know beforehand which format is being used. **The file format is detected
//! from the first two bytes of the file, and does not depend on the file extension**.
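//!
//! As a rough illustration of content-based detection, and not necessarily how this crate implements it,
//! the check can be sketched as below. The names `Format` and `sniff` are hypothetical. The sketch assumes
//! the standard gzip magic bytes `0x1f 0x8b`, and that an uncompressed FASTA file starts with `>` and an
//! uncompressed FASTQ file with `@`.
//!
//! ```rust
//! #[derive(Debug, PartialEq)]
//! enum Format {
//!     Fasta,
//!     Fastq,
//!     Gzip,
//!     Unknown,
//! }
//!
//! // Hypothetical helper: classify a byte stream by its leading bytes only.
//! fn sniff(prefix: &[u8]) -> Format {
//!     match prefix {
//!         [0x1f, 0x8b, ..] => Format::Gzip, // gzip magic number
//!         [b'>', ..] => Format::Fasta,      // FASTA header marker
//!         [b'@', ..] => Format::Fastq,      // FASTQ header marker
//!         _ => Format::Unknown,
//!     }
//! }
//!
//! fn main() {
//!     assert_eq!(sniff(b">seq1\nACGT\n"), Format::Fasta);
//!     assert_eq!(sniff(b"@read1\nACGT\n+\nIIII\n"), Format::Fastq);
//!     assert_eq!(sniff(&[0x1f, 0x8b, 0x08]), Format::Gzip);
//! }
//! ```
//!
//! In this sketch a gzipped input is only recognized as gzip. Telling FASTA from FASTQ inside it would
//! require applying the same check to the first decompressed bytes.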
//!
//! We use dynamic dispatch to hide the details of the file format from the user. This introduces
//! an overhead of one dynamic dispatch per sequence, which is likely negligible unless the sequences
//! are extremely short. This also allows us to support reading from any byte stream, such as the standard
//! input, without having to attach generic parameters onto the parser. The interface is implemented for
//! the struct [reader::DynamicFastXReader]. There is also [reader::StaticFastXReader] that takes the input
//! stream as a generic parameter.
//!
//! A sequence is represented with a [record::RefRecord] struct that points to slices in the internal buffers of the reader.
//! This is to avoid allocating new memory for each sequence. There also exists [record::OwnedRecord] which owns the memory.
//!
//! Since the readers stream over the data, we cannot implement the Rust Iterator trait. An Iterator hands out items
//! that may be kept until the end of the iteration, but a [record::RefRecord] is only valid until the next read
//! overwrites the reader's internal buffers. To support iterators, we provide the [seq_db::SeqDB] struct that
//! concatenates all sequences, headers and quality values in memory and provides an iterator over them.
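//!
//! The borrowing pattern behind this restriction can be sketched with a hypothetical `LineStream` type,
//! which is not part of this crate. It reuses a single buffer, so each returned slice borrows the reader
//! and must be released before the next call. `Iterator::next` cannot express an item tied to `&mut self`.
//!
//! ```rust
//! // Hypothetical streaming reader that reuses one buffer for every line.
//! struct LineStream<'d> {
//!     data: &'d [u8],
//!     buf: Vec<u8>,
//! }
//!
//! impl<'d> LineStream<'d> {
//!     // The returned slice points into `self.buf`, which the next call overwrites.
//!     fn read_next(&mut self) -> Option<&[u8]> {
//!         if self.data.is_empty() {
//!             return None;
//!         }
//!         let len = self.data.len();
//!         let end = self.data.iter().position(|&b| b == b'\n').unwrap_or(len);
//!         self.buf.clear();
//!         self.buf.extend_from_slice(&self.data[..end]);
//!         // Skip past the newline, if there was one.
//!         self.data = &self.data[(end + 1).min(len)..];
//!         Some(self.buf.as_slice())
//!     }
//! }
//!
//! // Each `line` is dropped at the end of the loop body, before the next read.
//! fn total_len(input: &[u8]) -> usize {
//!     let mut stream = LineStream { data: input, buf: Vec::new() };
//!     let mut total = 0;
//!     while let Some(line) = stream.read_next() {
//!         total += line.len();
//!     }
//!     total
//! }
//!
//! fn main() {
//!     assert_eq!(total_len(b"ACGT\nGG\n"), 6);
//! }
//! ```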
//!
//! # Examples
//!
//! ## Streaming all sequences in a file and printing them to the standard error.
//!
//! ```
//! use jseqio::reader::*;
//! fn main() -> Result<(), Box<dyn std::error::Error>>{
//!     // Reading from a FASTQ file. Also works for FASTA,
//!     // and seamlessly with/without gzip compression.
//!     let mut reader = DynamicFastXReader::from_file(&"tests/data/reads.fastq.gz")?;
//!     while let Some(rec) = reader.read_next().unwrap() {
//!         // Headers do not include the leading '>' in FASTA or '@' in FASTQ.
//!         eprintln!("Header: {}", std::str::from_utf8(rec.head)?);
//!         eprintln!("Sequence: {}", std::str::from_utf8(rec.seq)?);
//!         if let Some(qual) = rec.qual{
//!             // Quality values are present only in fastq files.
//!             eprintln!("Quality values: {}", std::str::from_utf8(qual)?);
//!         }
//!     }
//!     Ok(())
//! }
//! ```
//!
//! ## Loading sequences into memory and computing the total length using an iterator.
//!
//! ```
//! use jseqio::reader::DynamicFastXReader;
//! fn main() -> Result<(), Box<dyn std::error::Error>>{
//!     let reader = DynamicFastXReader::from_file(&"tests/data/reads.fna")?;
//!     let db = reader.into_db()?;
//!     let total_length = db.iter().fold(0_usize, |sum, rec| sum + rec.seq.len());
//!     eprintln!("Total sequence length: {}", total_length);
//!     Ok(())
//! }
//! ```
//!
use std::path::Path;
// This could be just a bool, but in the future we may want to support other compression types as well
// Returns (file type, is_gzipped)