//! # liblrge
//!
//! `liblrge` is a Rust library that provides utilities for estimating genome size for a given set
//! of reads.
//!
//! You can find a command-line interface (CLI) tool that uses this library in the [`lrge`][lrge] crate.
//!
//! [lrge]: https://crates.io/crates/lrge
//!
//! ## Usage
//!
//! The library provides two strategies for estimating genome size:
//!
//! ### [`TwoSetStrategy`]
//!
//! The two-set strategy uses two (random) sets of reads to estimate the genome size. The query set, which is
//! generally smaller, is overlapped against a target set of reads. A genome size estimate is generated
//! for each read in the query set, based on the number of overlaps and the average read length.
//! The median of these estimates is taken as the final genome size estimate.
//!
//! ```no_run
//! use liblrge::{Estimate, TwoSetStrategy};
//! use liblrge::twoset::{Builder, DEFAULT_TARGET_NUM_READS, DEFAULT_QUERY_NUM_READS};
//!
//! let input = "path/to/reads.fastq"; // or .fasta, .bam, .cram, .sam
//! let mut strategy = Builder::new()
//!     .target_num_reads(DEFAULT_TARGET_NUM_READS)
//!     .query_num_reads(DEFAULT_QUERY_NUM_READS)
//!     .threads(4)
//!     .build(input);
//!
//! let est_result = strategy.estimate(false, None, None).expect("Failed to generate estimate");
//! let estimate = est_result.estimate;
//! // do something with the estimate
//! ```
//!
//! ### [`AvaStrategy`]
//!
//! The all-vs-all (ava) strategy takes a (random) set of reads and overlaps it against itself to
//! estimate the genome size. A genome size estimate is generated for each read in the set, based on the
//! number of overlaps and the average read length, excluding the read being assessed. The median of these
//! estimates is taken as the final genome size estimate.
//!
//! ```no_run
//! use liblrge::{Estimate, AvaStrategy};
//! use liblrge::ava::{Builder, DEFAULT_AVA_NUM_READS};
//!
//! let input = "path/to/reads.fastq"; // or .fasta, .bam, .cram, .sam
//! let mut strategy = Builder::new()
//!     .num_reads(DEFAULT_AVA_NUM_READS)
//!     .threads(4)
//!     .build(input);
//!
//! let est_result = strategy.estimate(false, None, None).expect("Failed to generate estimate");
//! let estimate = est_result.estimate;
//! // do something with the estimate
//! ```
//!
//! ## Features
//!
//! This library includes optional support for compressed file formats and alignment formats, controlled by feature flags.
//! By default, the `compression` and `alignment` features are enabled.
//!
//! ### Available Features
//!
//! - **compression** (default): Enables all available compression formats (`gzip`, `zstd`, `bzip2`, `xz`).
//! - **alignment** (default): Enables support for unaligned BAM, CRAM, and SAM formats using the [`noodles`][noodles] crate.
//! - **gzip**: Enables support for gzip-compressed files (`.gz`) using the [`flate2`][flate2] crate.
//! - **zstd**: Enables support for zstd-compressed files (`.zst`) using the [`zstd`][zstd] crate.
//! - **bzip2**: Enables support for bzip2-compressed files (`.bz2`) using the [`bzip2`][bzip2] crate.
//! - **xz**: Enables support for xz-compressed files (`.xz`) using the [`liblzma`][xz] crate.
//!
//! ### Enabling and Disabling Features
//!
//! All features are enabled by default. You can selectively enable or disable them
//! in your `Cargo.toml` to reduce dependencies or target specific formats.
//!
//! To **disable all optional features**:
//!
//! ```toml
//! liblrge = { version = "0.2.2", default-features = false }
//! ```
//!
//! To enable only specific features, list them in `Cargo.toml`:
//!
//! ```toml
//! liblrge = { version = "0.2.2", default-features = false, features = ["gzip", "alignment"] }
//! ```
//!
//! ## Format Detection
//!
//! The library uses [**magic bytes**][magic] at the start of the file to detect its compression
//! format and content type before deciding how to read it. Supported formats include:
//! - **FASTX**: FASTA and FASTQ (via `needletail`).
//! - **Alignment**: BAM, CRAM, and SAM (via `noodles`). Alignment files must be **unaligned**.
//! - **Compression**: gzip, zstd, bzip2, and xz (automatic decompression if the [appropriate feature](#features) is enabled).
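//!
//! As a rough illustration of magic-byte detection, the sketch below checks the leading bytes of a
//! buffer against the well-known signatures of the four compression formats. The `Compression`
//! enum and the `sniff_compression` function are hypothetical names used only for this example;
//! they are not part of the `liblrge` API, and the library's own detection may differ.
//!
//! ```rust
//! /// Compression formats recognised by this sketch (hypothetical type).
//! #[derive(Debug, PartialEq)]
//! enum Compression {
//!     Gzip,
//!     Zstd,
//!     Bzip2,
//!     Xz,
//!     None,
//! }
//!
//! /// Guess the compression format from the first bytes of a file.
//! fn sniff_compression(header: &[u8]) -> Compression {
//!     if header.starts_with(&[0x1f, 0x8b]) {
//!         Compression::Gzip
//!     } else if header.starts_with(&[0x28, 0xb5, 0x2f, 0xfd]) {
//!         Compression::Zstd
//!     } else if header.starts_with(b"BZh") {
//!         Compression::Bzip2
//!     } else if header.starts_with(&[0xfd, 0x37, 0x7a, 0x58, 0x5a, 0x00]) {
//!         Compression::Xz
//!     } else {
//!         // No known signature: treat the input as uncompressed.
//!         Compression::None
//!     }
//! }
//! ```
//!
//! In this sketch, a buffer beginning with `0x1f 0x8b` is classified as gzip, while plain FASTA
//! text falls through to `Compression::None`.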
//!
//! [flate2]: https://crates.io/crates/flate2
//! [zstd]: https://crates.io/crates/zstd
//! [xz]: https://crates.io/crates/liblzma
//! [bzip2]: https://crates.io/crates/bzip2
//! [noodles]: https://crates.io/crates/noodles
//! [magic]: https://en.wikipedia.org/wiki/Magic_number_(programming)#In_files
//!
//! ## Disabling logging
//!
//! `liblrge` emits log messages via the [`log`][log] crate. To suppress them, configure the
//! logging level in your application. For example, using the [`env_logger`][env_logger] crate
//! you can do the following:
//!
//! ```
//! use log::LevelFilter;
//!
//! let mut log_builder = env_logger::Builder::new();
//! log_builder
//!     .filter(None, LevelFilter::Info)
//!     .filter_module("liblrge", LevelFilter::Off);
//! log_builder.init();
//!
//! // Your application code here
//! ```
//!
//! This will set the global logging level to `Info` and disable all logging from the `liblrge` library.
//!
//! [log]: https://crates.io/crates/log
//! [env_logger]: https://crates.io/crates/env_logger
//! [doi]: https://doi.org/10.1101/2024.11.27.625777
pub
pub
use rand::rngs::StdRng;
use rand::SeedableRng;
pub use AvaStrategy;
pub use Estimate;
pub use TwoSetStrategy;
use std::str::FromStr;
/// A type alias for `Result` with [`LrgeError`][crate::error::LrgeError] as the error type.
pub type Result<T> = std::result::Result<T, crate::error::LrgeError>;
/// The sequencing platform used to generate the reads.
///
/// # Examples
///
/// ```
/// use std::str::FromStr;
/// use liblrge::Platform;
///
/// for platform in ["pacbio", "pb"] {
///     assert_eq!(Platform::from_str(platform).unwrap(), Platform::PacBio);
/// }
///
/// for platform in ["nanopore", "ont"] {
///     assert_eq!(Platform::from_str(platform).unwrap(), Platform::Nanopore);
/// }
/// ```
/// Generate a shuffled list of `k` indices drawn from the range `0..n`.
///
/// # Arguments
///
/// * `k`: The number of indices to generate.
/// * `n`: The maximum value for the range (exclusive).
/// * `seed`: An optional seed for the random number generator.
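///
/// The idea can be sketched with the standard library alone as a partial Fisher-Yates shuffle.
/// This is an illustration under stated assumptions, not this function's implementation: the
/// `sample_indices` and `splitmix64` names are hypothetical, the indices are assumed to be
/// distinct, the seed is a plain `u64` rather than an optional value, and a small SplitMix64
/// generator stands in for a proper random number generator (the modulo step carries a
/// negligible bias for realistic `n`).
///
/// ```rust
/// /// One SplitMix64 step: a tiny deterministic generator used only for illustration.
/// fn splitmix64(state: &mut u64) -> u64 {
///     *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
///     let mut z = *state;
///     z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
///     z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
///     z ^ (z >> 31)
/// }
///
/// /// Return `min(k, n)` distinct indices from `0..n` in shuffled order (hypothetical helper).
/// fn sample_indices(k: usize, n: usize, seed: u64) -> Vec<usize> {
///     let k = k.min(n);
///     let mut pool: Vec<usize> = (0..n).collect();
///     let mut state = seed;
///     for i in 0..k {
///         // Pick a position in `i..n` and swap it into slot `i`.
///         let j = i + (splitmix64(&mut state) % (n - i) as u64) as usize;
///         pool.swap(i, j);
///     }
///     // The first `k` slots now hold the sample.
///     pool.truncate(k);
///     pool
/// }
/// ```
///
/// Calling `sample_indices(5, 20, 42)` twice returns the same five indices, which is the
/// reproducibility a fixed seed is meant to provide.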
pub