§bcf_reader
This is an attempt to create a small, lightweight, pure-Rust library to allow efficient, cross-platform access to genotype data in BCF files.
Currently, the rust_htslib
crate works only on Linux and macOS (not Windows?).
The noodles crate is a pure Rust library for many bioinformatic file formats and
works across Windows, Linux, and macOS. However, the noodles
API for reading
genotype data from BCF files can be slow due to its memory allocation patterns.
Additionally, both crates have a large number of dependencies, as they provide
many features and support a wide range of file formats.
One way to address the memory allocation and dependency issues is to manually
parse BCF records according to the specification
(<https://samtools.github.io/hts-specs/VCFv4.2.pdf>) and use iterators whenever
possible, especially for per-sample fields such as GT and AD.
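To illustrate the idea of parsing typed values lazily, here is a minimal sketch using only the standard library. It is not this crate's implementation: the helper names `split_descriptor`, `type_width`, and `iter_ints` are hypothetical. It assumes the BCF2 typed-value layout (type code in the low 4 bits of the descriptor byte, element count in the high 4 bits, little-endian payload) and does not handle the overflow case where a count of 15 means the real count follows as a separate typed integer.

```rust
/// Split a BCF2 typed-descriptor byte into (type code, element count).
/// Assumption: a count of 15 (length stored separately) is not handled here.
fn split_descriptor(byte: u8) -> (u8, usize) {
    (byte & 0x0f, (byte >> 4) as usize)
}

/// Width in bytes of one element for a BCF2 type code.
fn type_width(typ: u8) -> Option<usize> {
    match typ {
        1 => Some(1), // int8
        2 => Some(2), // int16
        3 => Some(4), // int32
        5 => Some(4), // float32
        7 => Some(1), // char
        _ => None,
    }
}

/// Lazily yield little-endian integers of `width` bytes from `buf`
/// without collecting them into a Vec. `width` must be 1, 2, or 4.
fn iter_ints(buf: &[u8], width: usize) -> impl Iterator<Item = i32> + '_ {
    buf.chunks_exact(width).map(move |c| match width {
        1 => c[0] as i8 as i32,
        2 => i16::from_le_bytes([c[0], c[1]]) as i32,
        4 => i32::from_le_bytes([c[0], c[1], c[2], c[3]]),
        _ => panic!("unsupported integer width"),
    })
}
```

For example, the descriptor byte `0x22` announces two int16 values, so `iter_ints(&[0x0a, 0x00, 0xff, 0xff], 2)` yields 10 and then -1 while borrowing the buffer rather than copying it.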
Note: This crate is in its early stages of development.
§Usage
use bcf_reader::*;

let mut reader = smart_reader("testdata/test2.bcf").unwrap();
let header = Header::from_string(&read_header(&mut reader).unwrap()).unwrap();

// find key for a field in INFO or FORMAT or FILTER
let key = header.get_idx_from_dictionary_str("FORMAT", "GT").unwrap();

// access header dictionary
let d = &header.dict_strings()[&key];
assert_eq!(d["ID"], "GT");
assert_eq!(d["Dictionary"], "FORMAT");

// get chromosome name
assert_eq!(header.get_chrname(0), "Pf3D7_01_v3");

let fmt_ad_key = header
    .get_idx_from_dictionary_str("FORMAT", "AD")
    .expect("FORMAT/AD not found");
let info_af_key = header
    .get_idx_from_dictionary_str("INFO", "AF")
    .expect("INFO/AF not found");

// this can and should be reused across records to reduce allocation
let mut record = Record::default();
while record.read(&mut reader).is_ok() {
    let pos = record.pos();

    // use byte ranges and the shared buffer to get allele string values
    let allele_byte_ranges = record.alleles();
    let share_buf = record.buf_shared();
    let ref_rng = &allele_byte_ranges[0];
    let ref_allele_str = std::str::from_utf8(&share_buf[ref_rng.clone()]).unwrap();
    let alt1_rng = &allele_byte_ranges[1];
    let alt1_allele_str = std::str::from_utf8(&share_buf[alt1_rng.clone()]).unwrap();
    // ...

    // access FORMAT/GT via iterator
    for nv in record.fmt_gt(&header) {
        let nv = nv.unwrap();
        let (has_no_ploidy, is_missing, is_phased, allele_idx) = nv.gt_val();
        // ...
    }

    // access FORMAT/AD via iterator
    for nv in record.fmt_field(fmt_ad_key) {
        let nv = nv.unwrap();
        match nv.int_val() {
            None => {}
            Some(ad) => {
                // ...
            }
        }
        // ...
    }

    // access FILTERS via iterator
    record.filters().for_each(|nv| {
        let nv = nv.unwrap();
        let filter_key = nv.int_val().unwrap() as usize;
        let dict_string_map = &header.dict_strings()[&filter_key];
        let filter_name = &dict_string_map["ID"];
        // ...
    });

    // access INFO/AF via iterator
    record.info_field_numeric(info_af_key).for_each(|nv| {
        let nv = nv.unwrap();
        let af = nv.float_val().unwrap();
        // ...
    });
}
More examples of accessing each field/column are available in the docs of Record and Header.
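As background for the flags returned when iterating over FORMAT/GT, the following sketch decodes one raw GT value by hand. It is an illustration only, not this crate's code: the `GtAllele` type and `decode_gt_i8` function are hypothetical, and the sketch assumes the int8 encoding from the BCF2 specification, where each value is `(allele + 1) << 1 | phased`, a missing allele is stored as allele index -1, and `0x81` marks end-of-vector padding for samples with lower ploidy.

```rust
/// Decoded form of one raw int8 GT value (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum GtAllele {
    /// Padding for a sample whose ploidy is lower than the record maximum.
    EndOfVector,
    /// Missing allele (`.`) together with its phasing bit.
    Missing { phased: bool },
    /// Allele index (0 = REF, 1 = first ALT, ...) together with its phasing bit.
    Called { allele: u8, phased: bool },
}

/// Decode one int8-encoded GT value.
/// Assumption: wider encodings (int16, int32) are not covered by this sketch.
fn decode_gt_i8(raw: u8) -> GtAllele {
    const INT8_END_OF_VECTOR: u8 = 0x81;
    if raw == INT8_END_OF_VECTOR {
        return GtAllele::EndOfVector;
    }
    // The lowest bit is the phasing flag; the remaining bits hold allele + 1.
    let phased = raw & 1 == 1;
    match raw >> 1 {
        0 => GtAllele::Missing { phased },
        n => GtAllele::Called { allele: n - 1, phased },
    }
}
```

Under these assumptions, the genotype `0|1` is stored as the bytes `0x02 0x05`: the first decodes to an unphased REF allele, the second to a phased first ALT allele.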
§Reader types
- For a reader with parallelized decompression, see BcfReader.
- For a parallelized indexed reader, see IndexedBcfReader.
- For the lower-level reader underlying BcfReader and IndexedBcfReader, see ParMultiGzipReader.
§flate2 backends
By default, a Rust backend is used. The other flate2 backends, zlib and
zlib-ng-compat, are exposed as crate features of the same names (zlib and
zlib-ng-compat). See https://docs.rs/flate2/latest/flate2/ for more details.
Structs§
- BcfReader: A reader suitable for reading through a BCF file sequentially.
- Csi: A struct representing the content of a CSI index file.
- CsiBin: A bin in the CSI data structure.
- CsiChunk: A chunk within a bin in the CSI data structure.
- GenomeInterval: A genome interval defined by chromosome id, start, and end positions.
- Header: Represents the header of a BCF file.
- IndexedBcfReader: Allows random access to a specific genome interval of the BCF file using a CSI index file. It is a wrapper around ParMultiGzipReader<BufReader<File>> that allows parallelizable bgzip decompression.
- NumericValueIter: Iterator for accessing arrays of numeric values (integers or floats) directly from the buffer bytes without building Vec<> or Vec<Vec<>> for each site.
- ParMultiGzipReader: This reader facilitates parallel decompression of BCF data compressed in the BGZF format, a specialized version of the multi-member gzip file format. It uses internal buffers to sequentially ingest compressed data from multiple gzip blocks and relies on the rayon crate to decompress them concurrently. This avoids the bottleneck that occurs when decompression is not executed in parallel.
- ParseGzipHeaderError
- QuotedSplitter: An iterator used to split a str by a separator, ignoring separators that fall within pairs of quotes.
- Record: Represents a record (a line or a site) in a BCF file.
- VirtualFileOffsets: Virtual file offset used to jump to a specific indexed bin within BCF-format genotype data separated into BGZF blocks.
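For readers unfamiliar with BGZF virtual file offsets, here is a minimal sketch of the standard packing described in the SAM/BAM specification: the upper 48 bits hold the byte offset of the start of a BGZF block in the compressed file, and the lower 16 bits hold the offset within that block after decompression. The function names are hypothetical, and how VirtualFileOffsets stores these values internally is not shown on this page.

```rust
/// Pack a compressed-file block offset and a within-block offset
/// into a single 64-bit BGZF virtual file offset.
/// Assumption: `block_start` fits in 48 bits.
fn pack_virtual_offset(block_start: u64, within_block: u16) -> u64 {
    (block_start << 16) | u64::from(within_block)
}

/// Split a BGZF virtual file offset back into
/// (compressed block offset, offset within the decompressed block).
fn unpack_virtual_offset(voffset: u64) -> (u64, u16) {
    (voffset >> 16, (voffset & 0xffff) as u16)
}
```

To jump to such an offset, a reader would seek to the block start in the compressed file, decompress that block, and then skip the within-block number of bytes.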
Enums§
- Error
- NumericValue: Represents a numeric value in the context of the bcf-reader.
- ParseHeaderError
Functions§
- bcf2_typ_width: Map a BCF2 type to its width in bytes.
- iter_typed_integers: Generate an iterator of numbers from a contiguous byte buffer.
- read_header: Read the header lines into a String. Use Header::from_string(text) to convert the string into structured data.
- read_single_typed_integer: Read a single typed integer from the reader (of a decompressed BCF buffer).
- read_typed_descriptor_bytes: Read a typed descriptor from the reader (of a decompressed BCF buffer).
- read_typed_string: Read a typed string from the reader into a Rust String.
- smart_reader: Open a file from a path as a MultiGzDecoder or a BufReader, depending on whether the file starts with the gzip magic number (0x1f and 0x8b).
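The magic-number check mentioned for smart_reader can be sketched with the standard library alone. This is an illustration under stated assumptions, not the crate's code: `looks_like_gzip` is a hypothetical helper, and it assumes the reader's buffer already exposes at least the first two bytes of the stream, which is typically the case for a BufReader over a file.

```rust
use std::io::{self, BufRead};

/// Peek at the buffered bytes and report whether the stream starts with
/// the gzip magic number (0x1f, 0x8b). No bytes are consumed, so the
/// caller can still hand the same reader to a decoder afterwards.
fn looks_like_gzip<R: BufRead>(reader: &mut R) -> io::Result<bool> {
    // fill_buf returns the currently buffered bytes without advancing.
    let buf = reader.fill_buf()?;
    Ok(buf.starts_with(&[0x1f, 0x8b]))
}
```

A caller could branch on the result to wrap the reader in a gzip decoder or to use it directly for uncompressed input.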