Crate bed_reader

§bed-reader

Read and write the PLINK BED format, simply and efficiently.

§Highlights

  • Fast and multi-threaded
  • Supports many indexing methods. Slice data by individuals (samples) and/or SNPs (variants).
  • The Python-facing APIs for this library are used by PySnpTools, FaST-LMM, and PyStatGen.
  • Supports PLINK 1.9.
  • Read data locally or from the cloud, efficiently and directly.

§Install

Full version: Can read local and cloud files

cargo add bed-reader

Minimal version: Can read local files only

cargo add bed-reader --no-default-features

§Examples

Read all genotype data from a .bed file.

use ndarray as nd;
use bed_reader::{Bed, ReadOptions, assert_eq_nan, sample_bed_file};

let file_name = sample_bed_file("small.bed")?;
let mut bed = Bed::new(file_name)?;
let val = ReadOptions::builder().f64().read(&mut bed)?;

assert_eq_nan(
    &val,
    &nd::array![
        [1.0, 0.0, f64::NAN, 0.0],
        [2.0, 0.0, f64::NAN, 2.0],
        [0.0, 1.0, 2.0, 0.0]
    ],
);
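
The values 0.0, 1.0, 2.0, and NaN above are allele counts and a missing-value marker. As a rough sketch of where they come from, the function below decodes one byte in the general PLINK 1 .bed layout (four genotypes per byte, two bits each, lowest-order bits first). The name `decode_byte` is hypothetical and this is not the crate's own code.

```rust
/// Decode one byte of PLINK 1 .bed genotype data into four values.
/// Hypothetical helper for illustration; not part of bed_reader's API.
fn decode_byte(byte: u8, count_a1: bool, missing: f64) -> [f64; 4] {
    // Lookup table indexed by the 2-bit code:
    // 0b00 = homozygous allele 1, 0b01 = missing,
    // 0b10 = heterozygous,        0b11 = homozygous allele 2.
    let table = if count_a1 {
        [2.0, missing, 1.0, 0.0]
    } else {
        [0.0, missing, 1.0, 2.0]
    };
    let mut out = [0.0; 4];
    for (i, slot) in out.iter_mut().enumerate() {
        // Genotype i sits in bits 2*i and 2*i + 1.
        let code = (byte >> (2 * i)) & 0b11;
        *slot = table[code as usize];
    }
    out
}
```

Calling `decode_byte(0b1110_0100, true, f64::NAN)` yields 2.0, NaN, 1.0, 0.0 for the four genotypes, in that order.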

Read every second individual (sample) and SNPs (variants) 20 (inclusive) to 30 (exclusive).

use ndarray::s;
use bed_reader::{Bed, ReadOptions, sample_bed_file};

let file_name = sample_bed_file("some_missing.bed")?;
let mut bed = Bed::new(file_name)?;
let val = ReadOptions::builder()
    .iid_index(s![..;2])
    .sid_index(20..30)
    .f64()
    .read(&mut bed)?;

assert!(val.dim() == (50, 10));
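
The shape (50, 10) follows from the two index expressions. The std-only sketch below reproduces the arithmetic, assuming the sample file holds 100 individuals. The helper names are hypothetical.

```rust
/// Number of positions selected by `start..end` with a positive `step`.
/// Hypothetical helper; `step` must be greater than zero.
fn selected_count(start: usize, end: usize, step: usize) -> usize {
    (start..end).step_by(step).count()
}

/// Expected output shape, assuming a file with 100 individuals.
fn expected_dim() -> (usize, usize) {
    let iid_count = selected_count(0, 100, 2); // every second individual
    let sid_count = selected_count(20, 30, 1); // SNPs 20 up to, not including, 30
    (iid_count, sid_count)
}
```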

List the first 5 individual (sample) ids, the first 5 SNP (variant) ids, and every unique chromosome. Then, read every genomic value in chromosome 5.

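use std::collections::HashSet;
use ndarray::s;
use bed_reader::{Bed, ReadOptions, sample_bed_file};

let file_name = sample_bed_file("some_missing.bed")?;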
let mut bed = Bed::new(file_name)?;
println!("{:?}", bed.iid()?.slice(s![..5])); // Outputs ndarray: ["iid_0", "iid_1", "iid_2", "iid_3", "iid_4"]
println!("{:?}", bed.sid()?.slice(s![..5])); // Outputs ndarray: ["sid_0", "sid_1", "sid_2", "sid_3", "sid_4"]
println!("{:?}", bed.chromosome()?.iter().collect::<HashSet<_>>());
// Outputs: {"12", "10", "4", "8", "19", "21", "9", "15", "6", "16", "13", "7", "17", "18", "1", "22", "11", "2", "20", "3", "5", "14"}
let val = ReadOptions::builder()
    .sid_index(bed.chromosome()?.map(|elem| elem == "5"))
    .f64()
    .read(&mut bed)?;

assert!(val.dim() == (100, 6));
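
The `sid_index` above is a Boolean mask with one entry per SNP. A minimal std-only sketch of that idea, using plain slices in place of ndarray arrays, might look like this. Both function names are hypothetical.

```rust
/// Build a Boolean mask that is true where the chromosome equals `target`.
fn chromosome_mask(chromosomes: &[&str], target: &str) -> Vec<bool> {
    chromosomes.iter().map(|c| *c == target).collect()
}

/// Convert a Boolean mask into the index positions it selects.
fn mask_to_positions(mask: &[bool]) -> Vec<usize> {
    mask.iter()
        .enumerate()
        .filter_map(|(i, &keep)| if keep { Some(i) } else { None })
        .collect()
}
```

The number of `true` entries in the mask gives the second dimension of the result, which is 6 in the example above.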

From the cloud: open a file and read data for one SNP (variant) at index position 2. (See “Cloud URLs and CloudFile Examples” for details on specifying a file in the cloud.)

use ndarray as nd;
use bed_reader::{assert_eq_nan, BedCloud, ReadOptions};
// `.await` requires an async context, such as an async fn driven by an async runtime.
let url = "https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/small.bed";
let mut bed_cloud = BedCloud::new(url).await?;
let val = ReadOptions::builder().sid_index(2).f64().read_cloud(&mut bed_cloud).await?;
assert_eq_nan(&val, &nd::array![[f64::NAN], [f64::NAN], [2.0]]);

§Main Functions

| Function | Description |
| --- | --- |
| Bed::new or Bed::builder | Open a local PLINK .bed file for reading genotype data and metadata. |
| BedCloud::new, BedCloud::new_with_options, BedCloud::builder, BedCloud::builder_with_options, BedCloud::from_cloud_file, BedCloud::builder_from_cloud_file | Open a cloud PLINK .bed file for reading genotype data and metadata. |
| ReadOptions::builder | Read genotype data from a local or cloud file. Supports indexing and options. |
| WriteOptions::builder | Write values to a local file in PLINK .bed format. Supports metadata and options. |

§Bed Metadata Methods

After using Bed::new or Bed::builder to open a PLINK .bed file for reading, use these methods to see metadata.

| Method | Description |
| --- | --- |
| iid_count | Number of individuals (samples) |
| sid_count | Number of SNPs (variants) |
| dim | Number of individuals and SNPs |
| fid | Family id of each individual (sample) |
| iid | Individual id of each individual (sample) |
| father | Father id of each individual (sample) |
| mother | Mother id of each individual (sample) |
| sex | Sex of each individual (sample) |
| pheno | A phenotype for each individual (seldom used) |
| chromosome | Chromosome of each SNP (variant) |
| sid | SNP id of each SNP (variant) |
| cm_position | Centimorgan position of each SNP (variant) |
| bp_position | Base-pair position of each SNP (variant) |
| allele_1 | First allele of each SNP (variant) |
| allele_2 | Second allele of each SNP (variant) |
| metadata | All the metadata, returned as a Metadata struct |
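
The per-individual fields above match the six whitespace-separated columns of a PLINK .fam file. The sketch below parses one such line using only the standard library. `FamRow` and `parse_fam_line` are hypothetical names for illustration, and the crate's own parsing may differ.

```rust
/// One row of a PLINK .fam file (hypothetical struct for illustration).
#[derive(Debug, PartialEq)]
struct FamRow {
    fid: String,
    iid: String,
    father: String,
    mother: String,
    sex: i32, // PLINK codes: 1 = male, 2 = female, 0 = unknown
    pheno: String,
}

/// Parse one whitespace-delimited .fam line; returns None if malformed.
fn parse_fam_line(line: &str) -> Option<FamRow> {
    let mut fields = line.split_whitespace();
    let row = FamRow {
        fid: fields.next()?.to_string(),
        iid: fields.next()?.to_string(),
        father: fields.next()?.to_string(),
        mother: fields.next()?.to_string(),
        sex: fields.next()?.parse().ok()?,
        pheno: fields.next()?.to_string(),
    };
    // Reject lines with more than six columns.
    if fields.next().is_some() {
        return None;
    }
    Some(row)
}
```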

§ReadOptions

When using ReadOptions::builder to read genotype data, use these options to specify a desired numeric type, which individuals (samples) to read, which SNPs (variants) to read, etc.

| Option | Description |
| --- | --- |
| i8 | Read values as i8 |
| f32 | Read values as f32 |
| f64 | Read values as f64 |
| iid_index | Index of individuals (samples) to read (defaults to all) |
| sid_index | Index of SNPs (variants) to read (defaults to all) |
| f | Order of the output array, Fortran-style (default) |
| c | Order of the output array, C-style |
| is_f | Is order of the output array Fortran-style? (defaults to true) |
| missing_value | Value to use for missing values (defaults to -127 or NaN) |
| count_a1 | Count the number of allele 1 (default) |
| count_a2 | Count the number of allele 2 |
| is_a1_counted | Is allele 1 counted? (defaults to true) |
| num_threads | Number of threads to use (defaults to all processors) |
| max_concurrent_requests | Maximum number of concurrent async requests (defaults to 10). Used by BedCloud. |
| max_chunk_bytes | Maximum chunk size of async requests (defaults to 8_000_000 bytes). Used by BedCloud. |
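
The missing_value default depends on the output type: -127 for i8 and NaN for the float types. One way to model that is a small trait, sketched below. `DefaultMissing` and `resolve_missing` are hypothetical names, not the crate's internal trait.

```rust
/// Hypothetical trait giving each output type its default missing value.
trait DefaultMissing {
    fn default_missing() -> Self;
}

impl DefaultMissing for i8 {
    fn default_missing() -> Self {
        -127
    }
}

impl DefaultMissing for f32 {
    fn default_missing() -> Self {
        f32::NAN
    }
}

impl DefaultMissing for f64 {
    fn default_missing() -> Self {
        f64::NAN
    }
}

/// An explicit missing value wins; otherwise use the type's default.
fn resolve_missing<T: DefaultMissing>(explicit: Option<T>) -> T {
    explicit.unwrap_or_else(T::default_missing)
}
```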

§Index Expressions

Select which individuals (samples) and SNPs (variants) to read by using these iid_index and/or sid_index expressions.

| Example | Type | Description |
| --- | --- | --- |
| nothing | () | All |
| 2 | isize | Index position 2 |
| -1 | isize | Last index position |
| vec![0, 10, -2] | Vec<isize> | Index positions 0, 10, and 2nd from last |
| [0, 10, -2] | [isize] and [isize;n] | Index positions 0, 10, and 2nd from last |
| ndarray::array![0, 10, -2] | ndarray::Array1<isize> | Index positions 0, 10, and 2nd from last |
| 10..20 | Range<usize> | Index positions 10 (inclusive) to 20 (exclusive). Note: Rust ranges don’t support negatives |
| ..=19 | RangeToInclusive<usize> | Index positions 0 (inclusive) to 19 (inclusive). Note: Rust ranges don’t support negatives |
| any Rust ranges | Range*<usize> | Note: Rust ranges don’t support negatives |
| s![10..20;2] | ndarray::SliceInfo1 | Index positions 10 (inclusive) to 20 (exclusive) in steps of 2 |
| s![-20..-10;-2] | ndarray::SliceInfo1 | 10th from last (exclusive) to 20th from last (inclusive), in steps of -2 |
| vec![true, false, true] | Vec<bool> | Index positions 0 and 2. |
| [true, false, true] | [bool] and [bool;n] | Index positions 0 and 2. |
| ndarray::array![true, false, true] | ndarray::Array1<bool> | Index positions 0 and 2. |
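
Negative positions in the isize-based expressions count from the end. The sketch below shows one plausible way to resolve them against a known count. `resolve_index` and `resolve_all` are hypothetical helpers, not the crate's implementation.

```rust
/// Resolve a possibly negative index against `count` items.
/// Negative values count from the end, so -1 is the last position.
/// Returns None when the index is out of bounds.
fn resolve_index(index: isize, count: usize) -> Option<usize> {
    let resolved = if index < 0 {
        count.checked_sub(index.unsigned_abs())?
    } else {
        index as usize
    };
    if resolved < count {
        Some(resolved)
    } else {
        None
    }
}

/// Resolve a list of indices; fails if any single index is out of bounds.
fn resolve_all(indices: &[isize], count: usize) -> Option<Vec<usize>> {
    indices.iter().map(|&i| resolve_index(i, count)).collect()
}
```

With 100 items, `[0, 10, -2]` resolves to positions 0, 10, and 98.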

§Environment Variables

If ReadOptionsBuilder::num_threads or WriteOptionsBuilder::num_threads is not specified, the number of threads to use is determined by these environment variables (in order of priority):

  • BED_READER_NUM_THREADS
  • NUM_THREADS

If neither of these environment variables is set, all processors are used.
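
The priority order can be sketched with the standard library alone. In the hypothetical function below, `explicit` stands in for a builder's num_threads setting and `lookup` stands in for reading an environment variable. Skipping a value that does not parse is an assumption of this sketch, and the crate may instead report an error.

```rust
use std::thread;

/// Pick a thread count: explicit setting first, then the two environment
/// variables in priority order, then all available processors.
/// Hypothetical helper; not part of bed_reader's API.
fn resolve_num_threads<F>(explicit: Option<usize>, lookup: F) -> usize
where
    F: Fn(&str) -> Option<String>,
{
    if let Some(n) = explicit {
        return n;
    }
    for name in ["BED_READER_NUM_THREADS", "NUM_THREADS"] {
        // Assumption: a value that is not a valid usize is skipped.
        if let Some(n) = lookup(name).and_then(|v| v.trim().parse::<usize>().ok()) {
            return n;
        }
    }
    // Neither variable is usable: fall back to all available processors.
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}
```

In real use, the lookup could be `|name| std::env::var(name).ok()`. Passing a closure keeps the priority logic easy to exercise without changing the process environment.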

  • BED_READER_DATA_DIR

Any requested sample file will be downloaded to this directory. If the environment variable is not set, a cache folder, appropriate to the OS, will be used.

Traits§

  • A trait alias, used internally, for the values of a .bed file, namely i8, f32, f64.
  • A trait alias, used internally, to provide default missing values for i8, f32, f64.

Functions§

  • True if and only if two 2-D arrays are equal, within a given tolerance and possibly treating NaNs as values.
  • Asserts two 2-D arrays are equal, treating NaNs as values.
  • Returns the local path to a sample .bed file. If necessary, the file will be downloaded.
  • Returns the cloud location of a sample .bed file as a URL string.
  • Returns the local path to a sample file. If necessary, the file will be downloaded.
  • Returns the local paths to a list of files. If necessary, the files will be downloaded.
  • Returns the cloud location of a sample file as a URL string.
  • Returns the cloud locations of a list of files as URL strings.