Crate mzdata


mzdata provides basic access to raw and processed mass spectrometry data formats in Rust.

For a guide, see the tutorial section.

The library currently supports reading:

  1. MGF files using MGFReader in mzdata::io::mgf
  2. mzML & indexedmzML files using MzMLReader in mzdata::io::mzml
  3. mzMLb files using MzMLbReader in mzdata::io::mzmlb, if the mzmlb feature is enabled
  4. Thermo RAW files using ThermoRawReader in mzdata::io::thermo, if the thermo feature is enabled
  5. Bruker TDF files using TDFSpectrumReader in mzdata::io::tdf, if the bruker_tdf feature is enabled

and writing:

  1. MGF files using MGFWriter in mzdata::io::mgf
  2. mzML & indexedmzML files using MzMLWriter in mzdata::io::mzml
  3. mzMLb files using MzMLbWriter in mzdata::io::mzmlb, if the mzmlb feature is enabled

The format of a file, and whether or not it is gzip-compressed, can be inferred from a path or an io::Read using io::infer_format and io::infer_from_stream. Conventional dispatch is possible through MZReader. The mz_read macro provides a convenient way to work with a reader with zero added overhead, but only within a limited scope. The mz_write macro is the equivalent for opening a writer. There are additional tools for dealing with file format dispatch in MassSpectrometryReadWriteProcess.
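
To give a rough idea of what this kind of inference involves, here is a minimal std-only sketch that checks a stream for the gzip magic bytes and guesses a format from a file name. The GuessedFormat enum and the functions is_gzipped and guess_from_path are hypothetical names invented for this illustration; this is not how mzdata implements io::infer_format or io::infer_from_stream, which should be preferred in real code.

```rust
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;

/// Hypothetical stand-in for a format enum; not an mzdata type.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum GuessedFormat {
    Mgf,
    MzML,
    MzMLb,
    ThermoRaw,
    Unknown,
}

/// Report whether a seekable stream starts with the gzip magic bytes
/// (0x1f 0x8b), then rewind to where the stream was before the check.
pub fn is_gzipped<R: Read + Seek>(stream: &mut R) -> io::Result<bool> {
    let start = stream.stream_position()?;
    let mut magic = [0u8; 2];
    let gzipped = match stream.read_exact(&mut magic) {
        Ok(()) => magic == [0x1f, 0x8b],
        // Fewer than two bytes available: cannot be a gzip stream.
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => false,
        Err(e) => return Err(e),
    };
    stream.seek(SeekFrom::Start(start))?;
    Ok(gzipped)
}

/// Guess a format and a gzip flag from the file name alone.
/// The extension list is an assumption made for this sketch.
pub fn guess_from_path(path: &Path) -> (GuessedFormat, bool) {
    let name = path
        .file_name()
        .and_then(|n| n.to_str())
        .unwrap_or("")
        .to_ascii_lowercase();
    // Peel off a trailing ".gz" before looking at the real extension.
    let (stem, gzipped) = match name.strip_suffix(".gz") {
        Some(stem) => (stem, true),
        None => (name.as_str(), false),
    };
    let format = if stem.ends_with(".mzml") {
        GuessedFormat::MzML
    } else if stem.ends_with(".mzmlb") {
        GuessedFormat::MzMLb
    } else if stem.ends_with(".mgf") {
        GuessedFormat::Mgf
    } else if stem.ends_with(".raw") {
        GuessedFormat::ThermoRaw
    } else {
        GuessedFormat::Unknown
    };
    (format, gzipped)
}
```

Checking the content as well as the name matters because a file name can be wrong or absent, for example when reading from a pipe, which is presumably why the library offers both a path-based and a stream-based entry point.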

mzdata also includes a set of representation layers for spectra in mzdata::spectrum.

§Example

use mzdata::prelude::*;
use mzpeaks::Tolerance;
use mzdata::MZReader;
use mzdata::spectrum::SignalContinuity;

// MZReader dispatches to the reader that matches the file's format.
let reader = MZReader::open_path("./test/data/small.mzML").unwrap();
for spectrum in reader {
    println!("Scan {} => BP {}", spectrum.id(), spectrum.peaks().base_peak().mz);

    if spectrum.signal_continuity() == SignalContinuity::Centroid {
        let peak_picked = spectrum.into_centroid().unwrap();
        println!("Matches for 579.155: {:?}",
                 peak_picked.peaks.all_peaks_for(
                    579.155, Tolerance::Da(0.02)
                )
        );
    }
}

mzdata uses mzpeaks to represent peaks and peak lists, and re-exports the basic types. While the high-level types are generic over simple peak types, more complex, application-specific peak types can be substituted. See mzdata::spectrum::bindata for more information about how to directly convert data arrays to peak lists.
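
The tolerance search in the example above can be pictured with a small std-only sketch: given peaks sorted by m/z, two binary searches bound the window around the query. SimplePeak and peaks_within are hypothetical names for this illustration and are not mzpeaks types or methods; mzpeaks provides all_peaks_for and richer peak types for real use.

```rust
/// Hypothetical minimal peak type; mzpeaks provides richer ones.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SimplePeak {
    pub mz: f64,
    pub intensity: f32,
}

/// Return the peaks whose m/z lies within `tol_da` daltons of `query`.
///
/// `peaks` must be sorted by ascending m/z, because the window is located
/// with two binary searches rather than a linear scan.
pub fn peaks_within(peaks: &[SimplePeak], query: f64, tol_da: f64) -> &[SimplePeak] {
    let tol = tol_da.abs();
    // First peak at or above the lower bound.
    let lo = peaks.partition_point(|p| p.mz < query - tol);
    // First peak strictly above the upper bound.
    let hi = peaks.partition_point(|p| p.mz <= query + tol);
    &peaks[lo..hi]
}
```

With peaks at m/z 579.14 and 579.16 among others, calling peaks_within(&peaks, 579.155, 0.02) would return just those two, since both fall inside the window from 579.135 to 579.175.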

§Traits

The library makes heavy use of traits to abstract over the implementation details of different file formats. These traits are collected in mzdata::prelude, which also re-exports mzpeaks::prelude.
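
As a toy illustration of why this helps, the sketch below defines one trait and writes a function against it, so the function works with any reader that implements the trait. ToySpectrumSource, ListReader, CountingReader, and collect_ids are all hypothetical names made up for this example; mzdata's actual traits are richer and are documented in mzdata::prelude.

```rust
/// Hypothetical trait standing in for a format-agnostic reader interface.
pub trait ToySpectrumSource {
    /// Return the identifier of the next spectrum, or `None` at the end.
    fn next_id(&mut self) -> Option<String>;
}

/// Toy reader that yields identifiers from an in-memory list.
pub struct ListReader {
    ids: std::vec::IntoIter<String>,
}

impl ListReader {
    pub fn new(ids: Vec<String>) -> Self {
        Self { ids: ids.into_iter() }
    }
}

impl ToySpectrumSource for ListReader {
    fn next_id(&mut self) -> Option<String> {
        self.ids.next()
    }
}

/// Toy reader that synthesizes identifiers from a counter.
pub struct CountingReader {
    next: usize,
    end: usize,
}

impl CountingReader {
    pub fn new(count: usize) -> Self {
        Self { next: 0, end: count }
    }
}

impl ToySpectrumSource for CountingReader {
    fn next_id(&mut self) -> Option<String> {
        if self.next >= self.end {
            return None;
        }
        self.next += 1;
        Some(format!("scan={}", self.next))
    }
}

/// Works with any reader, without knowing the format behind it.
pub fn collect_ids<S: ToySpectrumSource>(source: &mut S) -> Vec<String> {
    let mut out = Vec::new();
    while let Some(id) = source.next_id() {
        out.push(id);
    }
    out
}
```

Code written like collect_ids needs no changes when a new reader type is added, which is the property the prelude traits are meant to provide across file formats.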

§Features

mzdata provides many optional features, some of which are self-contained, while others layer functionality.

TL;DR: Unless you are already using ndarray-linalg in your dependency graph, you should enable mzsignal + nalgebra.

The mzsignal crate provides signal processing, peak picking, and feature finding functionality. Part of this behavior requires a linear algebra implementation, and mzsignal is flexible about which one it uses. It can use either nalgebra, a pure Rust library that is self-contained but optimized for small matrices, or ndarray-linalg, which requires an external LAPACK library to be available at build time or run time and so depends on software outside the basic Rust ecosystem. Enabling the mzsignal feature requires one of the following features:

  • nalgebra - No external dependencies.
  • openblas - Requires OpenBlas (see https://crates.io/crates/ndarray-linalg)
  • intel-mkl - Requires Intel’s Math Kernel Library (see https://crates.io/crates/ndarray-linalg)
  • netlib - Requires NETLIB (see https://crates.io/crates/ndarray-linalg)

§File Formats

mzdata supports reading several file formats, some of which add large dependencies and can be opted into or out of.

Feature    | File Format              | Dependency
mzmlb      | mzMLb                    | HDF5 C shared library at runtime or statically linked with hdf5-rs, possibly a C compiler
thermo     | Thermo-Fisher RAW Format | .NET runtime at build time and runtime, possibly a C compiler
bruker_tdf | Bruker TDF Format        | SQLite3 C library at runtime or statically linked with rusqlite, requires mzsignal for flattening spectra

Additionally, mzML and MGF are supported by default, but they can be disabled by skipping default features and not enabling the mzml and mgf features.

To complicate matters, the hdf5_static feature combined with mzmlb statically links the HDF5 C library and zlib together to avoid symbol collisions with other compression libraries used by mzdata.

§Compression

mzdata uses flate2 to compress and decompress zlib-type compressed streams, and there are four different backends available with different tradeoffs in speed and build convenience:

  • zlib - The historical implementation. Faster than miniz_oxide and consistently produces the best compression. Requires a nearly ubiquitous C library at build time.
  • zlib-ng-compat - The fastest backend for both compression and decompression, with compression ratios that are often the best or close to it. Requires a C library or a C compiler at build time.
  • zlib-ng - Requires a C library or a C compiler at build time. I encountered build errors with this backend, but your mileage may vary.
  • miniz_oxide - Pure Rust backend, the slowest in practice.

mzdata was also a test-bed for some experimental compression techniques.

  • zstd - Enables layered Zstandard and byte shuffling + dictionary encoding methods.
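
For readers unfamiliar with byte shuffling, the sketch below shows the general technique on 64-bit floats: the bytes of each value are regrouped so that all first bytes come first, then all second bytes, and so on. Values of similar magnitude share their high-order bytes, so the shuffled buffer contains long similar runs that a general-purpose compressor handles well. The function names and the little-endian layout are assumptions made for this illustration; the exact layout mzdata uses with the zstd feature may differ.

```rust
/// Byte-shuffle a slice of f64 values.
///
/// Output layout: byte 0 of every value, then byte 1 of every value, and so
/// on, using little-endian byte order (an assumption of this sketch).
pub fn shuffle_f64(values: &[f64]) -> Vec<u8> {
    let n = values.len();
    let mut out = vec![0u8; n * 8];
    for (i, value) in values.iter().enumerate() {
        for (j, byte) in value.to_le_bytes().iter().enumerate() {
            out[j * n + i] = *byte;
        }
    }
    out
}

/// Invert `shuffle_f64`, recovering the original values exactly.
///
/// Panics if the buffer length is not a multiple of 8.
pub fn unshuffle_f64(bytes: &[u8]) -> Vec<f64> {
    assert!(bytes.len() % 8 == 0, "length must be a multiple of 8");
    let n = bytes.len() / 8;
    (0..n)
        .map(|i| {
            let mut buf = [0u8; 8];
            for j in 0..8 {
                buf[j] = bytes[j * n + i];
            }
            f64::from_le_bytes(buf)
        })
        .collect()
}
```

The transform is lossless and changes only the byte order, so it is always paired with a compressor such as Zstandard that does the actual size reduction.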

§Async I/O

mzdata uses synchronous I/O by default, but includes code for some async options:

  • async_partial - Implements trait-level asynchronous versions of the spectrum reading traits, with implementations for mzML, MGF, and Thermo RAW files using tokio. It does not enable the tokio/fs module, which carries additional requirements that are not compatible with all platforms.
  • async - Enables async_partial and tokio/fs.

§PROXI

mzdata includes PROXI clients for fetching spectra from supporting servers on the internet using USIs.

  • proxi - Provides a synchronous client in mzdata::io::proxi and adds mzdata::io::usi::USI::download_spectrum_blocking
  • async-proxi - Provides an asynchronous client in mzdata::io::proxi and adds mzdata::io::usi::USI::download_spectrum_async
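
A USI (Universal Spectrum Identifier) is a colon-delimited string of the form mzspec:collection:run:index type:index, optionally followed by an interpretation. The sketch below splits one into its parts using only the standard library. SimpleUsi and parse_usi are hypothetical names for this illustration, and the parser is deliberately simplified (it assumes the run name contains no colons); use mzdata::io::usi::USI for real work.

```rust
/// Hypothetical, simplified view of a USI; not the mzdata USI type.
#[derive(Debug, PartialEq, Eq)]
pub struct SimpleUsi {
    pub collection: String,
    pub run: String,
    pub index_type: String,
    pub index: String,
    pub interpretation: Option<String>,
}

/// Split a USI string into its colon-delimited fields.
///
/// Simplifying assumption: the run name contains no colons. Anything after
/// the fifth colon is kept verbatim as the interpretation.
pub fn parse_usi(text: &str) -> Result<SimpleUsi, String> {
    let mut parts = text.splitn(6, ':');
    if parts.next() != Some("mzspec") {
        return Err("a USI must start with `mzspec:`".to_string());
    }
    // Pull the next non-empty field or report which one is missing.
    let mut field = |name: &str| -> Result<String, String> {
        parts
            .next()
            .filter(|s| !s.is_empty())
            .map(str::to_string)
            .ok_or_else(|| format!("missing {}", name))
    };
    let collection = field("collection")?;
    let run = field("run name")?;
    let index_type = field("index type")?;
    let index = field("index")?;
    let interpretation = parts.next().map(str::to_string);
    Ok(SimpleUsi {
        collection,
        run,
        index_type,
        index,
        interpretation,
    })
}
```

For example, mzspec:PXD000561:Adult_Frontalcortex_bRP_Elite_85_f09:scan:17555:VLHPLEGAVVIIFK/2 splits into the collection PXD000561, the run Adult_Frontalcortex_bRP_Elite_85_f09, the index type scan, the index 17555, and the interpretation VLHPLEGAVVIIFK/2.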

§Other

  • serde - Enables serde serialization and deserialization for most library types that aren’t directly connected to an I/O device.
  • parallelism - Enables rayon parallel iterators on a small number of internal operations to speed up some work related to decompression and signal processing. This is unlikely to be noticeable in most cases. More benefit is had by simply processing multiple spectra in parallel using rayon’s bridging adapters.
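
To show the shape of processing multiple items from a sequential iterator in parallel, here is a std-only sketch in which worker threads pull items from a shared iterator behind a mutex. With rayon the same effect is obtained by calling par_bridge() on the iterator; this sketch avoids the dependency so that it stands alone. The function parallel_sum_by is a hypothetical name, and the per-item closure stands in for whatever per-spectrum work is being done.

```rust
use std::sync::Mutex;
use std::thread;

/// Apply `f` to every item of a sequential iterator using `n_workers`
/// threads and return the sum of the results.
///
/// Workers take turns pulling the next item from the shared iterator, so
/// the lock is held only while fetching, not while `f` runs.
pub fn parallel_sum_by<I, T, F>(items: I, n_workers: usize, f: F) -> f64
where
    I: Iterator<Item = T> + Send,
    T: Send,
    F: Fn(T) -> f64 + Sync,
{
    let shared = Mutex::new(items);
    let shared = &shared;
    let f = &f;
    thread::scope(|scope| {
        let handles: Vec<_> = (0..n_workers.max(1))
            .map(|_| {
                scope.spawn(move || {
                    let mut local = 0.0;
                    loop {
                        // The guard is dropped at the end of this statement.
                        let next = shared.lock().unwrap().next();
                        match next {
                            Some(item) => local += f(item),
                            None => break,
                        }
                    }
                    local
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().unwrap())
            .sum::<f64>()
    })
}
```

Because items are handed out in whatever order the workers request them, the order in which results are combined is not deterministic, which matters for floating-point sums and for any output that must follow file order.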

Re-exports§

pub use crate::io::MZReader;
pub use crate::io::MZReaderBuilder;
pub use crate::io::mgf::MGFReader;
pub use crate::io::mgf::MGFWriter;
pub use crate::io::mzml::MzMLReader;
pub use crate::io::mzml::MzMLWriter;
pub use crate::io::mzmlb::MzMLbReader;
pub use crate::io::mzmlb::MzMLbWriter;
pub use crate::io::mzmlb::MzMLbWriterBuilder;
pub use crate::params::Param;
pub use crate::params::ParamList;
pub use crate::spectrum::CentroidSpectrum;
pub use crate::spectrum::RawSpectrum;
pub use crate::spectrum::Spectrum;
pub use mzpeaks;
pub use mzsignal;

Modules§

io
Reading and writing mass spectrometry data file formats and abstractions over them.
meta
Metadata describing mass spectrometry data files and their contents.
params
Elements of controlled vocabularies used to describe mass spectra and their components.
prelude
A set of foundational traits used throughout the library.
spectrum
The data structures and components that represent a mass spectrum and how to access their data.
tutorial
A series of written introductions to specific topics in mzdata.
utils

Macros§

curie
cvmap
delegate_impl_metadata_trait
Delegates the implementation of MSDataFileMetadata to a member. Passing an extra level extended token implements the optional methods.
find_param_method
impl_metadata_trait
Assumes that fields for the non-Option facets of the MSDataFileMetadata implementation are present. Passing an extra level extended token implements the optional methods.
impl_param_described
Implement the ParamDescribed trait for type $t, referencing a params member of type Vec<Param>.
impl_param_described_deferred
Implement the ParamDescribed trait for type $t, referencing a params member that is an Option<Vec<Param>> that will lazily be initialized automatically when it is accessed mutably.
mz_read
A macro that dynamically works out how to get a SpectrumSource-derived object from a path or io::Read + io::Seek boxed object. This is meant to be a convenience for working with a scoped file reader without penalty.
mz_write
A macro that dynamically works out how to get a SpectrumWriter from a path or io::Write boxed object.