§Match those fragments!
Handle mass spectrometry data in Rust. This crate is set up to handle very complex peptides with loads of ambiguity and complexity. It pivots around the CompoundPeptidoformIon, PeptidoformIon, and Peptidoform, which encode the ProForma specification. Additionally, this crate enables reading mgf files, annotating spectra (BU/MD/TD), finding isobaric sequences, aligning peptides, accessing the IMGT germline database, and reading identified peptide files.
§Library features
- Read ProForma sequences (complete specification supported: ‘level 2-ProForma + top-down compliant + cross-linking compliant + glycans compliant + mass spectrum compliant’)
- Generate theoretical fragments with control over the fragmentation model from any ProForma peptidoform/proteoform
- Generate theoretical fragments for chimeric spectra
- Generate theoretical fragments for cross-links (also disulfides)
- Generate theoretical fragments for modifications of unknown position
- Generate peptide backbone (a, b, c, x, y, and z) and satellite ion fragments (w, d, and v)
- Generate glycan fragments (B, Y, and internal fragments)
- Integrated with mzdata for reading raw data files
- Match spectra to the generated fragments
- Align peptides based on mass
- Fast access to the IMGT database of antibody germlines
- Reading of multiple identified peptide file formats (Fasta, MaxQuant, MSFragger, Novor, OPair, Peaks, and Sage)
- Exhaustively fuzz tested for reliability (using cargo-afl)
- Extensive use of uom for compile time unit checking
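To make the backbone fragment item above concrete, here is a minimal std-only sketch of how singly or multiply charged b and y ion m/z values follow from residue masses. It does not use the rustyms API; `residue_mass` and `b_y_ions` are hypothetical helper names, and modifications, other ion series, and neutral losses are ignored.

```rust
/// Monoisotopic residue mass in Da for the 20 canonical amino acids.
fn residue_mass(aa: char) -> Option<f64> {
    Some(match aa {
        'G' => 57.02146, 'A' => 71.03711, 'S' => 87.03203, 'P' => 97.05276,
        'V' => 99.06841, 'T' => 101.04768, 'C' => 103.00919, 'L' => 113.08406,
        'I' => 113.08406, 'N' => 114.04293, 'D' => 115.02694, 'Q' => 128.05858,
        'K' => 128.09496, 'E' => 129.04259, 'M' => 131.04049, 'H' => 137.05891,
        'F' => 147.06841, 'R' => 156.10111, 'Y' => 163.06333, 'W' => 186.07931,
        _ => return None,
    })
}

const WATER: f64 = 18.010565; // monoisotopic mass of H2O in Da
const PROTON: f64 = 1.007276; // mass of a proton in Da

/// Hypothetical helper: returns (b ions, y ions) as m/z values, where
/// index 0 holds b1 and y1 respectively. Returns None for an unknown
/// residue or a charge of zero.
fn b_y_ions(sequence: &str, charge: u32) -> Option<(Vec<f64>, Vec<f64>)> {
    if charge == 0 {
        return None;
    }
    let masses: Vec<f64> = sequence
        .chars()
        .map(residue_mass)
        .collect::<Option<Vec<f64>>>()?;
    let z = f64::from(charge);
    let total: f64 = masses.iter().sum();
    let mut b = Vec::new();
    let mut y = Vec::new();
    let mut prefix = 0.0;
    // Every backbone bond gives one b ion (N-terminal part) and one
    // complementary y ion (C-terminal part plus water).
    for mass in &masses[..masses.len().saturating_sub(1)] {
        prefix += mass;
        b.push((prefix + z * PROTON) / z);
        y.push((total - prefix + WATER + z * PROTON) / z);
    }
    y.reverse(); // the loop produced the longest y ion first
    Some((b, y))
}
```

Calling `b_y_ions("GA", 1)` yields one b ion (glycine plus a proton) and one y ion (alanine plus water plus a proton). The crate itself covers far more cases (ambiguous modifications, cross-links, glycans), so treat this only as an illustration of the arithmetic.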
§Example usage
use rustyms::{*, model::*, system::{usize::Charge, e}};
// Open example raw data (this is the built in mgf reader, look into mzdata for more advanced raw file readers)
let spectrum = rawfile::mgf::open(raw_file_path)?;
// Parse the given ProForma definition
let peptide = CompoundPeptidoformIon::pro_forma("[Gln->pyro-Glu]-QVQEVSERTHGGNFD", None)?;
// Generate theoretical fragments for this peptide given EThcD fragmentation
let model = FragmentationModel::ethcd();
let fragments = peptide.generate_theoretical_fragments(Charge::new::<e>(2), model);
let parameters = MatchingParameters::default();
// Annotate the raw data with the theoretical fragments
let annotated = spectrum[0].annotate(peptide, &fragments, &parameters, MassMode::Monoisotopic);
// Calculate a peak false discovery rate for this annotation
let (fdr, _) = annotated.fdr(&fragments, &parameters, MassMode::Monoisotopic);
// This is the incorrect sequence for this spectrum so the peak FDR will indicate this
assert!(fdr.peaks_sigma() > 2.0);
use rustyms::{*, align::*};
// Check how this peptide compares to a similar peptide (using the feature `align`)
let first_peptide = Peptidoform::pro_forma("IVQEVT", None)?.into_simple_linear().unwrap();
let second_peptide = Peptidoform::pro_forma("LVQVET", None)?.into_simple_linear().unwrap();
// Align the two peptides using mass based alignment
// IVQEVT A
// LVQVET B
// ─  ╶╴
let alignment = align::<4, SimpleLinear, SimpleLinear>(
&first_peptide,
&second_peptide,
AlignScoring::default(),
AlignType::GLOBAL);
// Calculate some more statistics on this alignment
let stats = alignment.stats();
assert_eq!(stats.mass_similar, 6); // 6 out of the 6 positions are mass similar
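The alignment above reports all six positions as mass similar because I and L share a mass and the EV/VE swap leaves the summed mass unchanged. A minimal std-only sketch of that idea follows. It is not the crate's alignment algorithm; `residue_mass`, `block_mass`, and `mass_similar` are hypothetical names, and the residue table only covers the letters used here.

```rust
/// Monoisotopic residue masses (Da) for the few residues used in this sketch.
fn residue_mass(aa: char) -> Option<f64> {
    match aa {
        'G' => Some(57.02146),
        'V' => Some(99.06841),
        'T' => Some(101.04768),
        'L' | 'I' => Some(113.08406),
        'N' => Some(114.04293),
        'Q' => Some(128.05858),
        'E' => Some(129.04259),
        _ => None,
    }
}

/// Summed residue mass of a block of amino acids, None on an unknown letter.
fn block_mass(block: &str) -> Option<f64> {
    block.chars().map(residue_mass).sum()
}

/// Two blocks count as mass similar when their summed masses differ by at
/// most `tolerance_da` (an absolute tolerance in Da, chosen for simplicity).
fn mass_similar(a: &str, b: &str, tolerance_da: f64) -> Option<bool> {
    Some((block_mass(a)? - block_mass(b)?).abs() <= tolerance_da)
}
```

With a tolerance of 0.001 Da, `mass_similar("I", "L", 0.001)` and `mass_similar("EV", "VE", 0.001)` both hold, as does `mass_similar("N", "GG", 0.001)`, which shows why comparing blocks of unequal length can matter in a mass based alignment.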
§Compilation features
Rustyms ties together multiple smaller modules into one cohesive structure. It has multiple features which allow you to slim it down if needed (all are enabled by default).
- align - gives access to mass based alignment of peptides.
- identification - gives access to methods reading many different identified peptide formats.
- imgt - enables access to the IMGT database of antibody germline sequences, with annotations.
- isotopes - gives access to generation of an averagine model for isotopes, also enables two additional dependencies.
- rand - allows the generation of random peptides.
- rayon - enables parallel iterators using rayon, mostly for imgt but also in consecutive align.
- mzdata - enables integration with mzdata, which has more advanced raw file support.
- glycan-render - enables rendering glycans and glycan fragments to SVGs.
- glycan-render-bitmap - enables rendering glycans to bitmaps, by enabling the optional dependencies zeno and swash.
Re-exports§
pub use aminoacid::AminoAcid;
pub use aminoacid::IsAminoAcid;
pub use fragment::Fragment;
pub use model::FragmentationModel;
pub use modification::CrossLinkName;
pub use modification::Modification;
pub use peptidoform::CompoundPeptidoformIon;
pub use peptidoform::Peptidoform;
pub use peptidoform::PeptidoformIon;
pub use spectrum::AnnotatableSpectrum;
pub use spectrum::AnnotatedSpectrum;
pub use spectrum::RawSpectrum;
pub use peptidoform::*;
Modules§
- align
- Only available with feature align. Code to make alignments of two peptides based on mass mistakes, and genetic information.
- aminoacid
- Contains logic surrounding amino acids, see AminoAcid for the main structure.
- error
- Contains the definition for errors with all additional data that is needed to generate nice error messages
- fragment
- Handle fragment related issues, access provided if you want to dive deeply into fragments in your own code.
- glycan
- Handle glycan related issues, access provided if you want to work with glycans on your own.
- identification
- Only available with feature identification. Read in the annotations from peptide identification sources
- imgt
- Only available with feature imgt. This crate handles parsing the IMGT LIGM-DB database into structures compatible with rustyms. It additionally stores all regions and annotations. There are two main ways of selecting germline(s): specified by name with get_germline, or by building a query over the data with Selection.
- model
- Handle parameters for fragmentation and matching
- modification
- Handle modification related issues, access provided if you want to dive deeply into modifications in your own code.
- ontologies
- The available ontologies
- peptidoform
- Module concerned with peptide related processing
- placement_rule
- Rules regarding the placement of modifications
- rawfile
- Handling raw files
- spectrum
- Spectrum related code
- system
- The measurement system used in this crate. A redefinition of the important SI units for them to be stored in a more sensible base unit for MS purposes.
Macros§
- Q
- Macro to implement quantity type aliases for a specific system of units and value storage type.
- molecular_formula
- Easily define molecular formulas using the following syntax: <element> <num> or [<isotope> <element> <num>]. The spaces are required by the Rust compiler.
Structs§
- CheckedAminoAcid
- A checked amino acid. This wraps an AminoAcid to keep track of the maximal complexity of the underlying amino acid. Any marked as SemiAmbiguous or higher can contain B/Z (ambiguous asparagine/glutamine) while any marked as UnAmbiguous can only contain amino acids with a single defined chemical formula.
- DiagnosticIon
- A diagnostic ion, defined in M (not MH+) chemical formula
- MolecularCharge
- A selection of ions that together define the charge of a peptide
- MolecularFormula
- A molecular formula, a selection of elements of specified isotopes together forming a structure
- Multi
- A collection of potentially multiple items of the generic type. It is used to easily combine several of these Multi structs into all possible combinations.
- Protease
- A protease defined by its ability to cut at any site identified by the right amino acids at the N and C terminal. Each position is identified by an option, where a none means that there is no specificity at this position. If there is a specificity at a certain position, any amino acid that is contained in the set is allowed (see crate::CheckedAminoAcid::canonical_identical).
- SequenceElement
- One block in a sequence, meaning an amino acid and its accompanying modifications
Enums§
- AmbiguousLabel
- Keep track of what ambiguous option is used
- Element
- The elements (and electrons)
- MassMode
- The mode of mass to use
- NeutralLoss
- All possible neutral losses
- SequencePosition
- A position on a sequence
- Tolerance
- A tolerance around a given unit for searching purposes
Constants§
- COMMON_ELEMENT_PARSE_LIST
- Most common elements selected for abundance in biological systems as well as unambiguous parsing. Sorted so that single characters come after two character element symbols (needed for greedy parsing).
- ELEMENT_PARSE_LIST
- All elements sorted so that single characters come after two character element symbols (needed for greedy parsing)
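The ordering requirement on these lists can be illustrated with a small std-only sketch of greedy symbol parsing. This is not the crate's parser; `parse_symbols` and the short symbol lists are hypothetical, and element counts are left out to keep the point about ordering visible.

```rust
/// Greedily split `input` into element symbols, always taking the first
/// entry of `list` that matches the start of the remaining text.
/// Returns None as soon as no symbol matches.
fn parse_symbols<'a>(input: &str, list: &[&'a str]) -> Option<Vec<&'a str>> {
    let mut rest = input;
    let mut symbols = Vec::new();
    while !rest.is_empty() {
        // Empty entries are skipped so that the loop always makes progress.
        let symbol = list
            .iter()
            .find(|s| !s.is_empty() && rest.starts_with(**s))?;
        symbols.push(*symbol);
        rest = &rest[symbol.len()..];
    }
    Some(symbols)
}
```

With the list `["Na", "Cl", "C", "N"]` the input `"NaCl"` parses as sodium and chlorine. With the single character symbols first, `["C", "N", "Na", "Cl"]`, the parser takes `N`, is left with `"aCl"`, and fails, which is why two character symbols have to be tried before single character ones.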
Statics§
- ELEMENTAL_DATA
- Get the elemental data
Traits§
- Chemical
- Any item that has a clearly defined single molecular formula
- MultiChemical
- Any item that has a number of potential chemical formulas
- WithinTolerance
- Check if two values are within the specified tolerance from each other.
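As an illustration of what a tolerance check of this kind involves, here is a minimal std-only sketch that supports both a relative (ppm) and an absolute window. The enum `SketchTolerance` and its `within` method are hypothetical and are not the crate's Tolerance type or WithinTolerance trait, which work on unit-checked quantities.

```rust
/// Hypothetical stand-in for a tolerance: either relative in parts per
/// million of the reference value, or absolute in the same unit as the values.
#[derive(Debug, Clone, Copy)]
enum SketchTolerance {
    Ppm(f64),
    Absolute(f64),
}

impl SketchTolerance {
    /// Check whether `observed` lies within the tolerance around `reference`.
    fn within(self, reference: f64, observed: f64) -> bool {
        let delta = (observed - reference).abs();
        match self {
            // 1 ppm corresponds to one millionth of the reference value.
            Self::Ppm(ppm) => delta <= reference.abs() * ppm * 1e-6,
            Self::Absolute(max) => delta <= max,
        }
    }
}
```

At a reference of 1000 and a tolerance of 10 ppm, the accepted window is 0.01 wide on each side, so 1000.005 matches and 1000.02 does not. A relative window scales with the mass, which is usually what is wanted for high resolution instruments.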
Functions§
- building_blocks
- Get the possible building blocks for sequences based on the given modifications. Useful for any automated sequence generation, like isobaric set generation or de novo sequencing. The result is for each location (N term, center, C term) the list of all possible building blocks with its mass, sorted on mass.
- find_formulas
- Find the molecular formulas that fit the mass given the tolerance using only the provided elements.
- find_isobaric_sets
- Find the isobaric sets for the given mass with the given modifications and ppm error. The modifications are placed on any location they are allowed based on the given placement rules, so using any modifications which provide those is advised. If the provided Peptidoform has multiple formulas, it uses the formula with the lowest monoisotopic mass.
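The idea behind an isobaric set search can be sketched with std only: enumerate residue compositions whose summed mass plus water falls within a ppm window of the target mass. This is not the crate's implementation; `isobaric_compositions`, `search`, and the reduced residue table are hypothetical, modifications and placement rules are ignored, and each composition is reported once (in ascending mass order) rather than as every permutation.

```rust
/// A reduced residue table (monoisotopic, Da), sorted by ascending mass.
/// Leucine also stands in for isoleucine, which has the same mass.
const RESIDUES: [(char, f64); 9] = [
    ('G', 57.02146), ('A', 71.03711), ('S', 87.03203),
    ('V', 99.06841), ('L', 113.08406), ('N', 114.04293),
    ('Q', 128.05858), ('K', 128.09496), ('E', 129.04259),
];
const WATER: f64 = 18.010565; // monoisotopic mass of H2O in Da

/// All residue compositions whose peptide mass is within `ppm` of `target`.
fn isobaric_compositions(target: f64, ppm: f64) -> Vec<String> {
    let window = target * ppm * 1e-6;
    let mut found = Vec::new();
    let mut current = String::new();
    search(target - WATER, window, 0, &mut current, &mut found);
    found
}

/// Depth first search over residues; `start` keeps the indices
/// non-decreasing so that each composition is visited only once.
fn search(
    remaining: f64,
    window: f64,
    start: usize,
    current: &mut String,
    found: &mut Vec<String>,
) {
    if remaining.abs() <= window && !current.is_empty() {
        found.push(current.clone());
    }
    for (index, (aa, mass)) in RESIDUES.iter().enumerate().skip(start) {
        if remaining - mass < -window {
            break; // the table is sorted, so heavier residues overshoot too
        }
        current.push(*aa);
        search(remaining - mass, window, index, current, found);
        current.pop();
    }
}
```

For a target of 132.053485 Da at 10 ppm this finds the compositions GG and N, and for 146.069145 Da it finds GA and Q while rejecting K, whose mass differs by about 0.036 Da.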