Crate molecular_formulas

§Molecular formulas

A Rust crate for parsing, manipulating, and analyzing molecular formulas.

It has been validated against the 123M+ compounds in PubChem (99.46% mass accuracy) and fuzzed for over 10 billion iterations (see the fuzz crate) to ensure that it handles arbitrary textual input.

§Features

  • Standard Parsing: Supports nested groups (e.g., C6H5(CH2)2OH), hydrates, salts, isotopes (e.g., [13C]H4 or ¹³CH₄), and flexible charge notation (e.g., Fe+3, [OH]-).
  • Modular AST: The internal representation allows selecting integer types (u8, u16, u32) and enabling or disabling support for “Residuals” (wildcards) via types like MolecularFormula vs ResidualFormula. If something is missing, make a PR and we can modularly add it!
  • Chemical Properties:
    • Check Hill System sorting conformity.
    • Identify chemical classes (noble gas compounds).
    • Charge: Calculate and inspect total charge.
  • Composition Analysis:
    • Isotopes: Check for presence of specific isotopes.
    • Mixtures: Handle and inspect molecular mixtures.
  • Mass Calculations:
    • Monoisotopic Mass (Isotopologue mass).
    • Average Molar Mass.
    • Mass over Charge (m/z) ratio.
  • Validation: Tested against the entire PubChem compound database (123M+ entries).
  • Ecosystem:
    • Built on elements-rs for accurate element and isotope data.
    • Uses thiserror for ergonomic error handling.
    • Optional serde support for serialization/deserialization.
  • Embedded Compatible: #![no_std] capable (requires alloc), making it suitable for WASM and embedded applications.

§Installation

Add this to your Cargo.toml:

[dependencies]
molecular-formulas = "0.1.2"

§Usage

Here are some examples of how to use the library:

§Basic Parsing and Properties

use std::str::FromStr;
use molecular_formulas::prelude::*;

// efficient u16 counters and i16 charge
// Note: You can use u32 or u64 for larger molecules.
let formula: ChemicalFormula = ChemicalFormula::from_str("C6H12O6").unwrap();

println!("Formula: {}", formula);
println!("Monoisotopic Mass: {} Da", formula.isotopologue_mass());
println!("Average Mass: {} Da", formula.molar_mass());
println!("Charge: {}", formula.charge());
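
The two mass figures printed above differ only in which atomic masses are summed. As a rough illustration, independent of this crate's API, the sketch below computes both for glucose from a small hard-coded table. The `ELEMENTS` table and the `formula_mass` helper are hypothetical names introduced here, and the masses are rounded reference values, not the data shipped by elements-rs.

```rust
/// Per-element data: (symbol, monoisotopic mass, standard atomic weight), in Da.
/// Rounded reference values hard-coded for this sketch only.
static ELEMENTS: [(&str, f64, f64); 3] = [
    ("C", 12.000000, 12.011),
    ("H", 1.007825, 1.008),
    ("O", 15.994915, 15.999),
];

/// Sums `count * mass` over a composition given as (symbol, count) pairs.
/// `monoisotopic` selects the most-abundant-isotope mass instead of the
/// abundance-weighted average. Returns `None` if a symbol is not in the table.
fn formula_mass(composition: &[(&str, u32)], monoisotopic: bool) -> Option<f64> {
    let mut total = 0.0;
    for (symbol, count) in composition {
        let (_, mono, avg) = ELEMENTS.iter().find(|(s, _, _)| s == symbol)?;
        let mass = if monoisotopic { *mono } else { *avg };
        total += f64::from(*count) * mass;
    }
    Some(total)
}
```

With these rounded inputs, `formula_mass(&[("C", 6), ("H", 12), ("O", 6)], true)` is close to 180.063 Da, while passing `false` gives about 180.156 Da.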

§Complex Formulas, Hydrates and Ions

The parser handles parentheses, square brackets, hydrates (dot-separated components), and charges.

use std::str::FromStr;
use molecular_formulas::prelude::*;

// Copper(II) sulfate pentahydrate
let hydrate: ChemicalFormula = ChemicalFormula::from_str("CuSO4.5H2O").unwrap();
assert_eq!(hydrate.to_string(), "CuSO₄.5H₂O");

// An ion with unicode charge notation
let ion: ChemicalFormula = ChemicalFormula::from_str("SO₄²⁻").unwrap();
assert_eq!(ion.charge(), -2.0);

// Recursively nested groups
let complex: ChemicalFormula = ChemicalFormula::from_str("[Co(NH3)5Cl]Cl2").unwrap();

§Isotopes

You can specify isotopes using standard notation (superscripts or square brackets).

use std::str::FromStr;
use molecular_formulas::prelude::*;

// Carbon-13 labeled methane
let labeled: ChemicalFormula = ChemicalFormula::from_str("[13C]H4").unwrap();
// or
let labeled_unicode: ChemicalFormula = ChemicalFormula::from_str("¹³CH₄").unwrap();

assert_eq!(labeled, labeled_unicode);

// Check if it contains specific isotopes
let c13 = Isotope::try_from((Element::C, 13_u16)).unwrap();
assert!(labeled.contains_isotope(c13));
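
The equivalence of "[13C]H4" and "¹³CH₄" rests on reading Unicode superscript and subscript digits as ordinary numbers. A minimal sketch of that conversion follows. `script_digit` and `script_number` are hypothetical helpers, not part of this crate, and the crate's tokenizer may work differently.

```rust
/// Maps a Unicode superscript or subscript digit to its numeric value.
fn script_digit(c: char) -> Option<u32> {
    match c {
        '\u{2070}' => Some(0), // superscript zero
        '\u{00B9}' => Some(1), // superscript one
        '\u{00B2}' => Some(2), // superscript two
        '\u{00B3}' => Some(3), // superscript three
        // U+2074..=U+2079 are the superscripts four to nine.
        '\u{2074}'..='\u{2079}' => Some(c as u32 - 0x2074 + 4),
        // U+2080..=U+2089 are the subscripts zero to nine.
        '\u{2080}'..='\u{2089}' => Some(c as u32 - 0x2080),
        _ => None,
    }
}

/// Reads a run of script digits as a decimal number.
/// Returns `None` for an empty string, a non-script character, or overflow.
/// For brevity, this sketch does not reject runs that mix both scripts.
fn script_number(s: &str) -> Option<u32> {
    if s.is_empty() {
        return None;
    }
    s.chars().try_fold(0u32, |acc, c| {
        acc.checked_mul(10)?.checked_add(script_digit(c)?)
    })
}
```

Here the superscript pair in "¹³CH₄" reads as the mass number 13 and the trailing subscript as the count 4.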

§OCR-Resistant Parsing

The parser is designed to be robust against common OCR errors and unicode variations, handling multiple types of hyphens, dashes, and dots seamlessly.

use std::str::FromStr;
use molecular_formulas::prelude::*;

// Standard notation
let f1: ChemicalFormula = ChemicalFormula::from_str("CuSO4.5H2O").unwrap();
// OCR error: '。' (Halfwidth Ideographic Full Stop) instead of '.'
let f2: ChemicalFormula = ChemicalFormula::from_str("CuSO4。5H2O").unwrap();
assert_eq!(f1, f2);

// Standard charge
let c1: ChemicalFormula = ChemicalFormula::from_str("SO4-2").unwrap();
// OCR error: Using En Dash '–' instead of Minus '-'
let c2: ChemicalFormula = ChemicalFormula::from_str("SO4–2").unwrap();
assert_eq!(c1, c2);
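
One way to picture this tolerance is as a normalization pass that folds look-alike characters into their ASCII forms before tokenizing. The sketch below is only an illustration of that idea. `normalize_punctuation` is a hypothetical helper, the character list is a guess at common variants, and the crate may well accept these characters directly in its tokenizer instead.

```rust
/// Maps common dash and dot look-alikes to ASCII '-' and '.'.
/// All other characters pass through unchanged.
fn normalize_punctuation(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            // hyphen, non-breaking hyphen, figure dash, en dash, em dash, minus sign
            '\u{2010}' | '\u{2011}' | '\u{2012}' | '\u{2013}' | '\u{2014}' | '\u{2212}' => '-',
            // middle dot, bullet, dot operator, ideographic full stop (full and half width)
            '\u{00B7}' | '\u{2022}' | '\u{22C5}' | '\u{3002}' | '\u{FF61}' => '.',
            other => other,
        })
        .collect()
}
```

After such a pass, the en-dash and ideographic-stop inputs above become byte-identical to their standard spellings.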

§InChI Formula Validation

The library supports strictly validated InChI-style formulas, which enforce Hill notation sorting (C first, H second, then alphabetical).

use std::str::FromStr;
use molecular_formulas::errors::ParserError;
use molecular_formulas::prelude::*;

// Valid Hill-sorted formula
let valid: InChIFormula = InChIFormula::from_str("C2H5O").unwrap();

// Invalid: Not Hill-sorted (O comes before H)
let invalid: Result<InChIFormula, _> = InChIFormula::from_str("C2OH5");
assert_eq!(invalid.unwrap_err(), ParserError::NotHillOrdered);
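
The ordering rule itself is small enough to sketch on bare element symbols. The helpers below, `hill_rank` and `is_hill_sorted`, are hypothetical and are not the crate's `is_hill_sorted_pair`. They assume each symbol appears once, and that a formula without carbon is ordered purely alphabetically, as the Hill system prescribes.

```rust
/// Sort key for a symbol under the Hill system.
/// `has_carbon` states whether the whole formula contains carbon: only then
/// do C and H jump ahead of the alphabetical remainder.
fn hill_rank(symbol: &str, has_carbon: bool) -> (u8, &str) {
    match (has_carbon, symbol) {
        (true, "C") => (0, symbol),
        (true, "H") => (1, symbol),
        _ => (2, symbol),
    }
}

/// Checks that a sequence of distinct element symbols is in Hill order.
/// Empty and single-element sequences are trivially sorted.
fn is_hill_sorted(symbols: &[&str]) -> bool {
    let has_carbon = symbols.contains(&"C");
    symbols
        .windows(2)
        .all(|pair| hill_rank(pair[0], has_carbon) < hill_rank(pair[1], has_carbon))
}
```

Applied to the two inputs above, the symbol sequence C, H, O passes and C, O, H fails, because hydrogen must directly follow carbon.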

§Validation against PubChem

This library is tested against the PubChem database, which contains over 123 million compounds. This ensures correctness when parsing real-world chemical data.

We validate both ChemicalFormula (mass analysis) and InChIFormula (the formula layer of InChI).

Specifically, we download the CID-Mass.gz and CID-InChI-Key.gz files from the Extras FTP directory of PubChem.

You can run the validation suites yourself:

# Validate Mass Calculation (ChemicalFormula), takes about 55 seconds,
# most of which is just I/O time
cargo test --release --test test_pubchem_validation -- --ignored --nocapture

# Validate InChI Parsing (InChIFormula), takes about 45 seconds,
# most of which is just I/O time
cargo test --release --test test_pubchem_inchi_validation -- --ignored --nocapture

§Validation Results (January 2026)

Metric                          Value
Total processed                 123,455,852
Total time required             58.68 s
Processing speed                2,103,788 cmp/s
Exact matches                   66,465
Within tolerance                122,720,777
Mismatches                      668,610
  - Ion mismatches              106,525
  - Neutral mismatches          562,085
Mass accuracy (within 0.001)    99.46%

Note: The remaining ~0.5% mismatches are largely attributed to inconsistencies or errors in the source PubChem records rather than parsing errors.
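
The accuracy figure can be reproduced from the counts reported above: exact matches and within-tolerance matches together, divided by the total. The short sketch below only restates that arithmetic. `mass_accuracy_percent` is a hypothetical helper, not part of the test suite.

```rust
/// Percentage of records whose computed mass matched exactly or within tolerance.
fn mass_accuracy_percent(exact: u64, within_tolerance: u64, total: u64) -> f64 {
    (exact + within_tolerance) as f64 / total as f64 * 100.0
}
```

Calling it with 66,465 exact matches, 122,720,777 within-tolerance matches, and 123,455,852 total records gives a value that rounds to 99.46%.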

You can find a report of the worst mismatches in worst_mismatches.md.

§Benchmarks

This crate includes benchmarks to measure parsing performance for both InChIFormula and ChemicalFormula.

To run the benchmarks:

cargo bench

Current benchmarks cover:

  • InChIFormula: Parsing a large mixture string with 76 components (~3.75 µs).
  • ChemicalFormula: Parsing a complex formula with unicode subscripts, charges, and multiple elements (C₃₉₀H₄₀₄B₂Br₂ClCs₂F₁₁K₂MnN₂₆Na₂O₁₀₀OsPdS₃W₂³⁻) (~801 ns).

§Current Limitations

The parser does not currently support the following, which may be added in the future:

  • Fractional counts (e.g., C1.5H3).

§License

This project is licensed under the MIT License. See the LICENSE file for details.

Re-exports§

pub use molecular_formula::*;
pub use nodes::*;
pub use parsable::*;

Modules§

errors
Submodule defining the error enumeration for errors which might occur when working with molecular formulas.
molecular_formula
Properties that can be computed from molecular formulas.
nodes
Submodule defining an ExtensionTree trait, including its associated types regarding parsing. For instance, the ChargeExtensionTree trait defines the tokens, subtokens and allowed characters for formulas which can contain charges, which is common in several contexts but forbidden in InChI strings.
parsable
Submodule defining a parsable entity.
prelude
Prelude module re-exporting commonly used items.

Traits§

ChargedMolecularTree
Trait for molecular trees which can hold a charge.
MolecularTree
Trait for computing various molecular properties.

Functions§

is_hill_sorted_pair
Helper to check if two elements are in Hill order.