Molecular formulas
A Rust crate for parsing, manipulating, and analyzing molecular formulas.
It is validated against more than 123 million compounds from PubChem (99.46% of computed masses fall within tolerance) and has been fuzzed for over 10 billion iterations (see the fuzz crate) to make sure it handles arbitrary textual input.
Features
- Standard Parsing: Supports nested groups (e.g., C6H5(CH2)2OH), hydrates, salts, isotopes (e.g., [13C]H4 or ¹³CH₄), and flexible charge notation (e.g., Fe+3, [OH]-).
- Modular AST: The internal representation allows selecting integer types (u8, u16, u32) and enabling or disabling support for "Residuals" (wildcards) via types like MolecularFormula vs ResidualFormula. If something is missing, make a PR and we can modularly add it!
- Chemical Properties:
  - Check Hill System sorting conformity.
  - Identify chemical classes (noble gas compounds).
  - Charge: Calculate and inspect total charge.
- Composition Analysis:
  - Isotopes: Check for presence of specific isotopes.
  - Mixtures: Handle and inspect molecular mixtures.
- Mass Calculations:
  - Monoisotopic Mass (Isotopologue mass).
  - Average Molar Mass.
  - Mass over Charge (m/z) ratio.
- Validation: Tested against the entire PubChem compound database (123M+ entries).
- Ecosystem:
  - Built on elements-rs for accurate element and isotope data.
  - Uses thiserror for ergonomic error handling.
  - Optional serde support for serialization/deserialization.
- Embedded Compatible: #![no_std] capable (requires alloc), making it suitable for WASM and embedded applications.
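As a rough illustration of the mass over charge feature listed above, the sketch below computes m/z from a neutral mass and a signed charge. It is a standalone example, not this crate's API: the function name `mass_over_charge` is hypothetical, and the electron-mass correction is an assumption about how one might define the ion mass (the crate may or may not apply it).

```rust
/// Electron rest mass in daltons (rounded CODATA value).
const ELECTRON_MASS: f64 = 0.000_548_579_909;

/// Hypothetical helper: m/z of an ion, given the mass of its atoms as if
/// neutral and its signed charge. Returns `None` for a neutral species.
/// Assumption: a cation of charge +z has lost z electrons and an anion of
/// charge -z has gained z electrons.
fn mass_over_charge(neutral_mass: f64, charge: i16) -> Option<f64> {
    if charge == 0 {
        return None;
    }
    let z = f64::from(charge);
    Some((neutral_mass - z * ELECTRON_MASS) / z.abs())
}
```

For example, `mass_over_charge(100.0, 2)` halves the mass after removing two electron masses, while a charge of 0 yields `None` because m/z is undefined for neutral species.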
Installation
Add this to your Cargo.toml:
[dependencies]
= "0.1.2"
Usage
Here are some examples of how to use the library:
Basic Parsing and Properties
use FromStr;
use *;
// efficient u16 counters and i16 charge
// Note: You can use u32 or u64 for larger molecules.
let formula: ChemicalFormula = from_str.unwrap;
println!;
println!;
println!;
println!;
Complex Formulas, Hydrates and Ions
The parser handles parentheses, brackets, hydrates (dots), and charges with ease.
use FromStr;
use *;
// Copper(II) sulfate pentahydrate
let hydrate: ChemicalFormula = from_str.unwrap;
assert_eq!;
// An ion with unicode charge notation
let ion: ChemicalFormula = from_str.unwrap;
assert_eq!;
// Recursively nested groups
let complex: ChemicalFormula = from_str.unwrap;
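To make the idea of nested groups and hydrates concrete, here is a minimal recursive-descent sketch that counts atoms. It is not the crate's parser: `count_atoms`, `parse_group`, and `read_count` are hypothetical names, input is assumed to be well-formed ASCII, and isotopes, charges, and error reporting are deliberately left out.

```rust
use std::collections::BTreeMap;

/// Element symbol -> atom count.
type Counts = BTreeMap<String, u32>;

/// Reads an optional decimal count at `*pos`; a missing count means 1.
fn read_count(chars: &[char], pos: &mut usize) -> u32 {
    let mut n = 0u32;
    let mut seen = false;
    while *pos < chars.len() && chars[*pos].is_ascii_digit() {
        n = n * 10 + chars[*pos].to_digit(10).unwrap();
        *pos += 1;
        seen = true;
    }
    if seen { n } else { 1 }
}

/// Parses elements and bracketed groups until a closing bracket or end of input.
fn parse_group(chars: &[char], pos: &mut usize) -> Counts {
    let mut total = Counts::new();
    while *pos < chars.len() {
        let c = chars[*pos];
        if c == '(' || c == '[' {
            *pos += 1;
            let inner = parse_group(chars, pos); // recurse into the group
            *pos += 1; // skip the closing bracket (assumes well-formed input)
            let mult = read_count(chars, pos);
            for (el, n) in inner {
                *total.entry(el).or_insert(0) += n * mult;
            }
        } else if c.is_ascii_uppercase() {
            let mut symbol = c.to_string();
            *pos += 1;
            while *pos < chars.len() && chars[*pos].is_ascii_lowercase() {
                symbol.push(chars[*pos]);
                *pos += 1;
            }
            let n = read_count(chars, pos);
            *total.entry(symbol).or_insert(0) += n;
        } else {
            break; // closing bracket or unsupported character
        }
    }
    total
}

/// Counts atoms; dot-separated parts may carry a leading multiplier (hydrates).
fn count_atoms(formula: &str) -> Counts {
    let mut total = Counts::new();
    for part in formula.split('.') {
        let chars: Vec<char> = part.chars().collect();
        let mut pos = 0;
        let mult = read_count(&chars, &mut pos);
        for (el, n) in parse_group(&chars, &mut pos) {
            *total.entry(el).or_insert(0) += n * mult;
        }
    }
    total
}
```

With this sketch, `count_atoms("CuSO4.5H2O")` reports nine oxygen atoms (four from the sulfate, five from the water molecules), and `count_atoms("C6H5(CH2)2OH")` reports eight carbon atoms.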
Isotopes
You can specify isotopes using standard notation (superscripts or square brackets).
use FromStr;
use *;
// Carbon-13 labeled methane
let labeled: ChemicalFormula = from_str.unwrap;
// or
let labeled_unicode: ChemicalFormula = from_str.unwrap;
assert_eq!;
// Check if it contains specific isotopes
let c13 = try_from.unwrap;
assert!;
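The effect of an isotope label on the isotopologue mass can be checked by hand. The sketch below is independent of this crate: `isotopologue_mass` is a hypothetical helper, and the isotope masses are hard-coded values rounded to six decimals rather than data taken from elements-rs.

```rust
/// Approximate isotope masses in daltons, rounded to six decimals.
const MASS_C12: f64 = 12.000000; // exact by definition of the dalton
const MASS_C13: f64 = 13.003355;
const MASS_H1: f64 = 1.007825;

/// Hypothetical helper: sums `mass * count` over (isotope mass, count) pairs.
fn isotopologue_mass(atoms: &[(f64, u32)]) -> f64 {
    atoms.iter().map(|&(mass, n)| mass * f64::from(n)).sum()
}

/// Mass of ordinary methane, CH4 with carbon-12.
fn methane_mass() -> f64 {
    isotopologue_mass(&[(MASS_C12, 1), (MASS_H1, 4)])
}

/// Mass of carbon-13 labeled methane, [13C]H4.
fn labeled_methane_mass() -> f64 {
    isotopologue_mass(&[(MASS_C13, 1), (MASS_H1, 4)])
}
```

Under these rounded values the labeled molecule is heavier by the difference between the two carbon isotopes, about 1.003355 Da.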
OCR-Resistant Parsing
The parser is designed to be robust against common OCR errors and unicode variations, handling multiple types of hyphens, dashes, and dots seamlessly.
use FromStr;
use *;
// Standard notation
let f1: ChemicalFormula = from_str.unwrap;
// OCR error: '。' (Halfwidth Ideographic Full Stop) instead of '.'
let f2: ChemicalFormula = from_str.unwrap;
assert_eq!;
// Standard charge
let c1: ChemicalFormula = from_str.unwrap;
// OCR error: Using En Dash '–' instead of Minus '-'
let c2: ChemicalFormula = from_str.unwrap;
assert_eq!;
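One simple way to get this kind of tolerance is to normalize look-alike characters before parsing. The sketch below shows that idea only; it is an assumption about a possible approach, not a description of how this crate handles such input, and the character set chosen here is illustrative rather than exhaustive.

```rust
/// Hypothetical pre-processing step: maps common dash and dot look-alikes
/// to ASCII '-' and '.' and leaves every other character unchanged.
fn normalize(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            // hyphen, non-breaking hyphen, figure dash, en dash, em dash, minus sign
            '\u{2010}' | '\u{2011}' | '\u{2012}' | '\u{2013}' | '\u{2014}' | '\u{2212}' => '-',
            // middle dot, bullet, dot operator, ideographic full stop,
            // halfwidth ideographic full stop
            '\u{00B7}' | '\u{2022}' | '\u{22C5}' | '\u{3002}' | '\u{FF61}' => '.',
            other => other,
        })
        .collect()
}
```

After normalization, a hydrate written with an ideographic full stop and a charge written with an en dash compare equal to their ASCII spellings, so a downstream parser only needs to handle one form.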
InChI Formula Validation
The library supports strictly validated InChI-style formulas, which enforce Hill notation sorting (C first, H second, then alphabetical).
use FromStr;
use ParserError;
use *;
// Valid Hill-sorted formula
let valid: InChIFormula = from_str.unwrap;
// Invalid: Not Hill-sorted (O comes before H)
let invalid: = from_str;
assert_eq!;
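The Hill ordering rule itself is easy to express as a sort key. The following sketch checks a sequence of element symbols and is separate from InChIFormula: `hill_key` and `is_hill_sorted` are hypothetical names, and the carbon-free case (purely alphabetical, including H) follows the standard Hill system, which we assume applies here as well.

```rust
/// Sort key under the Hill system: with carbon present, C ranks first and
/// H second; every other symbol (or every symbol when there is no carbon)
/// is ordered alphabetically.
fn hill_key(symbol: &str, has_carbon: bool) -> (u8, &str) {
    match (has_carbon, symbol) {
        (true, "C") => (0, symbol),
        (true, "H") => (1, symbol),
        _ => (2, symbol),
    }
}

/// Returns true if the symbols appear in strict Hill order (no repeats).
fn is_hill_sorted(symbols: &[&str]) -> bool {
    let has_carbon = symbols.contains(&"C");
    symbols
        .windows(2)
        .all(|w| hill_key(w[0], has_carbon) < hill_key(w[1], has_carbon))
}
```

For instance, `["C", "H", "O"]` passes while `["C", "O", "H"]` fails because O comes before H. Without carbon, `["Cl", "H"]` passes since the order is purely alphabetical.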
Validation against PubChem
This library is tested against the PubChem database, which contains over 123 million compounds. This ensures correctness when parsing real-world chemical data.
We validate both ChemicalFormula (mass analysis) and InChIFormula (the formula layer of InChI).
Specifically, we download the CID-Mass.gz and CID-InChI-Key.gz files, which can be found in the Extras FTP directory of PubChem.
You can run the validation suites yourself:
# Validate Mass Calculation (ChemicalFormula), takes about 55 seconds,
# most of which is just I/O time
# Validate InChI Parsing (InChIFormula), takes about 45 seconds,
# most of which is just I/O time
Validation Results (January 2026)
| Metric | Value |
|---|---|
| Total processed | 123,455,852 |
| Total time required | 58.68 s |
| Processing speed | 2,103,788 cmp/s |
| Exact matches | 66,465 |
| Within tolerance | 122,720,777 |
| Mismatches | 668,610 |
| - Ion mismatches | 106,525 |
| - Neutral mismatches | 562,085 |
| Mass accuracy (within 0.001) | 99.46% |
Note: The remaining ~0.5% mismatches are largely attributed to inconsistencies or errors in the source PubChem records rather than parsing errors.
You can find a report of the worst mismatches in worst_mismatches.md.
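The figures in the results table are internally consistent, which the short check below reproduces. It assumes that the 99.46% accuracy counts both exact matches and matches within tolerance; that reading is ours, inferred from the numbers, and the constant and function names are illustrative.

```rust
// Values copied from the validation results table.
const TOTAL: u64 = 123_455_852;
const EXACT: u64 = 66_465;
const WITHIN_TOLERANCE: u64 = 122_720_777;
const MISMATCHES: u64 = 668_610;
const ION_MISMATCHES: u64 = 106_525;
const NEUTRAL_MISMATCHES: u64 = 562_085;

/// Share of compounds whose mass matched exactly or within tolerance, in percent.
fn accuracy_percent() -> f64 {
    (EXACT + WITHIN_TOLERANCE) as f64 / TOTAL as f64 * 100.0
}

/// True if the table rows add up: matches plus mismatches equal the total,
/// and ion plus neutral mismatches equal all mismatches.
fn table_is_consistent() -> bool {
    EXACT + WITHIN_TOLERANCE + MISMATCHES == TOTAL
        && ION_MISMATCHES + NEUTRAL_MISMATCHES == MISMATCHES
}
```

Rounded to two decimals, `accuracy_percent()` gives the 99.46% reported in the table.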
Benchmarks
This crate includes benchmarks to measure parsing performance for both InChIFormula and ChemicalFormula.
To run the benchmarks:
Current benchmarks cover:
- InChIFormula: Parsing a large mixture string with 76 components (~3.75 µs).
- ChemicalFormula: Parsing a complex formula with unicode subscripts, charges, and multiple elements (C₃₉₀H₄₀₄B₂Br₂ClCs₂F₁₁K₂MnN₂₆Na₂O₁₀₀OsPdS₃W₂³⁻) (~801 ns).
Current Limitations
The parser does not currently support the following, although support may be added in the future:
- Fractional counts (e.g., C1.5H3).
License
This project is licensed under the MIT License. See the LICENSE file for details.