Molecular formulas
A Rust crate for parsing, manipulating, and analyzing molecular formulas.
It has been validated against more than 123 million PubChem compounds (99.46% mass accuracy) and fuzzed for over 10 billion iterations (see the fuzz crate) to harden the parser against arbitrary textual input.
Features
molecular-formulas supports nested groups, hydrates, salts, isotope notation, flexible charge notation, and strict InChI-style formula validation. It provides typed formula variants for general chemical formulas, Hill-sorted InChI formula layers, residual groups, and mineral polymorph prefixes. The crate can inspect elements, isotopes, mixtures, Hill ordering, charge, isotopologue mass, molar mass, and m/z values. Element and isotope data come from elements-rs; optional features include serde, arbitrary/fuzzing, mem_size, and mem_dbg. The crate is no_std with alloc unless fuzzing or mem_dbg is enabled.
Installation
Add this to your Cargo.toml:
[dependencies]
molecular-formulas = "0.1.9"
Usage
Here are some examples of how to use the library:
Basic Parsing and Properties
use std::str::FromStr;
use *;
// efficient u16 counters and i16 charge
// Note: You can use u32 or u64 for larger molecules.
let formula: ChemicalFormula = from_str.unwrap;
println!;
println!;
println!;
println!;
Complex Formulas, Hydrates and Ions
The parser handles parentheses, brackets, hydrates (dot-separated components), and charges.
use std::str::FromStr;
use *;
// Copper(II) sulfate pentahydrate
let hydrate: ChemicalFormula = from_str.unwrap;
assert_eq!;
// An ion with unicode charge notation
let ion: ChemicalFormula = from_str.unwrap;
assert_eq!;
// Recursively nested groups
let complex: ChemicalFormula = from_str.unwrap;
Isotopes
You can specify isotopes using standard notation (superscripts or square brackets).
use std::str::FromStr;
use *;
// Carbon-13 labeled methane
let labeled: ChemicalFormula = from_str.unwrap;
// or
let labeled_unicode: ChemicalFormula = from_str.unwrap;
assert_eq!;
// Check if it contains specific isotopes
let c13 = try_from.unwrap;
assert!;
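To illustrate what reading a superscript mass number involves, here is a minimal standalone sketch that converts Unicode superscript digits (as in ¹³C) into an integer. It does not use this crate and is not its implementation; the function names `superscript_digit` and `parse_superscript_number` are hypothetical.

```rust
/// Maps one Unicode superscript digit to its numeric value.
/// Note that ¹, ² and ³ live in the Latin-1 block, not next to ⁰ and ⁴..⁹.
fn superscript_digit(c: char) -> Option<u32> {
    match c {
        '\u{2070}' => Some(0),
        '\u{00B9}' => Some(1),
        '\u{00B2}' => Some(2),
        '\u{00B3}' => Some(3),
        '\u{2074}'..='\u{2079}' => Some(c as u32 - 0x2070),
        _ => None,
    }
}

/// Parses a run of superscript digits (e.g. "¹³") into a number.
/// Returns None for empty input, non-superscript characters, or overflow.
fn parse_superscript_number(s: &str) -> Option<u32> {
    if s.is_empty() {
        return None;
    }
    s.chars().try_fold(0u32, |acc, c| {
        let digit = superscript_digit(c)?;
        acc.checked_mul(10)?.checked_add(digit)
    })
}
```

With this sketch, `parse_superscript_number("¹³")` yields `Some(13)`, while plain ASCII `"13"` yields `None`.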
OCR-Resistant Parsing
The parser is designed to tolerate common OCR errors and Unicode variants, accepting multiple kinds of hyphens, dashes, and dots in place of their ASCII equivalents.
use std::str::FromStr;
use *;
// Standard notation
let f1: ChemicalFormula = from_str.unwrap;
// OCR error: '。' (Halfwidth Ideographic Full Stop) instead of '.'
let f2: ChemicalFormula = from_str.unwrap;
assert_eq!;
// Standard charge
let c1: ChemicalFormula = from_str.unwrap;
// OCR error: Using En Dash '–' instead of Minus '-'
let c2: ChemicalFormula = from_str.unwrap;
assert_eq!;
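One way to think about this tolerance is as a normalization step that folds look-alike characters into a single canonical one before parsing. The following is a minimal standalone sketch of that idea for dashes only. It is an assumption about the general technique, not this crate's actual implementation, and `normalize_dashes` is a hypothetical name.

```rust
/// Replaces common Unicode dash and minus look-alikes with ASCII '-'.
/// All other characters are passed through unchanged.
fn normalize_dashes(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{2010}'   // hyphen
            | '\u{2011}' // non-breaking hyphen
            | '\u{2012}' // figure dash
            | '\u{2013}' // en dash
            | '\u{2014}' // em dash
            | '\u{2212}' // minus sign
            => '-',
            other => other,
        })
        .collect()
}
```

After normalization, an input written with an en dash compares equal to the same input written with an ASCII minus, so a downstream parser only needs to recognize one character.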
InChI Formula Validation
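The library supports strictly validated InChI-style formulas, which enforce Hill ordering: when carbon is present, C comes first, H second, and the remaining elements follow alphabetically; without carbon, all elements (including H) are sorted alphabetically.
use std::str::FromStr;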
use ParserError;
use *;
// Valid Hill-sorted formula
let valid: InChIFormula = from_str.unwrap;
// Invalid: Not Hill-sorted (O comes before H)
let invalid: = from_str;
assert_eq!;
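The Hill ordering rule itself can be sketched independently of this crate. The function below checks whether a sequence of distinct element symbols is in Hill order; `is_hill_sorted` is a hypothetical name, and the code is an illustration of the rule, not the validation logic the crate uses.

```rust
/// Returns true if `symbols` (distinct element symbols, in the order they
/// appear in a formula) follow the Hill system:
/// - with carbon: C first, then H, then everything else alphabetically;
/// - without carbon: all elements, including H, alphabetically.
fn is_hill_sorted(symbols: &[&str]) -> bool {
    let has_carbon = symbols.contains(&"C");
    // Sort key: a rank that pulls C and H to the front when carbon is
    // present, followed by the symbol itself for alphabetical order.
    let key = |s: &str| -> (u8, String) {
        let rank = match (has_carbon, s) {
            (true, "C") => 0,
            (true, "H") => 1,
            _ => 2,
        };
        (rank, s.to_string())
    };
    // Strictly increasing keys also reject repeated symbols.
    symbols.windows(2).all(|pair| key(pair[0]) < key(pair[1]))
}
```

For example, `is_hill_sorted(&["C", "H", "O"])` is true, `is_hill_sorted(&["C", "O", "H"])` is false because O precedes H, and `is_hill_sorted(&["Cl", "H"])` is true because hydrogen gets no special position when carbon is absent.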
Validation against PubChem
This library is tested against the PubChem database, which contains over 123 million compounds. This ensures correctness when parsing real-world chemical data.
We validate both ChemicalFormula (mass analysis) and InChIFormula (the formula layer of InChI).
Specifically, we download the CID-Mass.gz and CID-InChI-Key.gz files, which are available in the Extras FTP directory of PubChem.
You can run the validation suites yourself:
# Validate Mass Calculation (ChemicalFormula), takes about 55 seconds,
# most of which is just I/O time
# Validate InChI Parsing (InChIFormula), takes about 45 seconds,
# most of which is just I/O time
Validation Results (January 2026)
| Metric | Value |
|---|---|
| Total processed | 123,455,852 |
| Total time required | 58.68 s |
| Processing speed | 2,103,788 cmp/s |
| Exact matches | 66,465 |
| Within tolerance | 122,720,777 |
| Mismatches | 668,610 |
| - Ion mismatches | 106,525 |
| - Neutral mismatches | 562,085 |
| Mass accuracy (within 0.001) | 99.46% |
Note: The remaining ~0.5% mismatches are largely attributed to inconsistencies or errors in the source PubChem records rather than parsing errors.
You can find a report of the worst mismatches in worst_mismatches.md.
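The reported mass accuracy follows directly from the counts in the table above: every record that is not a mismatch (exact match or within tolerance) counts as accurate. A small standalone sketch of that arithmetic, with the hypothetical helper name `accuracy_percent`:

```rust
/// Percentage of records that are not mismatches.
fn accuracy_percent(total: u64, mismatches: u64) -> f64 {
    (total - mismatches) as f64 / total as f64 * 100.0
}

/// Counts copied from the results table.
const TOTAL: u64 = 123_455_852;
const EXACT: u64 = 66_465;
const WITHIN_TOLERANCE: u64 = 122_720_777;
const ION_MISMATCHES: u64 = 106_525;
const NEUTRAL_MISMATCHES: u64 = 562_085;
const MISMATCHES: u64 = ION_MISMATCHES + NEUTRAL_MISMATCHES;
```

The three categories add up to the total (`EXACT + WITHIN_TOLERANCE + MISMATCHES == TOTAL`), and `accuracy_percent(TOTAL, MISMATCHES)` rounds to 99.46.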
Benchmarks
This crate includes benchmarks to measure parsing performance for both InChIFormula and ChemicalFormula.
To run the benchmarks:
Current benchmarks cover:
- InChIFormula: Parsing a large mixture string with 76 components (~3.75 µs).
- ChemicalFormula: Parsing a complex formula with unicode subscripts, charges, and multiple elements (C₃₉₀H₄₀₄B₂Br₂ClCs₂F₁₁K₂MnN₂₆Na₂O₁₀₀OsPdS₃W₂³⁻) (~801 ns).
Current Limitations
The parser does not currently support the following, though support may be added in the future:
- Fractional counts (e.g., C1.5H3).
License
This project is licensed under the MIT License. See the LICENSE file for details.