//! A fast, extensible, memory-efficient asciimath parser
//!
//! This parser produces a parsed tree representation rooted at an
//! [`Expression`][tree::Expression]. The parsed structure keeps references to the underlying
//! string in order to avoid copies, but these strings must still be interpreted as the correct
//! tokens to use the structure.
//!
//! ## Usage
//!
//! ```sh
//! cargo add asciimath-parser
//! ```
//!
//! then
//!
//! ```
//! asciimath_parser::parse("x / y");
//! ```
//!
//! ### Comparisons
//!
//! This library is meant to be a fast, extensible parser. There are a number of Rust libraries
//! that parse and format, or parse and evaluate, but don't expose their underlying parsing logic.
//! Only `asciimath_rs` actually parses expressions. However, that parser allocates extra strings
//! and produces a relatively complicated parse tree. This crate creates a simpler parse tree with
//! string slices as tokens, which allows it to be several times faster than `asciimath_rs`.
//!
//! ```txt
//! test asciimath_parser::example ... bench:       7,912 ns/iter (+/- 1,348)
//! test asciimath_rs::example     ... bench:      41,605 ns/iter (+/- 14,262)
//!
//! test asciimath_parser::random  ... bench:     360,495 ns/iter (+/- 32,231)
//! test asciimath_rs::random      ... bench:   2,522,810 ns/iter (+/- 168,133)
//! ```
//!
//! ## Dialect
//!
//! Asciimath is a loose standard that aims for fault-tolerant parsing while looking close to what
//! you might type in ASCII if you were trying to write math. However, other than the [current
//! buggy implementation](https://github.com/asciimath/asciimathml/blob/master/ASCIIMathML.js),
//! there's no parse standard.
//!
//! The parser is written by hand, so it doesn't quite conform to the grammar below (which is also
//! very ambiguous), but this grammar is close to the way asciimath actually interprets strings. In
//! asciimath, left-right brackets have the highest precedence and almost any argument can be
//! [missing][tree::Simple::Missing], save the first.
//!
//! ```txt
//! v ::= any char | greek letters | numbers | ... | missing
//! u ::= sqrt | text | bb | ...               unary symbols for font commands
//! f ::= sin | cos | ...                      function symbols
//! b ::= frac | root | stackrel | ...         binary symbols
//! l ::= ( | [ | { | (: | {: | ...            left brackets
//! r ::= ) | ] | } | :) | :} | ...            right brackets
//! d ::= '|' | '||'                           left-right brackets
//! R ::= E | E,R                              Matrix row expression
//! M ::= lRr | lRr,M                          Matrix expression
//! S ::= v | lEr | uS | fS | bSS | dEd | lMr  Simple expression
//! P ::= _S | ^S | _S^S                       Power expression
//! I ::= fP?I | SP?                           Intermediate expression
//! E ::= IE | I/I                             Expression
//! ```
//!
//! Left-right brackets are closed greedily, and must match the same string on both sides. If they
//! can't be matched they'll be parsed as a symbol. This is particularly useful for probability
//! conditioning, e.g. "p(x|y)". For matrices, all left brackets must match, all right brackets
//! must match, the number of separators (,) in each row must match, and there needs to be more
//! than one element. This is narrower than asciimath, but avoids the need for hardcoded rules
//! for the difference between a set and a matrix.
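//!
//! A rough, self-contained sketch of these matrix rules follows. `Row` and `is_matrix` are
//! hypothetical names that are not part of this crate, and reading "more than one element" as a
//! total count across all rows is an assumption.
//!
//! ```rust
//! /// Hypothetical summary of one bracketed row, e.g. `[a, b]`.
//! struct Row<'a> {
//!     left: &'a str,
//!     right: &'a str,
//!     /// Number of `,` separators found inside the brackets.
//!     separators: usize,
//! }
//!
//! /// Returns true when every row shares brackets and separator count,
//! /// and the rows hold more than one element in total.
//! fn is_matrix(rows: &[Row<'_>]) -> bool {
//!     let (first, rest) = match rows.split_first() {
//!         Some(parts) => parts,
//!         None => return false,
//!     };
//!     let uniform = rest.iter().all(|row| {
//!         row.left == first.left
//!             && row.right == first.right
//!             && row.separators == first.separators
//!     });
//!     // Each row holds `separators + 1` elements.
//!     let elements = rows.len() * (first.separators + 1);
//!     uniform && elements > 1
//! }
//! ```
//!
//! Under these assumptions, two rows shaped like `[a, b]` pass the check, while a lone `[a]` or
//! rows with differing separator counts do not.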
//!
//! This dialect results in many ways to parse things that conceptually might have the same
//! meaning. `"raw text"` and `text(raw text)` might seem to have the same meaning, but the first
//! is actually parsed as raw text, and the second is parsed as a unary function "text" with an
//! argument. Similarly `1 / 2` and `frac 1 2` both represent the same thing, but the first is a
//! high level [`Frac`][tree::Frac] construct, while the latter is a binary operator called "frac".
//!
//! ### Differences with Asciimath
//!
//! Asciimath's parsing of left-right brackets is confusing, in particular the default way it
//! handles expressions like ||x||. This library tokenizes "||" as one token and tries to match it
//! that way, which produces different results than asciimath. Additionally, asciimath will
//! sometimes put a phantom empty open brace if an expression ends on a "|". This proved difficult
//! to support and seems like an unhelpful edge case, as it could always be substituted with
//! "{: ...  :|".
//!
//! ### Extensions to Asciimath
//!
//! This parser is meant to be extensible, so if there are parts that don't function as desired,
//! they can be tweaked.
//!
//! 1. [`parse`][crate::parse()] uses the default tokenizer, but
//!    [`parse_tokens`][crate::parse_tokens] can be used to parse an iterator of tuples
//!    `(&str, Token)` for any custom tokenization you write.
//! 2. Custom tokenizer options can be used by creating an alternate [`Tokenizer`] using
//!    [`with_tokens`][Tokenizer::with_tokens].
//!    ```
//!    use asciimath_parser::{parse_tokens, Tokenizer, ASCIIMATH_TOKENS};
//!    use asciimath_parser::prefix_map::HashPrefixMap;
//!
//!    let token_map = HashPrefixMap::from_iter(ASCIIMATH_TOKENS);
//!    let parsed = parse_tokens(Tokenizer::with_tokens("...", &token_map, false));
//!    ```
//! 3. Nonstandard tokens can be used instead by creating custom token maps:
//!    ```
//!    use asciimath_parser::{parse_tokens, Tokenizer, Token};
//!    use asciimath_parser::prefix_map::HashPrefixMap;
//!
//!    let token_map = HashPrefixMap::from_iter([
//!        ("@", Token::Symbol),
//!        // ...
//!    ]);
//!    let parsed = parse_tokens(Tokenizer::with_tokens("...", &token_map, true));
//!    ```
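//!
//! The custom tokenization mentioned in point 1 can be pictured with the self-contained sketch
//! below. It is only an illustration: `Kind`, `TABLE` and `tokenize` are hypothetical names that
//! do not exist in this crate, and a linear scan stands in for a real prefix map. It shows
//! longest-prefix matching that yields `(&str, kind)` pairs borrowing from the input, so "||" is
//! tried before "|" and no strings are copied.
//!
//! ```rust
//! /// Hypothetical token kinds, standing in for the crate's `Token`.
//! #[derive(Debug, Clone, Copy, PartialEq, Eq)]
//! enum Kind {
//!     Symbol,
//!     LeftRight,
//!     Ident,
//! }
//!
//! /// Hypothetical token table; a real tokenizer would use a prefix map.
//! const TABLE: &[(&str, Kind)] = &[
//!     ("||", Kind::LeftRight),
//!     ("|", Kind::LeftRight),
//!     ("@", Kind::Symbol),
//! ];
//!
//! /// Splits `input` into `(slice, kind)` pairs that borrow from `input`.
//! fn tokenize(input: &str) -> Vec<(&str, Kind)> {
//!     let mut out = Vec::new();
//!     let mut rest = input.trim_start();
//!     while !rest.is_empty() {
//!         // Longest table entry that is a prefix of the remaining input.
//!         let hit = TABLE
//!             .iter()
//!             .filter(|(text, _)| rest.starts_with(*text))
//!             .max_by_key(|(text, _)| text.len());
//!         let (len, kind) = match hit {
//!             Some((text, kind)) => (text.len(), *kind),
//!             // No entry matched: treat one character as an identifier.
//!             None => (rest.chars().next().map_or(0, char::len_utf8), Kind::Ident),
//!         };
//!         let (token, tail) = rest.split_at(len);
//!         out.push((token, kind));
//!         rest = tail.trim_start();
//!     }
//!     out
//! }
//! ```
//!
//! With this sketch, `tokenize("||x||")` yields the slices "||", "x" and "||", each pointing back
//! into the original input. Pairs of this shape, with the crate's own `Token` in place of `Kind`,
//! are what [`parse_tokens`][crate::parse_tokens] consumes.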
//!
//! ## Design
//!
//! This parser tries to balance a few different goals that shape its design:
//! 1. simple - The "standard" asciimath parser is complicated, makes several passes, is relatively
//!    difficult to tweak or modify, is error-prone, and produces somewhat inconsistent results.
//!    Keeping this parser as simple as possible should make it easy to modify and keep its results
//!    consistent.
//! 2. extensible - Asciimath isn't a standard and there's a lot about it that you might want to
//!    change or extend to suit a particular use case.
//! 3. efficient - Fast and with as little memory as possible. Because asciimath parse trees are
//!    recursive, some heap allocation is still necessary to store the structure.
//!
//! As a result, this parser produces a parsed representation, but doesn't attach any meanings to
//! the tokens in the parsed tree. The default parser treats both "*" and "cdot" as tokens, but
//! doesn't say anywhere that they should be rendered the same. This choice was made so that you
//! could easily add or remove tokens, or even change their meaning, and this library doesn't have
//! to know.
//!
//! If you want to consume this output and make sure the tokens are interpreted correctly, you can
//! use the exported const version of the tokens used to parse. By default
//! [`parse`][crate::parse()] uses [`ASCIIMATH_TOKENS`][crate::ASCIIMATH_TOKENS].
//!
//! ## Tree Structure
//!
//! The parsed representation is a tree-like structure that has a hierarchy of types that roughly
//! follows [`Expression`][tree::Expression] -> [`Intermediate`][tree::Intermediate] ->
//! [`Frac`][tree::Frac] -> [`ScriptFunc`][tree::ScriptFunc] ->
//! [`SimpleScript`][tree::SimpleScript] / [`Func`][tree::Func] -> [`Simple`][tree::Simple]. The
//! exceptions to this hierarchy are [`Group`][tree::Group] and [`Matrix`][tree::Matrix], which are
//! both "simple" structures, but contain nested expressions. All of these types implement `From`
//! from their singleton children, which allows promoting simple types to more complex ones with
//! minimal overhead. All of their members are public, allowing destructuring, especially with the
//! `box_patterns` feature. See [`tree`][crate::tree] for more details.
//!
//! ```
//! use asciimath_parser::tree::{Expression, Simple};
//!
//! let expr = Expression::from_iter([Simple::Ident("x")]);
//! ```
//!
//! ### Manual creation
//!
//! Most of the tree structures implement [From] any of their singular upstream components, and
//! most constructors support anything implementing [Into], meaning that you only need to
//! construct the lowest level argument, and it will get upcast to a higher tree structure as you
//! need it.
//!
//! For example:
//! ```
//! # use asciimath_parser::tree::{Expression, Simple};
//! let expr = Expression::from_iter([Simple::Ident("x")]);
//! ```
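//!
//! How such upcasting can be wired together is sketched below with a toy hierarchy. `Leaf`,
//! `Script` and `Expr` are hypothetical types invented for this illustration and are not the
//! types in [`tree`][crate::tree]. The sketch only shows the general pattern of chaining `From`
//! impls and accepting `Into` when collecting.
//!
//! ```rust
//! use std::iter::FromIterator;
//!
//! /// Hypothetical leaf node, loosely analogous to a "simple" value.
//! #[derive(Debug, PartialEq)]
//! enum Leaf<'a> {
//!     Ident(&'a str),
//!     Missing,
//! }
//!
//! /// Hypothetical middle layer: a leaf with an optional subscript.
//! #[derive(Debug, PartialEq)]
//! struct Script<'a> {
//!     base: Leaf<'a>,
//!     sub: Option<Leaf<'a>>,
//! }
//!
//! /// Hypothetical root: a sequence of scripted values.
//! #[derive(Debug, PartialEq)]
//! struct Expr<'a>(Vec<Script<'a>>);
//!
//! impl<'a> From<Leaf<'a>> for Script<'a> {
//!     fn from(base: Leaf<'a>) -> Self {
//!         Script { base, sub: None }
//!     }
//! }
//!
//! impl<'a> From<Script<'a>> for Expr<'a> {
//!     fn from(script: Script<'a>) -> Self {
//!         Expr(vec![script])
//!     }
//! }
//!
//! // Collecting accepts anything convertible into the middle layer,
//! // so bare leaves are upcast on the way in.
//! impl<'a, T: Into<Script<'a>>> FromIterator<T> for Expr<'a> {
//!     fn from_iter<I: IntoIterator<Item = T>>(iter: I) -> Self {
//!         Expr(iter.into_iter().map(Into::into).collect())
//!     }
//! }
//! ```
//!
//! In this toy version, `Expr::from_iter([Leaf::Ident("x")])` builds the same value as wrapping
//! the leaf in a `Script` by hand and then in an `Expr`.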
#![warn(missing_docs)]
mod parse;
pub mod prefix_map;
mod tokenizer;
pub mod tree;

pub use parse::{parse, parse_tokens};
pub use tokenizer::{Token, Tokenizer, ASCIIMATH_TOKENS};