Crate asciimath_parser

Expand description

A fast extensible memory-efficient asciimath parser

This parser produces a parsed tree representation rooted as an Expression. The parsed structure keeps references to the underlying string in order to avoid copies, but these strings must still be interpreted as the correct tokens to use the structure.

§Usage

cargo add asciimath-parser

then

asciimath_parser::parse("x / y");

§Performance

This library is meant to be a fast extensible parser. Most other rust libraries parse and format, or parse and evaluate, but don’t expose their underlying parsing logic. This parser produces a relatively simple parse tree whose tokens are slices of the input, so it does no extra string allocation. The parse bench measures parsing a corpus of hand-written examples and a batch of random expressions:

test example ... bench:       2,232 ns/iter (+/- 115)
test random  ... bench:     113,775 ns/iter (+/- 6,033)

§Dialect

Asciimath is a loose standard that aims for fault-tolerant parsing while looking close to what you might type in ascii if you were trying. However, other than the current buggy implementation, there’s no parse standard.

The parsing is written manually, so it doesn’t quite conform to this grammar, (which is also very ambiguous), but this grammar is close to the way asciimath actually interprets strings. In asciimath, left-right brackets have the highest precedence and almost any argument can be missing, save the first.

v ::= any char | greek letters | numbers | ... | missing
u ::= sqrt | text | bb | ...               unary symbols for font commands
f ::= sin | cos | ...                      function symbols
b ::= frac | root | stackrel | ...         binary symbols
l ::= ( | [ | { | (: | {: | ...            left brackets
r ::= ) | ] | } | :) | :} | ...            right brackets
d ::= '|' | '||'                           left-right brackets
R ::= E | E,R                              Matrix row expression
M ::= lRr | lRr,M                          Matrix expression
S ::= v | lEr | uS | fS | bSS | dEd | lMr  Simple expression
P ::= _S | ^S | _S^S                       Power expression
I ::= fP?I | SP?                           Intermediate expression
E ::= IE | I/I                             Expression

Left-right brackets are closed greedily, and must match the same string on both sides. If they can’t be matched they’ll be parsed as a symbol. This is particularly useful for probability conditioning, e.g. “p(x|y)”. For matrices, all left brackets must match, all right brackets must match, the number of separators (,) in each row must match, and there needs to be more than one element. This is more narrow than asciimath, but prevents need to have hardcoded rules for the difference between a set and a matrix.

This dialect results in many ways to parse things that conceptually might have the same meaning. "raw test" and text(raw text) might seem to have the same meaning, but the first is actually parsed as raw text, and the second is parsed as a unary function “text” with an argument. Similarly 1 / 2 and frac 1 2 both represent the same thing, but the first is a high level Frac construct, while the later is a binary operator called “frac”.

§Differences with Asciimath

Asciimaths parsing of left-right brackets is confusing, in particular the default way they handle expressions like ||x||. This library tokenizes “||” as one token and tries to match it that way, which produces different results than asciimath. Additionally, asciimath will sometimes put a phantom empty open brace if an expression ends on a “|”. This proved difficult to support and seems like an unuseful edgecase as it could always be substituted with “{: … :|”.

§Extensions to Asciimath

This parser is meant to be extensible, so if there are parts that don’t function as desired, they can be tweaked.

parse uses the default tokenizer, but parse_tokens can be used to parse an iterator of tuples (&str, Token) for any custom tokenization you write.

Custom tokenizer options can be used by creating an alternate Tokenizer using with_tokens.

use asciimath_parser::{parse_tokens, Tokenizer, ASCIIMATH_TOKENS};
use asciimath_parser::prefix_map::HashPrefixMap;

let token_map = HashPrefixMap::from_iter(ASCIIMATH_TOKENS);
let parsed = parse_tokens(Tokenizer::with_tokens("...", &token_map, false));

Nonstandard tokens can be used instead by creating custom token maps:

use asciimath_parser::{parse_tokens, Tokenizer, Token};
use asciimath_parser::prefix_map::HashPrefixMap;

let token_map = HashPrefixMap::from_iter([
    ("@", Token::Symbol),
    // ...
]);
let parsed = parse_tokens(Tokenizer::with_tokens("...", &token_map, true));

§Design

This parser tries to balance a few different goals which mediate it’s design:

simple - The “standard” asciimath parser is complicated, makes several passes, is relatively difficult to tweak or modify, is error-prone, and produces somewhat inconsistent results. By making this parser as simple as possible all of those should be relatively easy.
extensible - Asciimath isn’t a standard and there’s a lot about it that you might want to change, or add to suit a particular usecase.
efficient - Fast and with as little memory as possible. Because the asciimath parse trees are trees, some heap allocation is necessary to store the recursive structure.

As a result, this parser produces a parsed representation, but doesn’t attach any meanings to the tokens in the parsed tree. The default parser treats both “*” and “cdot” as tokens, but doesn’t say anywhere that they should be rendered the same. This choice was made so that you could easily add or remove tokens, or even change their meaning, and this library doesn’t have to know.

If you want to consume this output and make sure the tokens are parsed correctly, you can use the exported const version of the tokens uses to parse. By default parse uses crate::ASCIIMATH_TOKENS

§Tree Structure

The parsed representation is a tree like structure that has a hierarchy of types that roughly follows Expression -> Intermediate -> Frac -> ScriptFunc -> SimpleScript / Func -> Simple. The exceptions to this hierarchy are Group and Matrix that are both “simple” structures, but contain nested expressions. All of these types implement From from their singleton children, allowing promoting simple types to more complex ones with minimal overhead. All of their members are public allowing destructuring, especially with the box_patterns feature. See tree for more details.

use asciimath_parser::tree::{Expression, Simple};

let expr = Expression::from_iter([Simple::Ident("x")]);

§Manual creation

Most the tree structures implement From any of their singular upstream components, and most constructors support anything implementing Into, meaning that you only need only need to construct the lowest level argument, and it will get upcast to a higher tree structure as you need it.

For example:

let expr = Expression::from_iter([Simple::Ident("x")]);

Modules§

prefix_map: PrefixMaps are string keyed maps that support finding values with a longest prefix
tree: The module containing all of the structures that defined a parsed representation of asciimath

Structs§

Tokenizer: A tokenizer where unknown characters are parsed as individual identifiers

Enums§

Token: A parsed token label

Constants§

ASCIIMATH_TOKENS: The tokens for standard asciimath

Functions§

parse: Parse a string returning an asciimath expression
parse_tokens: Parse a tokenized expression