Crate asciimath_parser

source ·
Expand description

A fast extensible memory-efficient asciimath parser

This parser produces a parsed tree representation rooted as an Expression. The parsed structure keeps refrences to the underlying string in order to avoid copies, but these strings must still be interpreted as the correct tokens to use the structure.

Usage

cargo add asciimath-parser

then

asciimath_parser::parse("x / y");

Comparisons

This library is meant to be a fast extensible parser. There are a number of rust libraries that parse and format, or parse and evaluate, but don’t expose their underlying parsing logic. Only asciimath_rs actual parses expressions. However, that parser allocates extra strings, and produces a relatively complicated parse tree. This creates a relatively simpler parse tree with string slices as tokens. This allows this parser be several times faster than asciimath_rs.

test asciimath_parser::example ... bench:       7,912 ns/iter (+/- 1,348)
test asciimath_rs::example     ... bench:      41,605 ns/iter (+/- 14,262)

test asciimath_parser::random  ... bench:     360,495 ns/iter (+/- 32,231)
test asciimath_rs::random      ... bench:   2,522,810 ns/iter (+/- 168,133)

Dialect

Asciimath is a loose standard that aims for fault-tolerant parsing while looking close to what you might type in ascii if you were trying. However, other than the current buggy implementation, there’s no parse standard.

The parsing is written manually, so it doesn’t quite conform to this grammar, (which is also very ambiguous), but this grammar is close to the way asciimath actually interprets strings. In asciimath, left-right brackets have the highest precidence and almost any argument can be missing, save the first.

v ::= any char | greek letters | numbers | ... | missing
u ::= sqrt | text | bb | ...               unary symbols for font commands
f ::= sin | cos | ...                      function symbols
b ::= frac | root | stackrel | ...         binary symbols
l ::= ( | [ | { | (: | {: | ...            left brackets
r ::= ) | ] | } | :) | :} | ...            right brackets
d ::= '|' | '||'                           left-right brackets
R ::= E | E,R                              Matrix row expression
M ::= lRr | lRr,M                          Matrix expression
S ::= v | lEr | uS | fS | bSS | dEd | lMr  Simple expression
P ::= _S | ^S | _S^S                       Power expressiong
I ::= fP?I | SP?                           Intermediate expression
E ::= IE | I/I                             Expression

Left-right brackets are closed greedily, and must match the same string on both sides. If they can’t be matched they’ll be parsed as a symbol. This is particularly useful for probabilitiy conditioning, e.g. “p(x|y)”. For matrices, all left brackets must match, all right brackets must match, the number of seperators (,) in each row must match, and there needs to be more than one element. This is more narrow than asciimath, but prevents need to have hardcoded rules for the difference between a set and a matrix.

This dialect results in many ways to parse things that conceptually might have the same meaning. "raw test" and text(raw text) might seem to have the same meaning, but the first is actually parsed as raw text, and the second is parsed as a unary function “text” with an argument. Similarly 1 / 2 and frac 1 2 both represent the same thing, but the first is a high level Frac construct, while the later is a binary operator called “frac”.

Differences with Asciimath

Asciimaths parsing of left-right brackets is confusing, in particular the default way they handle expressions like ||x||. This library tokenizes “||” as one token and tries to match it that way, which produces different results than asciimath. Additionally, asciimath will sometimes put a phantom empty open brace if an expression ends on a “|”. This proved difficult to support and seemes like an unuseful edgecase as it could always be substituted with “{: … :|”.

Extensions to Asciimath

This parser is meant to be extensible, so if there are parts that don’t function as desired, they can be tweaked.

  1. parse uses the default tokenizer, but parse_tokens can be used to parse an iterator of tuples (&str, Token) for ant custom tokenization you write.
  2. Custom tokenizer options can be used by creating an alternate Tokenizer using with_tokens.
    use asciimath_parser::{parse_tokens, Tokenizer, ASCIIMATH_TOKENS};
    use asciimath_parser::prefix_map::HashPrefixMap;
    
    let token_map = HashPrefixMap::from_iter(ASCIIMATH_TOKENS);
    let parsed = parse_tokens(Tokenizer::with_tokens("...", &token_map, false));
  3. Nonstandard tokens can be used instead by creating custom token maps:
    use asciimath_parser::{parse_tokens, Tokenizer, Token};
    use asciimath_parser::prefix_map::HashPrefixMap;
    
    let token_map = HashPrefixMap::from_iter([
        ("@", Token::Symbol),
        // ...
    ]);
    let parsed = parse_tokens(Tokenizer::with_tokens("...", &token_map, true));

Design

This parser tries to balance a few different goals which mediate it’s design:

  1. simple - The “standard” asciimath parser is complicated, makes several passes, is relatively difficult to tweak or modify, is error-prone, and produces somewhat inconsistent results. By making this parser as simple as possible all of those should be relatively easy.
  2. extensible - Asciimath isn’t a standard and there’s a lot about it that you might want to change, or add to suit a particular usecase.
  3. efficient - Fast and with as little memory as possible. Because the asciimath parse trees are trees, some heap allocation is necessary to store the recursive structure.

As a result, this parser produces a parsed representation, but doesn’t attach any meanings to the tokens in the parsed tree. The default parser treats both “*” and “cdot” as tokens, but doesn’t say anywhere that they should be rendered the same. This choice was made so that you could easily add or remove tokens, or even change their meaning, and this library doesn’t have to know.

If you want to consume this output and make sure the tokens are parsed correctly, you can use the exported const version of the tokens uses to parse. By default parse uses crate::ASCIIMATH_TOKENS

Tree Structure

The parsed representation is a tree like structure that has a hierarchy of types that roughly follows Expression -> Intermediate -> Frac -> ScriptFunc -> SimpleScript / Func -> Simple. The exceptions to this hierarchy are Group and Matrix that are both “simple” structures, but contain nested expressions. All of these types implement From from their singleton children, allowing promoting simple types to more complex ones with minimal overhead. All of their members are public allowing destrucutring, especially with the box_patterns feature. See tree for more details.

use asciimath_parser::tree::{Expression, Simple};

let expr = Expression::from_iter([Simple::Ident("x")]);

Manual creation

Most the tree structures implement From any of their singular upstream components, and most constructors support anything implementing Into, meaning that you only need only need to construct the lowest level argument, and it will get upcast to a higher tree structure as you need it.

For example:

let expr = Expression::from_iter([Simple::Ident("x")]);

Modules

  • PrefixMaps are string keyed maps that support finding values with a longest prefix
  • The module containing all of the structures that defined a parsed representation of asciimath

Structs

  • A tokenizer where unknown characters are parsed as individual identifiers

Enums

  • A parsed token label

Constants

Functions

  • Parse a string returning an asciimath expression
  • Parse a tokenized expression