# tuck5: The Next Attempt

(at making a parser generator)

Tuck5 is a lexer/parser generation library written in Rust with a dual focus on 
performance and ease of use. The `meta` system is the main, and recommended, 
method of interacting with tuck5. It provides a domain-specific language that 
lets developers concisely define how a sequence of tokens should be 
transformed. Under the hood, this simple syntax is executed as tight for-loops, 
with a high degree of developer control over execution.

## Understanding the TUCK nature


tuck5 uses a system of tokens, as is to be expected of a tokenizer framework. 
Tokens are either branches or leaves. A leaf contains a range of indices into 
the corpus text; a branch contains a vector of tokens as its children. Tokens 
also carry some extra data: as the developer, you can statically control this 
data to fit your needs. (Note: if you are using the `meta` system, this 
auxiliary data takes the form of a vector of owned strings, which act as tags. 
A transition to string pointers, or perhaps a bitflags-based system, is 
planned; this will not affect the experience of using the `meta` system.)
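
Conceptually, a token looks something like the sketch below (an illustration 
of the shape just described, not the crate's actual definition):

```rust
// Illustrative sketch only, not the crate's actual definition.
// `Data` is the developer-controlled extra data; in the `meta`
// system it is a Vec<String> of tags.
enum Token<Data> {
    // A leaf spans a range of indices into the corpus text.
    Leaf { range: std::ops::Range<usize>, data: Data },
    // A branch owns its child tokens instead of corpus indices.
    Branch { children: Vec<Token<Data>>, data: Data },
}
```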

To parse a corpus string, the characters of the string are initially converted 
into a series of leaf tokens where each token has two tags: `a`, where `a` is 
the content of the token, and `u97`, where `97` is the decimal value of the 
character's UTF-8 encoding. A third tag, `ws`, is added ONLY if the character 
is classified as Unicode whitespace.
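
For example, the two-character corpus `a ` (the letter `a` followed by a 
space) initially becomes two leaf tokens:

```
token "a":  tags  a, u97        # 'a' has decimal value 97
token " ":  tags  " ", u32, ws  # the space is 32 and is Unicode whitespace
```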

Then, a series of rules are applied to the tokens. A rule consists of a 
condition that may match a sequence of tokens, a transformation method 
(whether a matching sequence is converted into a single leaf token, a single 
branch token, or removed entirely), and a series of tags to apply to the token 
that the matching tokens will be converted into. For examples of this, see 
[Using the `meta` system](#using-the-meta-system) below.

Each rule is applied to the tokens one by one. Many rules accept multiple 
tokens; each has specifically defined behavior for reaching the end of the 
corpus string. Each rule is applied repeatedly to the corpus string until it 
can no longer be applied. If specified, a group of rules can be applied in 
sequence, repeatedly, until none of them can be applied.
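
For instance, the parenthesis rule from the walkthrough below only strips one 
pair of parentheses per application; because it is reapplied until it fails to 
match, nested pairs such as `((x))` are unwrapped over successive applications:

```
'(' & expr & ')' : parens, expr;
```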

## Using the `meta` system


Define a meta-parser (a parser generated by the `meta` system) with the 
following syntax.

### Rules


`'abc'` for any string `abc`: matches if the next token's content exactly 
equals the characters between the `'`s.

`a..z` for any characters `a` and `z`: matches if the next token is one 
character long, and that character is between `a` and `z` inclusive.

`{.abc.}` for any string `abc` with *x* characters: matches if the next *x* 
tokens exactly match each character in `abc`.

`tag` for any string `tag`: matches if the next token has the tag `tag`.

`rule1 & rule2`: matches if `rule1` applies from token *x* to token *y*, then 
`rule2` applies from token *y + 1* to token *z*. Can be chained.

`rule1 | rule2`: matches if either `rule1` or `rule2` applies. Can be chained.

`rule?`: If `rule` matches the next *x* tokens, this matches for *x* tokens. 
Otherwise, matches 0 tokens. Not intended to be used alone (it will probably 
lock your parser up until warnings are implemented).

`rule*`: Repeatedly matches `rule`. If `rule` matches tokens *x* to *y*, it 
tries again on tokens *y* to *z*, and so on. Once `rule` stops matching, this 
matches all the tokens that `rule` did match. Once again, not intended to be 
used alone, as it will match even if `rule` matched 0 times.

`rule+`: Matches `rule` once, then repeatedly matches it as above. Equivalent 
to `rule & rule*`. Can be used alone.
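
As a combined (hypothetical) illustration, the rule below matches one or more 
ASCII letters followed by zero or more digits; parentheses are used to make 
the grouping explicit rather than relying on operator precedence:

```
((a..z | A..Z)+ & (0..9)*) . ident;
```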

### Transformations


`rule . tag0, tag1, etc;`: If `rule` matches the next *x* tokens, transforms 
those tokens into one leaf token (discarding the data of the "children") with 
the specified tags. More performant than the alternative below.

`rule : tag0, tag1, etc;`: If `rule` matches the next *x* tokens, transforms 
those tokens into one branch token (with the transformed tokens as the 
children) with the specified tags. Useful for converting into an AST.

`rule~;`: If `rule` matches the next *x* tokens, removes those tokens. Most 
often used for whitespace removal, but if AST generation is the goal, it can 
also be used for comments.
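
Side by side, using rules like those in the walkthrough below:

```
(0..9)+ . int, expr;        # leaf: digit tokens collapsed, children discarded
'(' & expr & ')' : parens;  # branch: the three matched tokens become children
ws~;                        # removal: whitespace tokens are deleted outright
```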

### Miscellaneous


`# bla bla bla`: A one-line comment, terminated by a line break.

`## bla bla bla ##`: A multi-line comment.

`{ transformation transformation ... }`: Apply each of these transformations 
in order until none of them match. Useful for preserving an order of 
operations.

`(rule)`: The same as `rule` on its own. Useful for controlling the order of 
operations of the `meta` system, or simply for clarity.
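
A hypothetical fragment combining these pieces:

```
## everything in the braces below is applied
in sequence, repeatedly, until nothing matches ##
{
    ws~;                       # strip whitespace first
    ('a' | 'b') & 'c' . pair;  # the parentheses group the alternation
}
```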

### Examples


See the `calculator` folder in `tests` for a full example of the `meta` system.
Here is a walkthrough of a simpler version of it.

```
(0..9)+. positive, int, expr;
```

Here, we define our positive integers. `0..9` represents a range of characters,
which will match the digits from 0 to 9, and the `+` signifies matching one or 
more of them. The `.` then collapses the matched digits into a single leaf 
token carrying the tags `positive`, `int`, and `expr`.

```
int & '.' & int. positive, decimal, expr;
```

Here, `int` matches any token we just defined with the tag `int`. If the 
sequence of an integer, followed by a decimal point, followed by another 
integer (which, as we defined it, may have leading zeroes) is found, the three 
tokens are collapsed into a single decimal token.

```
'-' & positive. negative, expr;
```

Now we define the negative numbers. If a minus sign followed by a positive 
number is matched, the pair is collapsed into one token carrying the 
`negative` and `expr` tags.

```
ws~;
```

Here, we remove whitespace, as specified by the pre-defined `ws` tag.

```
{
    '(' & expr & ')': parens, expr;
}
```

The braces around this statement mean that more transformations will follow 
it, and that they should be applied in sequence, repeatedly. We will see this 
in action when we add the operations. For now, if parentheses surround a token 
with the `expr` tag (which, you may have noticed, was added to every number), 
all three tokens are transformed into one branch with the `parens` and `expr` 
tags.

```
{
    '(' & expr & ')': parens, expr;
    expr & '*' | '/' & expr: oper, expr;
}
```

Here, we add multiplication and division to our repeated transformations. With 
the braces, you'll find that any sequence of nested parentheses, 
multiplications, and divisions will follow the rules of PEMDAS.

```
{
    '(' & expr & ')': parens, expr;
    expr & '*' | '/' & expr: oper, expr;
    expr & '+' | '-' & expr: oper, expr;
}
```

Finally, we complete our simple calculator by adding addition and subtraction. 
To access the tagged tokens from Rust, use the `tuck5::meta::eval_prog_from_text`
function with the text of your meta-parser and the text you'd like parsed. The 
result will be a vector of tokens.
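
A minimal sketch of that call, assuming `eval_prog_from_text` takes the 
grammar source and the corpus as strings and returns the resulting token 
vector (the exact signature, including any error handling, may differ; check 
the crate docs):

```rust
// Sketch only: the signature of eval_prog_from_text is assumed here,
// and it may return a Result in practice.
fn main() {
    // The simple calculator grammar from this walkthrough.
    let grammar = r#"
        (0..9)+. positive, int, expr;
        int & '.' & int. positive, decimal, expr;
        '-' & positive. negative, expr;
        ws~;
        {
            '(' & expr & ')': parens, expr;
            expr & '*' | '/' & expr: oper, expr;
            expr & '+' | '-' & expr: oper, expr;
        }
    "#;
    let corpus = "1 + 2 * 3";

    let tokens = tuck5::meta::eval_prog_from_text(grammar, corpus);
    println!("parsed into {} top-level tokens", tokens.len());
}
```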