# Parsing Primitives

This chapter walks through the parsing primitives of **Trivet**.

## Making a New Parser

Instantiate a new parser with the `Parser::new` method. You need to specify the following arguments.

- A human-readable name for the parsing source. This would ideally be a file name, but it might also be something like `"<console>"` to indicate that standard input is being parsed.
- A source of characters to parse, given by an instance of `trivet::decoder::Decode`.

There are convenience functions for creating parsers from different common sources. Using one of these functions avoids the need to use `Decode` directly.

| Function            | Use                                                  |
| ------------------- | ---------------------------------------------------- |
| `parse_from_bytes`  | Create a parser for a byte sequence (`&[u8]`)        |
| `parse_from_string` | Create a parser for a string (`&str`)                |
| `parse_from_stdin`  | Create a parser for standard input                   |
| `parse_from_path`   | Create a parser for a file specified by a `&PathBuf` |

## Peeking and Consuming

The primary actions are _peeking_ at the stream, which lets you look ahead at upcoming characters, and _consuming_, which discards characters from the stream.

Note that peeking and consuming work on _characters_. The input stream is decoded by the `trivet::decoder::Decode` struct, which supports reading both UTF-16 and UTF-8 and provides decoded characters to the rest of the parser.

| Method                                               | Use                                                   |
| ---------------------------------------------------- | ----------------------------------------------------- |
| `peek() -> ParseResult<Option<char>>`                | Peek (look ahead) at the next character in the stream |
| `consume()`                                          | Discard the next character from the stream            |
| `is_at_eof() -> bool`                                | Return true if the stream is exhausted                |
| `peek_and_consume(ch: char) -> ParseResult<bool>`    | Peek and optionally consume (see below)               |
| `peek_and_consume_ws(ch: char) -> ParseResult<bool>` | Peek and optionally consume (see below)               |

You will probably use `peek` and `consume` quite a bit to deal with single characters, but these are not ideal for larger things like keywords.

Often if you peek and match a specific character, you will then want to consume it. For example, you might have code like this.

```rust,ignore
if parser.peek()? == Some('"') {
    parser.consume();
    // Do whatever else needs to be done.
}
```

This can be simplified with `peek_and_consume(ch: char) -> ParseResult<bool>`. This method checks whether the next character is `ch` and, if so, consumes it and returns `true`. We can replace the above with the following.

```rust,ignore
if parser.peek_and_consume('"')? {
    // Do whatever else needs to be done.
}
```

For example, if the stream contains `" and more`, then the above would match and consume the `"` and leave the stream pointing at the space following it.

Often once you have matched something like a closing brace, you then want to consume any trailing whitespace. The `peek_and_consume_ws(ch: char) -> ParseResult<bool>` method does exactly that.

For example, if the stream contains `" and more`, then `peek_and_consume_ws('"')` would match and consume the `"` and the trailing space, and leave the stream pointing at `a`.

Internally the parser maintains a small lookahead buffer that is filled only as needed from the decoder.
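
The behavior of these methods can be illustrated without **Trivet** itself. The following is a minimal sketch built on the standard library's `Peekable` iterator. The `MiniParser` type is hypothetical and exists only for illustration; it is not how **Trivet** is implemented, and unlike the real methods it does not return a `ParseResult`.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Hypothetical stand-in used only for this sketch; not Trivet's `Parser`.
pub struct MiniParser<'a> {
    chars: Peekable<Chars<'a>>,
}

impl<'a> MiniParser<'a> {
    pub fn new(source: &'a str) -> Self {
        MiniParser { chars: source.chars().peekable() }
    }

    /// Look at the next character without consuming it.
    pub fn peek(&mut self) -> Option<char> {
        self.chars.peek().copied()
    }

    /// Discard the next character, if there is one.
    pub fn consume(&mut self) {
        self.chars.next();
    }

    /// Consume `ch` if it is the next character; report whether it matched.
    pub fn peek_and_consume(&mut self, ch: char) -> bool {
        if self.peek() == Some(ch) {
            self.consume();
            true
        } else {
            false
        }
    }

    /// Like `peek_and_consume`, but also skip whitespace after a match.
    pub fn peek_and_consume_ws(&mut self, ch: char) -> bool {
        let matched = self.peek_and_consume(ch);
        if matched {
            while self.peek().map_or(false, char::is_whitespace) {
                self.consume();
            }
        }
        matched
    }
}
```

With the input `" and more`, calling `peek_and_consume('"')` on this sketch leaves the next character as the space, while `peek_and_consume_ws('"')` leaves it as `a`.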

## Detecting Stalling

**Trivet** keeps track of the number of times you _peek_ at a character. If you peek too many times (1,000 by default; see `trivet::PEEK_LIMIT`) without consuming anything, **Trivet** assumes the parse has stalled and panics. Likewise, if you try to consume too many times (1,000 by default; see `trivet::EOF_LIMIT`) after the end of file has been reached, **Trivet** assumes the parse has stalled and panics.
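
The idea behind stall detection can be sketched as a counter that is incremented on every peek and reset on every consume. The `StallGuard` type and `SKETCH_PEEK_LIMIT` constant below are hypothetical names invented for this illustration. The sketch reports an error rather than panicking so that it is easy to test, and whether the real limit trips on exactly the 1,000th peek or the one after it is an assumption here.

```rust
/// Hypothetical limit for this sketch; Trivet's own default is `trivet::PEEK_LIMIT`.
pub const SKETCH_PEEK_LIMIT: usize = 1_000;

/// Counts peeks since the last consume. Illustration only.
pub struct StallGuard {
    peeks_since_consume: usize,
    limit: usize,
}

impl StallGuard {
    pub fn new(limit: usize) -> Self {
        StallGuard { peeks_since_consume: 0, limit }
    }

    /// Record a peek. Returns an error once more than `limit` peeks
    /// have happened without an intervening consume.
    pub fn on_peek(&mut self) -> Result<(), String> {
        self.peeks_since_consume += 1;
        if self.peeks_since_consume > self.limit {
            Err(format!(
                "parse stalled: {} peeks without consuming",
                self.peeks_since_consume
            ))
        } else {
            Ok(())
        }
    }

    /// Record a consume; this is progress, so the counter resets.
    pub fn on_consume(&mut self) {
        self.peeks_since_consume = 0;
    }
}

impl Default for StallGuard {
    fn default() -> Self {
        StallGuard::new(SKETCH_PEEK_LIMIT)
    }
}
```

In practice a stall usually indicates a loop that peeks at the same character on every iteration and never reaches a branch that consumes it.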

## Getting the Parse Location

| Method         | Use                            |
| -------------- | ------------------------------ |
| `loc() -> Loc` | Get current location in parse. |

The current position (the location of the _next_ character) is given by `loc() -> Loc`. This returns a structure that contains the name of the source (specified when the parser was created) along with the 1-based line number and 1-based column number.

A useful tactic for parsing large structures is to save the location (the returned `Loc`) with the structure so that, if later processing reveals an error, you can cite the location of the error in the stream. For instance, the example in [Building Parsers](building.md) stores the location of each token parsed.

```rust,ignore
# // Include the type for Thing.
{{#include ../../examples/book_building_sequence.rs:15:19}}
```

At the start of parsing a structure, you may want to save the current location of the parse (`let loc = parser.loc();`) so you can refer to where the structure starts later on.
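
To make the 1-based line and column bookkeeping concrete, here is a minimal sketch of a location tracker. The `SketchLoc` and `Tracker` types are hypothetical and are not **Trivet**'s `Loc` or `Parser`; the sketch also assumes that only `\n` ends a line.

```rust
use std::str::Chars;

/// Hypothetical location type for this sketch; not Trivet's `Loc`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SketchLoc {
    pub source: String,
    pub line: usize,
    pub column: usize,
}

/// Tracks the location of the *next* character to be consumed.
pub struct Tracker<'a> {
    source: &'a str,
    chars: Chars<'a>,
    line: usize,
    column: usize,
}

impl<'a> Tracker<'a> {
    pub fn new(source: &'a str, text: &'a str) -> Self {
        // Both line and column are 1-based.
        Tracker { source, chars: text.chars(), line: 1, column: 1 }
    }

    /// Location of the next character.
    pub fn loc(&self) -> SketchLoc {
        SketchLoc {
            source: self.source.to_string(),
            line: self.line,
            column: self.column,
        }
    }

    /// Consume one character and update the location.
    pub fn consume(&mut self) -> Option<char> {
        let ch = self.chars.next()?;
        if ch == '\n' {
            self.line += 1;
            self.column = 1;
        } else {
            self.column += 1;
        }
        Some(ch)
    }
}
```

Saving `tracker.loc()` before parsing a structure gives a value that still points at the start of that structure after more input has been consumed.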

## Peeking for Strings

Working character by character is not very efficient, so **Trivet** has other versions of _peek_ and _consume_. The most common ones work based on matching upcoming strings.

| Method                                         | Use                                                                                                    |
| ---------------------------------------------- | ------------------------------------------------------------------------------------------------------ |
| `peek_str(value: &str) -> bool`                | Peek at the next characters of input and return `true` iff they are the string `value`                 |
| `peek_str_greedy(value: &str) -> bool`         | Peek at the next characters of input and return `true` iff they are the string `value` (but see below) |
| `peek_and_consume_str(value: &str) -> bool`    | Peek and optionally consume (see below)                                                                |
| `peek_and_consume_str_ws(value: &str) -> bool` | Peek and optionally consume (see below)                                                                |

This is clearly better. You could look for a keyword with `peek_str("frida")`. This will return `true` iff the next five characters are `frida`.

> There are also methods that use `&[char]` instead of `&str`. See the section on [handling comments](#handling-comments) for an example of when those methods should be used.

What if you want to recognize a string like three double quotation marks (`"""`)? An interesting question is what to do about _four_ double quotation marks (`""""`): should we match on the first three, or the last three? `peek_str("\"\"\"")` does just what you expect and matches on the first three. That might not be what you want, so there is also `peek_str_greedy("\"\"\"")`, which parses this as `"` followed by `"""`. That is, it will _not_ match at the start of the sequence, but _will_ match after the first quotation mark has been consumed.

More conveniently, `peek_and_consume_str(value: &str) -> bool` peeks at the stream and, if it finds `value` as the next sequence of characters, consumes all characters and then returns `true`. Otherwise it returns `false`. The method `peek_and_consume_str_ws(value: &str) -> bool` does the same thing but also consumes any trailing whitespace.

> Lookahead is limited to the length of the buffer, which is 64 KiB by default (see `trivet::MAX_LOOKAHEAD`).
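
The difference between plain and greedy matching described in this section can be sketched over a slice of characters. The functions below are hypothetical and only model one plausible reading of the greedy rule: a greedy match succeeds at a position only if the same string does not also match one character later. **Trivet**'s actual implementation may differ.

```rust
/// True iff `input[pos..]` starts with `value`.
pub fn starts_at(input: &[char], pos: usize, value: &[char]) -> bool {
    input.get(pos..).map_or(false, |rest| rest.starts_with(value))
}

/// Plain match: succeed as soon as `value` is found at `pos`.
pub fn peek_plain(input: &[char], pos: usize, value: &[char]) -> bool {
    starts_at(input, pos, value)
}

/// Greedy match (assumed rule for this sketch): succeed at `pos` only if
/// `value` matches there and does NOT also match at `pos + 1`.
pub fn peek_greedy(input: &[char], pos: usize, value: &[char]) -> bool {
    starts_at(input, pos, value) && !starts_at(input, pos + 1, value)
}
```

With four quotation marks as input and `"""` as the value, `peek_plain` succeeds at position 0, while `peek_greedy` fails at position 0 and succeeds at position 1.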

## Handling Whitespace

As you've seen, **Trivet** requires you to handle whitespace directly. In reality, parsers always need to know where whitespace is allowed. **Trivet** just requires you to be a little more explicit about it.

| Method                      | Use                                                                                                              |
| --------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| `consume_ws_only() -> bool` | Consume all whitespace and leave the parser pointing to the first non-whitespace character.                      |
| `consume_ws() -> bool`      | Consume all whitespace and leave the parser pointing to the first non-whitespace character; may consume comments |

> Some languages have _significant_ whitespace, like [YAML][] and [Python][]. You can still track the initial whitespace by using `get_column_number() -> usize` to check the current column of the parse. In these cases you may want to handle comments and whitespace separately.

You can consume all whitespace using either of these methods. This leaves the parser pointing at the first non-whitespace character. Because `consume_ws() -> bool` also looks for comments, it is a bit slower than `consume_ws_only() -> bool`, but it also handles comments automatically, so it really depends on your application.

It is important to be _consistent_ with this. The common strategy is for a method that parses a specific structure to do the following.

- Assume _on entry_ that the parser points to the first character of the thing to parse.
- Consume any trailing whitespace after the structure before returning.
- Add a `_ws` suffix to a method that consumes trailing whitespace.
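
The convention in the list above can be sketched with plain standard-library iterators. The function names here (`skip_ws`, `parse_ident_ws`, `parse_idents`) are hypothetical and chosen only to show the pattern; they are not part of **Trivet**.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Skip whitespace; return true if anything was consumed.
pub fn skip_ws(chars: &mut Peekable<Chars<'_>>) -> bool {
    let mut any = false;
    while chars.next_if(|ch| ch.is_whitespace()).is_some() {
        any = true;
    }
    any
}

/// Parse an identifier. Assumes on entry that the stream points at its
/// first character, and consumes trailing whitespace before returning
/// (hence the `_ws` suffix).
pub fn parse_ident_ws(chars: &mut Peekable<Chars<'_>>) -> String {
    let mut ident = String::new();
    while let Some(ch) = chars.next_if(|ch| ch.is_alphanumeric() || *ch == '_') {
        ident.push(ch);
    }
    skip_ws(chars);
    ident
}

/// Parse a whitespace-separated list of identifiers.
pub fn parse_idents(text: &str) -> Vec<String> {
    let mut chars = text.chars().peekable();
    // Leading whitespace is handled once, by the outermost caller.
    skip_ws(&mut chars);
    let mut out = Vec::new();
    while chars.peek().is_some() {
        let ident = parse_ident_ws(&mut chars);
        if ident.is_empty() {
            // Unexpected character: stop rather than loop forever.
            break;
        }
        out.push(ident);
    }
    out
}
```

Because every `_ws` function leaves the stream at the next non-whitespace character, each one can assume it starts at the first character of the thing it parses, and no function needs to skip leading whitespace itself.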

For more on how comments are handled, see the [section on comments](#handling-comments).

## Handling Comments

**Trivet** comes with support for consuming (and discarding) comments in several different forms. By default only C and C++-style comment parsing is enabled.

Comment parsing is actually done by the struct `trivet::comments::CommentParser`. A comment parser is already installed in an instance of `trivet::Parser`, and you can obtain mutable access to it to configure it using `borrow_comment_parser() -> &mut CommentParser`.

| Method                                          | Use                                                        |
| ----------------------------------------------- | ---------------------------------------------------------- |
| `borrow_comment_parser() -> &mut CommentParser` | Obtain a mutable reference to the internal comment parser. |

The following forms of comment are supported.

| Comment Style  | Flag                | Default |
| -------------- | ------------------- | ------- |
| `/* ... */`    | `enable_c`          | true    |
| `// ...`       | `enable_cpp`        | true    |
| `<# ... #>`    | `enable_powershell` | false   |
| `# ...`        | `enable_python`     | false   |
| `<!-- ... -->` | `enable_xml`        | false   |

In addition to these comment forms, you can add your own. Define a method that does the following.

- Accepts a `trivet::ParserCore` instance
- Consumes any comments you wish
- Returns `true` if any comments are consumed

The return value is important; `CommentParser::process`, which is used by `consume_ws() -> bool`, relies on it to determine when all comments have been consumed.

Set the `custom` field of the comment parser to a `Box` instance containing your method, and then set
the `enable_custom` field on the comment parser to `true`.

The following is an example that parses [Lua][] comments, which are a bit notorious for being persnickety. `--[[` starts a multi-line comment, but `---[[` starts a single-line comment (because the `--` begins a single line comment unless immediately followed by `[[`).

```rust,ignore
{{#include ../../examples/book_primitives_lua.rs}}
```

Note that inside the body of the closure we use `peek_and_consume_chars(&[char]) -> bool` instead of the (simpler) `peek_and_consume_str(&str) -> bool`. This is because we have to use the `ParserCore` instead of `Parser`, and also because `peek_and_consume_chars(&[char]) -> bool` is a bit faster.

## Parsing Sequences

Often you want to parse a stream of characters of a specific class (like digits). **Trivet** provides convenience methods for this.

| Method                                                            | Use                                                |
| ----------------------------------------------------------------- | -------------------------------------------------- |
| `take_while(Fn(char) -> bool) -> String`                          | Consume characters while a predicate is true.      |
| `take_while_unless(Fn(char) -> bool, Fn(char) -> bool) -> String` | Consume characters, but exclude some.              |
| `take_until(Fn(char) -> bool) -> String`                          | Consume characters until a predicate is true.      |

Suppose you want to parse a hexadecimal number. You could write the following.

```rust,ignore
let digits = parse.take_while(|ch| ch.is_ascii_hexdigit());
```

The return value will be all hexadecimal digits (if any) that are read.

In Python and some other languages, you can include underscores in numbers to break them into groups, such as `1_235_400` instead of `1235400`. That is easy to accomplish with a slightly different form.

```rust,ignore
let digits = parse.take_while_unless(|ch| ch.is_ascii_digit(), |ch| ch == '_');
```

Here parsing consumes both the digits and the underscores, but only the digits are returned.
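
The include-and-exclude behavior can be sketched as a standalone function over a `Peekable` character iterator. The name `take_while_unless_sketch` is hypothetical; this is an illustration of the idea under that assumption, not **Trivet**'s implementation.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Consume characters while either predicate accepts them. Characters
/// accepted by `include` are returned; characters accepted only by
/// `exclude` are consumed and dropped. Stops at the first character
/// that neither predicate accepts.
pub fn take_while_unless_sketch(
    chars: &mut Peekable<Chars<'_>>,
    include: impl Fn(char) -> bool,
    exclude: impl Fn(char) -> bool,
) -> String {
    let mut out = String::new();
    while let Some(&ch) = chars.peek() {
        if include(ch) {
            out.push(ch);
        } else if !exclude(ch) {
            break;
        }
        chars.next();
    }
    out
}
```

Given the input `1_235_400 rest` with a digit predicate and an underscore predicate, the sketch returns `1235400` and leaves the stream at the space before `rest`.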

Finally, sometimes you want to consume everything until an end token is found, such as the end of line. You can use `take_until(&str) -> ParseResult<String>` for that. This method uses the _greedy_ version of matching described for `peek_str_greedy(&str) -> ParseResult<bool>` [above](primitives.md#peeking-for-strings), so it will handle things like multi-line Python comments correctly.

```rust,ignore
let text = parse.take_until("\"\"\"")?;
```

[YAML]: https://yaml.org/
[Python]: https://www.python.org/
[Lua]: https://www.lua.org/