
# Parsing Strings

By default, **Trivet** can process strings containing simple escapes like `\n`, ASCII escapes like `\x0d`, Unicode escapes like `\u{2020}`, and named Unicode escapes like `\N{dagger}`. The string parser is highly configurable.

To parse a string starting at the current position in the parse, you can use one of the following methods available in `trivet::Parser`.

| Method                                                         | Use                                                                                                                                                                                                                                    |
| -------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `parse_string_match_delimiter() -> ParseResult<String>`        | Parse a delimited string. Assume the parser is currently _at the opening delimiter_ and that the end delimiter should match.                                                                                                           |
| `parse_string_until_delimiter(char) -> ParseResult<String>`    | Parse a delimited string. Assume the opening delimiter has _already been consumed_ and the closing delimiter is given by the argument.                                                                                                 |
| `parse_string(&str) -> ParseResult<String>`                    | Parse the provided string. This method creates a new `Parser` instance around the provided string and parses it. It is useful if you have some other method for capturing the string content and then want to parse what you captured. |
| `parse_string_match_delimiter_ws() -> ParseResult<String>`     | Parse a delimited string. Assume the parser is currently _at the opening delimiter_ and the end delimiter should match. Consume any trailing whitespace.                                                                               |
| `parse_string_until_delimiter_ws(char) -> ParseResult<String>` | Parse a delimited string. Assume the opening delimiter has _already been consumed_ and the closing delimiter is given by the argument. Consume any trailing whitespace.                                                                |

The key difference between these methods is whether the opening delimiter has already been consumed. For `parse_string_match_delimiter()`, do _not_ consume the opening delimiter, because the method uses it to determine the closing delimiter. For `parse_string_until_delimiter(char)`, you _should_ consume the opening delimiter first; otherwise you will get an empty string. None of this matters for `parse_string(&str)`, which does not use delimiters.
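To make the difference concrete, here is a minimal standard-library sketch of the two delimiter conventions. It does not use Trivet: `until_delimiter` and `match_delimiter` are hypothetical stand-ins, escapes are ignored, and the closing delimiter is assumed to be the same character as the opening one.

```rust
use std::str::Chars;

/// Collect characters up to the closing delimiter and consume it.
/// Assumes the caller has already consumed the opening delimiter.
/// Hypothetical helper; escapes are ignored to keep the sketch short.
fn until_delimiter(input: &mut Chars<'_>, close: char) -> Option<String> {
    let mut out = String::new();
    for ch in input.by_ref() {
        if ch == close {
            return Some(out);
        }
        out.push(ch);
    }
    // The input ended before the closing delimiter was found.
    None
}

/// Consume the opening delimiter, then read until the same character
/// appears again. Assumes the input is positioned at the opening delimiter.
fn match_delimiter(input: &mut Chars<'_>) -> Option<String> {
    let open = input.next()?;
    until_delimiter(input, open)
}
```

Calling `match_delimiter` on `"hello" rest` positioned at the opening quote yields `hello` and leaves ` rest` unread. Calling `until_delimiter` with `'"'` at that same position sees the opening quote first and returns an empty string, which is the pitfall described above.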

The following is a short program to parse a series of strings from standard input and then print them out. The strings can be enclosed in double quotation marks, single quotation marks, or double-angle quotation marks (U+00AB '«' and U+00BB '»').

```rust,ignore
{{#include ../../examples/book_string_simple.rs}}
```

This might seem like a lot of code, but keep in mind that it transparently handles escapes in the strings.

A more sophisticated program for playing with strings and encodings can be found in the examples folder of the distribution in `stringy.rs`.

## String Parser

String parsing is actually performed by the struct `trivet::strings::StringParser`. A string parser is already installed in each instance of `trivet::Parser`, and you can obtain mutable access to it to configure it using `borrow_string_parser() -> &mut StringParser`.

| Method                                        | Use                                                       |
| --------------------------------------------- | --------------------------------------------------------- |
| `borrow_string_parser() -> &mut StringParser` | Obtain a mutable reference to the internal string parser. |

The available options are discussed in the [string standards](#string-standards) and [configuration](#configuration) sections.

## String Standards

Several settings control string parsing and are discussed in the [configuration section](#configuration). To simplify things, you can set everything at once by selecting a _string standard_.

| Standard | Use                                                |
| -------- | -------------------------------------------------- |
| Trivet   | Use the **Trivet** standard.                       |
| Python   | Try to parse strings in the same manner as Python. |
| Rust     | Try to parse strings in the same manner as Rust.   |
| JSON     | Try to parse strings in the same manner as JSON.   |
| C        | Try to parse strings[^c] in the same manner as C.  |

String standards are set on the `StringParser` instance. For example, the following will set the parser to the Rust standard.

```rust,ignore
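use trivet;

// The parser must be mutable: `borrow_string_parser` returns `&mut StringParser`.
let mut parser = trivet::parse_from_string("text");
// Switch every string option to the Rust standard in one call.
parser.borrow_string_parser().set(trivet::strings::StringStandard::Rust);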
```

String standards are provided by the `trivet::strings::StringStandard` enum. Once you have selected a standard, you can still modify any individual configuration setting.

## Configuration

**Trivet** string parsing is highly configurable. For instance, you can configure the following.

- Whether escape characters are processed.
- What character introduces an escape.
- The various escape meanings.
- Whether "surrogate pairs" are allowed in string encoding.
- How to handle illegal Unicode.
- How to handle undefined escape characters.

The following are the configuration options (except escapes, discussed below). These must be accessed through the `StringParser` instance.

| Option                                                        | Trivet                 | Python                 | Rust    | JSON                   | C                      |
| ------------------------------------------------------------- | ---------------------- | ---------------------- | ------- | ---------------------- | ---------------------- |
| `enable_escapes`<br>Whether to process escapes                | `true`                 | `true`                 | `true`  | `true`                 | `true`                 |
| `escape_char`<br>Character that introduces an escape          | `\`                    | `\`                    | `\`     | `\`                    | `\`                    |
| `unknown_escape_protocol`<br>Handling an unrecognized escape  | `LiteralEscape`        | `LiteralEscape`        | `Error` | `DropEscape`           | `LiteralEscape`        |
| `allow_surrogate_pairs`<br>Whether to allow surrogate pairs   | `true`                 | `false`                | `false` | `true`                 | `false`                |
| `illegal_unicode_protocol`<br>Handling illegal Unicode        | `ReplacementCharacter` | `ReplacementCharacter` | `Error` | `ReplacementCharacter` | `ReplacementCharacter` |
| `allow_octal_escapes`<br>Whether octal escapes are allowed    | `true`                 | `true`                 | `false` | `false`                | `true`                 |
| `octal_escapes_are_flexible`<br>Allow fewer than three digits | `true`                 | `true`                 | `false` | `false`                | `true`                 |

See the `trivet::strings::UnknownEscapeProtocol` and `trivet::strings::IllegalUnicodeProtocol` enums for their options.
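As an illustration of what the surrogate-pair and illegal-Unicode options govern, the following standard-library sketch converts numeric code points to characters. The `IllegalUnicode` enum and both functions are hypothetical stand-ins named after the table entries; how Trivet implements these options internally is not shown in this chapter.

```rust
/// Hypothetical stand-in for the two policies named in the table above.
#[derive(Clone, Copy)]
enum IllegalUnicode {
    ReplacementCharacter,
    Error,
}

/// Convert a numeric code point (for example, from `\u{2020}`) to a `char`.
/// Surrogates and values above U+10FFFF are not valid Rust `char` values.
fn code_point_to_char(code: u32, policy: IllegalUnicode) -> Result<char, u32> {
    match char::from_u32(code) {
        Some(ch) => Ok(ch),
        None => match policy {
            // Substitute U+FFFD REPLACEMENT CHARACTER for the bad value.
            IllegalUnicode::ReplacementCharacter => Ok('\u{FFFD}'),
            // Report the offending value to the caller.
            IllegalUnicode::Error => Err(code),
        },
    }
}

/// Combine a UTF-16 surrogate pair into a single code point.
/// Returns `None` if either half is outside its surrogate range.
fn combine_surrogates(high: u32, low: u32) -> Option<u32> {
    if (0xD800..=0xDBFF).contains(&high) && (0xDC00..=0xDFFF).contains(&low) {
        Some(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))
    } else {
        None
    }
}
```

For instance, the pair `0xD83D`, `0xDE00` combines to U+1F600, while `0xD83D` on its own is not a valid character and is either replaced with U+FFFD or reported as an error, depending on the policy.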

The escapes supported by each standard are given by supplying a `std::collections::BTreeMap<char, EscapeType>` instance. This maps each character to the escape type that it represents. For example, `n` is mapped to `EscapeType::Char('\n')`. Several escape types are supported. See the `trivet::strings::EscapeType` enum for details.

For example, the following is the escape specification for the **Trivet** string standard.

| Character | Escape Type                 |
| --------- | --------------------------- |
| \n        | `EscapeType::Discard`       |
| \         | `EscapeType::Char('\\')`    |
| '         | `EscapeType::Char('\'')`    |
| "         | `EscapeType::Char('"')`     |
| a         | `EscapeType::Char('\x07')`  |
| b         | `EscapeType::Char('\x08')`  |
| e         | `EscapeType::Char('\x1b')`  |
| f         | `EscapeType::Char('\x0c')`  |
| n         | `EscapeType::Char('\n')`    |
| r         | `EscapeType::Char('\r')`    |
| t         | `EscapeType::Char('\t')`    |
| v         | `EscapeType::Char('\x0b')`  |
| x         | `EscapeType::NakedByte`     |
| u         | `EscapeType::BracketU18`    |
| N         | `EscapeType::BracketUNamed` |
| ?         | `EscapeType::Char('?')`     |

Suppose we wished for `\d` to introduce a Unicode dagger symbol (U+2020 '†'). We could make that change as follows.

```rust,ignore
parser.borrow_string_parser()
    .escapes.insert('d', trivet::strings::EscapeType::Char('\u{2020}'));
```

Keep in mind that each call to `set(StringStandard)` resets all options, so select the standard first and apply any custom settings afterward.
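The table-driven lookup described in this section can be sketched with the standard library alone. The `Escape` enum below is a hypothetical, reduced stand-in for `trivet::strings::EscapeType` with only two variants, and `decode` is an illustrative helper rather than Trivet's implementation. Unknown escapes are treated as errors here, similar to the `Error` setting in the configuration table.

```rust
use std::collections::BTreeMap;

/// Hypothetical, reduced stand-in for `trivet::strings::EscapeType`.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Escape {
    /// Replace the escape with this character.
    Char(char),
    /// Drop the escape entirely (for example, an escaped newline).
    Discard,
}

/// Build a small table modelled on a few rows of the Trivet table above.
fn default_escapes() -> BTreeMap<char, Escape> {
    let mut map = BTreeMap::new();
    map.insert('\n', Escape::Discard);
    map.insert('\\', Escape::Char('\\'));
    map.insert('"', Escape::Char('"'));
    map.insert('n', Escape::Char('\n'));
    map.insert('t', Escape::Char('\t'));
    map
}

/// Decode `input`, treating `\` as the escape character.
/// Returns `Err` holding the offending character for an unknown escape.
fn decode(input: &str, escapes: &BTreeMap<char, Escape>) -> Result<String, char> {
    let mut out = String::new();
    let mut chars = input.chars();
    while let Some(ch) = chars.next() {
        if ch != '\\' {
            out.push(ch);
            continue;
        }
        // A trailing backslash has nothing to escape; keep it literally.
        let next = match chars.next() {
            Some(next) => next,
            None => {
                out.push('\\');
                break;
            }
        };
        // Look up the character after the backslash in the table.
        match escapes.get(&next) {
            Some(Escape::Char(c)) => out.push(*c),
            Some(Escape::Discard) => {}
            None => return Err(next),
        }
    }
    Ok(out)
}
```

With the default table, decoding `a\db` fails with `Err('d')`. After `escapes.insert('d', Escape::Char('\u{2020}'))`, the same input decodes to `a†b`, mirroring the dagger customization shown earlier.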

[^c]: "String" means something different for C than it does for Trivet, which relies on the Rust definition of UTF-8 encoded strings. For this reason there will be differences. In particular, C strings are really just null-terminated sequences of bytes, whereas Trivet strings can contain nulls and should be valid Unicode.