# Parsing Strings
By default, **Trivet** can process strings containing simple escapes like `\n`, ASCII escapes like `\x0d`, Unicode escapes like `\u{2020}`, and named Unicode escapes like `\N{dagger}`. The string parser is highly configurable.
To parse a string starting at the current position in the parse, you can use one of the following methods available in `trivet::Parser`.
| Method | Use |
| -------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `parse_string_match_delimiter() -> ParseResult<String>` | Parse a delimited string. Assume the parser is currently _at the opening delimiter_ and that the end delimiter should match. |
| `parse_string_until_delimiter(char) -> ParseResult<String>` | Parse a delimited string. Assume the opening delimiter has _already been consumed_ and the closing delimiter is given by the argument. |
| `parse_string(&str) -> ParseResult<String>` | Parse the provided string. This method creates a new `Parser` instance around the provided string and parses it. It is useful if you have some other method for capturing the string content and then want to parse what you captured. |
| `parse_string_match_delimiter_ws() -> ParseResult<String>` | Parse a delimited string. Assume the parser is currently _at the opening delimiter_ and the end delimiter should match. Consume any trailing whitespace. |
| `parse_string_until_delimiter_ws(char) -> ParseResult<String>` | Parse a delimited string. Assume the opening delimiter has _already been consumed_ and the closing delimiter is given by the argument. Consume any trailing whitespace. |
The key difference between these methods is whether the opening delimiter should already have been consumed. For `parse_string_match_delimiter()`, do _not_ consume the opening delimiter, because the method uses it to determine the closing delimiter. For `parse_string_until_delimiter(char)`, you _must_ consume the opening delimiter first, or you will get an empty string. The `_ws` variants follow the same rules. None of this matters for `parse_string(&str)`, which does not use delimiters.
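The difference between the two delimiter conventions can be illustrated with a small standalone sketch. This is not **Trivet**'s implementation: the functions `match_delimiter` and `until_delimiter` are hypothetical names, they work on a plain `std::str::Chars` iterator, they ignore escapes entirely, and they assume the closing delimiter is the same character as the opening one.

```rust
use std::str::Chars;

/// Hypothetical helper: read up to and including `close`, returning the text
/// before it. Assumes the opening delimiter was already consumed.
/// Returns `None` if the input ends before `close` is found.
fn until_delimiter(input: &mut Chars<'_>, close: char) -> Option<String> {
    let mut out = String::new();
    for ch in input.by_ref() {
        if ch == close {
            return Some(out);
        }
        out.push(ch);
    }
    None
}

/// Hypothetical helper: read the opening delimiter from the input and then
/// read until the same character appears again.
fn match_delimiter(input: &mut Chars<'_>) -> Option<String> {
    let open = input.next()?;
    until_delimiter(input, open)
}
```

With this sketch, calling `match_delimiter` on `"hello" rest` yields `hello` and leaves ` rest` unread. Calling `until_delimiter` with `'"'` on the same input _without_ first consuming the opening quotation mark stops at that first quotation mark and yields an empty string, which is the pitfall described above.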
The following is a short program to parse a series of strings from standard input and then print them out. The strings can be enclosed in double quotation marks, single quotation marks, or double-angle quotation marks (U+00AB '«' and U+00BB '»').
```rust,ignore
{{#include ../../examples/book_string_simple.rs}}
```
This might seem like a lot of code, but keep in mind that it transparently handles escapes in the strings.
A more sophisticated program for playing with strings and encodings can be found in the examples folder of the distribution in `stringy.rs`.
## String Parser
String parsing is actually performed by the struct `trivet::strings::StringParser`. A string parser is already installed in each instance of `trivet::Parser`, and you can obtain mutable access to it to configure it using `borrow_string_parser() -> &mut StringParser`.
| Method | Use |
| --------------------------------------------- | --------------------------------------------------------- |
| `borrow_string_parser() -> &mut StringParser` | Obtain a mutable reference to the internal string parser. |
The available options are discussed in the [string standards](#string-standards) and [configuration](#configuration) sections.
## String Standards
Several settings control string parsing and are discussed in the [configuration section](#configuration). To simplify things, you can set everything at once by selecting a _string standard_.
| Standard | Use |
| -------- | -------------------------------------------------- |
| Trivet | Use the **Trivet** standard. |
| Python | Try to parse strings in the same manner as Python. |
| Rust | Try to parse strings in the same manner as Rust. |
| JSON | Try to parse strings in the same manner as JSON. |
| C | Try to parse strings[^c] in the same manner as C. |
String standards are set on the `StringParser` instance. For example, the following will set the parser to the Rust standard.
```rust,ignore
use trivet;
// The parser must be mutable because `borrow_string_parser()` returns `&mut StringParser`.
let mut parser = trivet::parse_from_string("text");
parser.borrow_string_parser().set(trivet::strings::StringStandard::Rust);
```
String standards are provided by the `trivet::strings::StringStandard` enum. Once you have selected a standard, you are free to modify any individual configuration setting.
## Configuration
**Trivet** string parsing is highly configurable. For instance, you can configure the following.
- Whether escape characters are processed.
- What character introduces an escape.
- The various escape meanings.
- Whether "surrogate pairs" are allowed in string encoding.
- How to handle illegal Unicode.
- How to handle undefined escape characters.
The following are the configuration options (except escapes, discussed below). These must be accessed through the `StringParser` instance.
| Option | Trivet | Python | Rust | JSON | C |
| ------------------------------------------------------------- | ---------------------- | ---------------------- | ------- | ---------------------- | ---------------------- |
| `enable_escapes`<br>Whether to process escapes | `true` | `true` | `true` | `true` | `true` |
| `escape_char`<br>Character that introduces an escape | `\` | `\` | `\` | `\` | `\` |
| `unknown_escape_protocol`<br>Handling an unrecognized escape | `LiteralEscape` | `LiteralEscape` | `Error` | `DropEscape` | `LiteralEscape` |
| `allow_surrogate_pairs`<br>Whether to allow surrogate pairs | `true` | `false` | `false` | `true` | `false` |
| `illegal_unicode_protocol`<br>Handling illegal Unicode | `ReplacementCharacter` | `ReplacementCharacter` | `Error` | `ReplacementCharacter` | `ReplacementCharacter` |
| `allow_octal_escapes`<br>Whether octal escapes are allowed | `true` | `true` | `false` | `false` | `true` |
| `octal_escapes_are_flexible`<br>Allow fewer than three digits | `true` | `true` | `false` | `false` | `true` |
See the `trivet::strings::UnknownEscapeProtocol` and `trivet::strings::IllegalUnicodeProtocol` enums for their options.
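For readers unfamiliar with surrogate pairs: some formats, JSON among them, write a code point above U+FFFF as two consecutive `\u` escapes, a high surrogate followed by a low surrogate. The sketch below shows the standard UTF-16 arithmetic for combining such a pair, and one way to substitute U+FFFD for a value that is not a valid `char`. The function names are hypothetical and this is only an illustration of the idea; it is not taken from **Trivet**'s source.

```rust
/// Combine a UTF-16 surrogate pair into a single code point.
/// Returns `None` if either value is outside its surrogate range.
fn combine_surrogates(high: u32, low: u32) -> Option<u32> {
    if (0xD800..=0xDBFF).contains(&high) && (0xDC00..=0xDFFF).contains(&low) {
        Some(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))
    } else {
        None
    }
}

/// Convert a code point to a `char`, substituting the replacement
/// character U+FFFD when the value is not a valid Unicode scalar value.
fn to_char_or_replacement(code_point: u32) -> char {
    char::from_u32(code_point).unwrap_or(char::REPLACEMENT_CHARACTER)
}
```

For example, the pair `\ud83d\ude00` combines to U+1F600, while a lone `\ud83d` is not a valid `char` on its own and would become U+FFFD under a replacement policy.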
The escapes supported by each standard are specified by a `std::collections::BTreeMap<char, EscapeType>` instance. This maps each character to the escape type that it represents. For example, `n` is mapped to `EscapeType::Char('\n')`. Several escape types are supported. See the `trivet::strings::EscapeType` enum for details.
For example, the following is the escape specification for the **Trivet** string standard.
| Character | Escape Type |
| --------- | --------------------------- |
| \n | `EscapeType::Discard` |
| \ | `EscapeType::Char('\\')` |
| ' | `EscapeType::Char('\'')` |
| " | `EscapeType::Char('"')` |
| a         | `EscapeType::Char('\x07')`  |
| b | `EscapeType::Char('\x08')` |
| e | `EscapeType::Char('\x1b')` |
| f | `EscapeType::Char('\x0c')` |
| n | `EscapeType::Char('\n')` |
| r | `EscapeType::Char('\r')` |
| t | `EscapeType::Char('\t')` |
| v | `EscapeType::Char('\x0b')` |
| x | `EscapeType::NakedByte` |
| u | `EscapeType::BracketU18` |
| N | `EscapeType::BracketUNamed` |
| ? | `EscapeType::Char('?')` |
Suppose we wished for `\d` to introduce a Unicode dagger symbol (U+2020 '†'). We could make that change as follows.
```rust,ignore
parser.borrow_string_parser()
.escapes.insert('d', trivet::strings::EscapeType::Char('\u{2020}'));
```
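To make the table-driven approach concrete, here is a minimal standalone sketch of escape expansion using a `BTreeMap`. It does not use **Trivet**: `EscapeKind` and `expand` are hypothetical stand-ins, only two kinds of escape are modeled, the escape character is fixed as a backslash, and unknown escapes are assumed to be kept literally.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for an escape type; this is not Trivet's `EscapeType`.
#[derive(Clone, Copy, Debug)]
enum EscapeKind {
    /// Replace the escape sequence with the given character.
    Char(char),
    /// Drop the escape sequence entirely.
    Discard,
}

/// Expand backslash escapes in `input` by looking up the character after
/// each backslash in `table`. Escapes missing from the table are copied
/// through unchanged (an assumption made for this sketch).
fn expand(input: &str, table: &BTreeMap<char, EscapeKind>) -> String {
    let mut out = String::new();
    let mut chars = input.chars();
    while let Some(ch) = chars.next() {
        if ch != '\\' {
            out.push(ch);
            continue;
        }
        match chars.next() {
            Some(next) => match table.get(&next) {
                Some(EscapeKind::Char(c)) => out.push(*c),
                Some(EscapeKind::Discard) => {}
                None => {
                    out.push('\\');
                    out.push(next);
                }
            },
            // A trailing backslash with nothing after it is kept as-is.
            None => out.push('\\'),
        }
    }
    out
}
```

In this sketch, `a\db` is returned unchanged until `'d'` is inserted into the table with `EscapeKind::Char('\u{2020}')`, after which it expands to `a†b`. A backslash followed by a newline mapped to `EscapeKind::Discard` disappears from the output.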
Keep in mind that each time you use `set(StringStandard)` to change the string standard, you reset all options.
[^c]: "String" means something different for C than it does for Trivet, which relies on the Rust definition of UTF-8 encoded strings. For this reason there will be differences. In particular, C strings are really just null-terminated sequences of bytes. In Trivet, strings can contain nulls and should be valid Unicode.