trivet 3.1.0

The trivet Parser Library
# Implementation

## Goals

**Trivet** is built with the following primary goals in mind.

- No external dependencies other than the Rust standard library.
- No extra compilation steps.
- Support all common platforms (Windows, Linux, Mac).

Secondary goals are the following.

- Be fast.
- Be developer friendly.

## History

**Trivet** is based on an older library originally implemented for a Java/Scala term rewriting library. The parser for the library was slow and a bit problematic because of the desire to add user-defined syntax in the rewriting language, so several alternative parsers were developed. **Trivet** descends from the winning design.

There is a C version of this library that has been used for some embedded systems and has since been made open source. It does not support UTF-8 (it is ASCII only) and is not actively maintained. Contact the author if you are interested in it.

## Structure

**Trivet** consists of two main user-facing modules.

- `trivet::parser` provides the `Parser` struct that performs most of the parsing.
- `trivet::decoder` exists to correctly decode characters from a `std::io::Read` source.

## Decoder

Users of the library should not need to interact with `trivet::decoder` directly, but it is made public because it is, on its own, interesting. It reads the first few bytes of the source and attempts to discover which of the following encodings is present.[^utf32]

- `UTF8` (Linux, Mac)
- `UTF16LE` (Windows)
- `UTF16BE` (maybe somewhere?)

After determining which encoding is correct from the byte order mark (BOM), or defaulting to the platform-appropriate one when no BOM is present, it decodes the byte stream into characters and provides them on demand to the parser via the `Decode` struct. This struct maintains a buffer and provides an iterator interface.
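The detection step described above can be sketched as follows. This is a minimal illustration and not the code in `trivet::decoder`: the `Encoding` enum and the `detect_encoding` and `platform_default` functions are hypothetical names, and the fallback (UTF-16LE on Windows, UTF-8 elsewhere) is an assumption taken from the platform hints in the list above. The BOM byte sequences themselves are the standard ones.

```rust
/// Encodings distinguished in this sketch (hypothetical type, not trivet's).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Inspect the first bytes of a source. Returns the detected encoding and
/// the number of BOM bytes that should be skipped before decoding.
pub fn detect_encoding(prefix: &[u8]) -> (Encoding, usize) {
    match prefix {
        // UTF-8 BOM: EF BB BF
        [0xEF, 0xBB, 0xBF, ..] => (Encoding::Utf8, 3),
        // UTF-16 little endian BOM: FF FE
        [0xFF, 0xFE, ..] => (Encoding::Utf16Le, 2),
        // UTF-16 big endian BOM: FE FF
        [0xFE, 0xFF, ..] => (Encoding::Utf16Be, 2),
        // No BOM: fall back to the platform default and skip nothing.
        _ => (platform_default(), 0),
    }
}

/// Assumed fallback: UTF-16LE on Windows, UTF-8 everywhere else.
pub fn platform_default() -> Encoding {
    if cfg!(windows) {
        Encoding::Utf16Le
    } else {
        Encoding::Utf8
    }
}
```

Returning the BOM length alongside the encoding lets the caller discard the mark so that it never reaches the parser as a character.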

The implementation of the decoder is based on three primary sources.

- The [Encoding Standard][]
- The UTF-8 decoding state machine created by [Bjoern Hoehrmann][Hoehrmann]
- The [Unicode Standard][], Section 3.9

## Parser

The `trivet::parser` module is the main user-focused module. It maintains its own circular lookahead buffer, called the _unwind buffer_. This buffer is filled by the private `Parser::shift` method, which is the _only method that ever writes to it_ and also the _only method that accesses the decoder_. Whenever an attempt is made to peek at a character that is not yet in the unwind buffer, `Parser::shift` is called to fill the buffer. As with most circular buffers, it tracks its contents with a pair of values: the index of the next character and the number of characters currently buffered.

The size of the buffer controls the maximum possible lookahead, which is given by `parser::MAX_LOOKAHEAD` and is currently 64 KiB. That should be enough for anybody.[^deny]
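To make the index-and-length bookkeeping concrete, here is a small sketch of a circular character buffer. It is an illustration only: the `Ring` type, its method names, and the tiny `CAPACITY` of 8 are all assumptions chosen for readability, and the real unwind buffer may be organized differently.

```rust
/// Toy capacity for illustration; far smaller than the real lookahead limit.
const CAPACITY: usize = 8;

/// Hypothetical circular buffer tracked by a start index and a length.
pub struct Ring {
    slots: [char; CAPACITY],
    next: usize, // index of the next character to be read
    len: usize,  // number of characters currently buffered
}

impl Ring {
    pub fn new() -> Self {
        Ring { slots: ['\0'; CAPACITY], next: 0, len: 0 }
    }

    /// Append a character at the logical end. Returns false when full.
    pub fn push(&mut self, ch: char) -> bool {
        if self.len == CAPACITY {
            return false;
        }
        // The write position wraps around the end of the array.
        let write = (self.next + self.len) % CAPACITY;
        self.slots[write] = ch;
        self.len += 1;
        true
    }

    /// Look at the n-th buffered character without removing it.
    pub fn get(&self, n: usize) -> Option<char> {
        if n < self.len {
            Some(self.slots[(self.next + n) % CAPACITY])
        } else {
            None
        }
    }

    /// Discard up to `n` characters from the front; returns how many were dropped.
    pub fn drop_front(&mut self, n: usize) -> usize {
        let n = n.min(self.len);
        self.next = (self.next + n) % CAPACITY;
        self.len -= n;
        n
    }
}
```

Because only `next` and `len` change when characters are dropped, discarding from the front costs the same regardless of how many characters are removed, and the capacity is a hard bound on how far ahead `get` can look.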

The parser keeps track of whether the underlying source (the decoder) has been exhausted, and signals end of file when the unwind buffer is also empty.

Characters are removed from the unwind buffer by the `Parser::consume_n` method, which is the _only method that drains the unwind buffer_. All other "consume" methods work through this one. This is also the _only method that updates the line and column numbers_ as characters are consumed.

The `Parser::peek` and `Parser::peek_n` methods access the unwind buffer, and may invoke `Parser::shift` to fill it. These are the _only methods that read from the unwind buffer_.

The methods listed above constitute all the primitives; all other parsing methods in the library are built from those.
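The division of labor among these primitives can be sketched with a toy parser. Everything here is hypothetical: `ToyParser` is not trivet's `Parser`, a plain `char` iterator stands in for the decoder, a `VecDeque` stands in for the circular unwind buffer, the signatures are guesses, and starting line and column at 1 is an assumption. The point is only to show one method filling the buffer, two reading it, and one draining it while tracking position.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the parser; not trivet's real type.
pub struct ToyParser<I: Iterator<Item = char>> {
    source: I,              // stands in for the decoder
    unwind: VecDeque<char>, // stands in for the unwind buffer
    exhausted: bool,        // true once the source has run dry
    pub line: usize,
    pub column: usize,
}

impl<I: Iterator<Item = char>> ToyParser<I> {
    pub fn new(source: I) -> Self {
        ToyParser { source, unwind: VecDeque::new(), exhausted: false, line: 1, column: 1 }
    }

    /// The only method that touches the source or writes to the buffer.
    fn shift(&mut self, want: usize) {
        while self.unwind.len() < want && !self.exhausted {
            match self.source.next() {
                Some(ch) => self.unwind.push_back(ch),
                None => self.exhausted = true,
            }
        }
    }

    /// Read the next character without consuming it.
    pub fn peek(&mut self) -> Option<char> {
        self.shift(1);
        self.unwind.front().copied()
    }

    /// Read up to `n` characters without consuming them.
    pub fn peek_n(&mut self, n: usize) -> String {
        self.shift(n);
        self.unwind.iter().take(n).collect()
    }

    /// The only method that drains the buffer and updates line and column.
    pub fn consume_n(&mut self, n: usize) {
        self.shift(n);
        for _ in 0..n {
            match self.unwind.pop_front() {
                Some('\n') => {
                    self.line += 1;
                    self.column = 1;
                }
                Some(_) => self.column += 1,
                None => break, // fewer than `n` characters remained
            }
        }
    }

    /// End of file: the source is exhausted and nothing is left to peek at.
    pub fn is_at_eof(&mut self) -> bool {
        self.peek().is_none()
    }
}
```

Keeping position tracking inside the single draining method means that any higher-level "consume" helper built on it gets correct line and column numbers without extra work.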

[Hoehrmann]: https://bjoern.hoehrmann.de/utf-8/decoder/dfa/
[Unicode Standard]: http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf#G7404
[Encoding Standard]: https://encoding.spec.whatwg.org/

[^utf32]: The UTF-32 big and little endian encodings are very rare and not implemented.

[^deny]: The author will deny he ever said this.