Crate lexer_rs


Lexer library

This library provides a generic mechanism for parsing data into streams of tokens.

This is commonly used in human-readable language compilers and interpreters, to convert from a text stream into values that can then be parsed according to the grammar of that language.

A simple example would be for a calculator that operates on a stream of numbers and mathematical symbols; the first step of processing that the calculator must do is to convert the text stream into abstract tokens such as ‘the number 73’ and ‘the plus sign’. Once the calculator has such tokens it can piece them together into a real expression that it can then evaluate.

Basic concept

The basic concept of a lexer is to convert a stream of (e.g.) char into a stream of ‘Token’, a type which will be specific to the lexer. The lexer starts at the beginning of the text and moves through it, consuming characters and producing tokens.

Lexer implementations

A lexer is not difficult to implement, and there are many alternative approaches to doing so. A very simple approach for a String would be to have a loop that matches the start of the string with possible token values (perhaps using a regular expression), and on finding a match it can ‘trim’ the front of the String, yield the token, and then loop again.
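The simple trim-and-loop approach described above can be sketched as follows. This is an illustration only and does not use this crate: the `Token` enum and the `match_token` and `tokenize` functions are hypothetical names chosen for the calculator example, and the front of the string is matched by hand rather than with a regular expression.

```rust
/// Hypothetical token type for the calculator example.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Token {
    Number(u64),
    Plus,
    Minus,
    Times,
    Divide,
}

/// Try to match one token at the start of `text`.
/// Returns the token and the number of bytes it occupied, or `None` if
/// nothing matches (or the number does not fit in a `u64`).
fn match_token(text: &str) -> Option<(Token, usize)> {
    let first = text.chars().next()?;
    match first {
        '+' => Some((Token::Plus, 1)),
        '-' => Some((Token::Minus, 1)),
        '*' => Some((Token::Times, 1)),
        '/' => Some((Token::Divide, 1)),
        '0'..='9' => {
            // ASCII digits are one byte each, so the count is also the byte length.
            let len = text.bytes().take_while(|b| b.is_ascii_digit()).count();
            let value = text[..len].parse::<u64>().ok()?;
            Some((Token::Number(value), len))
        }
        _ => None,
    }
}

/// Tokenize by repeatedly matching the front of the string and trimming it.
/// On failure, returns the byte offset of the first text that could not be matched.
pub fn tokenize(text: &str) -> Result<Vec<Token>, usize> {
    let mut tokens = Vec::new();
    let mut rest = text.trim_start();
    while !rest.is_empty() {
        let (token, len) = match_token(rest).ok_or(text.len() - rest.len())?;
        tokens.push(token);
        // 'Trim' the matched token (and any following whitespace) off the front.
        rest = rest[len..].trim_start();
    }
    Ok(tokens)
}
```

Note that the only error information this approach can offer is a byte offset, which is part of the motivation for the position tracking described below.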

This library provides an implementation option that makes it possible to give good error messages when things go wrong. It provides a trait that abstracts the lexer from the consumer (so that one can get streams of tokens from a String, a BufRead, etc.), and it provides the infrastructure for any lexer using a simple mechanism for parsing tokens.

Positions in files

The crate provides some mechanisms for tracking the position of parsing within a stream, so that error messages can be appropriately crafted for the end user.

At a minimum, tracking the position means following the byte offset within the file; additionally, the line number and column number can be tracked. The UserPosn trait provides for this.

As Rust utilizes UTF8 encoded strings, not all byte offsets correspond to actual chars in a stream, and the column separation between two characters is not the difference between their byte offsets. The PosnInCharStream adds to the UserPosn trait to manage this.
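The difference between byte offsets and columns can be seen using only the standard library. In the sketch below, `char_byte_offsets` and `column_of` are hypothetical helper names for illustration and are not part of this crate; columns are counted in chars, starting at 1.

```rust
/// Byte offset at which each char of `text` starts.
pub fn char_byte_offsets(text: &str) -> Vec<usize> {
    text.char_indices().map(|(ofs, _)| ofs).collect()
}

/// Column (counted in chars, starting at 1) of the position `byte_ofs` in `line`.
/// Returns `None` if `byte_ofs` is past the end of the line or falls in the
/// middle of a multi-byte char.
pub fn column_of(line: &str, byte_ofs: usize) -> Option<usize> {
    if !line.is_char_boundary(byte_ofs) {
        return None;
    }
    Some(line[..byte_ofs].chars().count() + 1)
}
```

For the text "héllo" the 'é' occupies two bytes, so the second 'l' is at byte offset 4 but in column 4 only because columns start at 1; the first 'l' is at byte offset 3 and column 3, and byte offset 2 is not a valid char position at all.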

The bare minimum for a lexer handling UTF8-encoded strings does not require tracking of lines and columns; only the byte offset has to be tracked. Using a usize as the PosnInCharStream implementation provides for this (as the byte offset within a str).

The Lexer trait thus has an associated stream position type (its ‘State’): this must be lightweight as it is moved around and copied frequently, and must be static.
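To illustrate what 'lightweight' means here, the following is a sketch of a position type that is small, Copy, and owns no borrowed data (so it is 'static). The `Pos` struct and its methods are hypothetical and are not the crate's own position types; they only show the kind of value that is cheap to move around and copy.

```rust
/// Hypothetical stream position: three machine words, `Copy`, no borrowed data.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Pos {
    /// Byte offset from the start of the text.
    pub byte_ofs: usize,
    /// Line number, starting at 1.
    pub line: usize,
    /// Column number in chars, starting at 1.
    pub column: usize,
}

impl Pos {
    /// Position of the first char of a text.
    pub fn start() -> Self {
        Pos { byte_ofs: 0, line: 1, column: 1 }
    }

    /// Position after consuming `ch`; the original position is left untouched
    /// because `Pos` is `Copy`.
    pub fn advance(self, ch: char) -> Self {
        let byte_ofs = self.byte_ofs + ch.len_utf8();
        if ch == '\n' {
            Pos { byte_ofs, line: self.line + 1, column: 1 }
        } else {
            Pos { byte_ofs, line: self.line, column: self.column + 1 }
        }
    }
}
```

Because advancing returns a new value, a token parser can try a match from a saved position and simply discard the result on a mismatch, with no state to restore.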

Tokens

The token type that the Lexer trait produces from its parsing is supplied by the client; this is normally a simple enumeration.

The parsing is managed by the Lexer with the client providing a slice of matching functions; each matching function is applied in turn, and the first that returns an Ok of a Some of a token yields the token and advances the parsing state. The parsers can generate an error if they detect a real error in the stream (not just a mismatch to their token type).
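The first-match-wins dispatch described above can be sketched as follows. This is not the crate's API: the `Tok` and `LexError` types, the `MatchFn` signature, and the `next_token` function are assumptions made for this illustration, with a plain byte offset standing in for the stream position. Returning Ok(None) at the end of the text is also a choice made for the sketch.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Tok {
    Number(u32),
    Plus,
}

/// Hypothetical error type carrying the byte offset of the failure.
#[derive(Debug, PartialEq)]
pub struct LexError {
    pub byte_ofs: usize,
}

/// A matching function returns `Ok(Some((next_offset, token)))` on a match,
/// `Ok(None)` on a mismatch, and `Err` on a real error in the stream.
/// `ofs` must be a char boundary within the text.
pub type MatchFn = fn(&str, usize) -> Result<Option<(usize, Tok)>, LexError>;

pub fn match_plus(text: &str, ofs: usize) -> Result<Option<(usize, Tok)>, LexError> {
    if text[ofs..].starts_with('+') {
        Ok(Some((ofs + 1, Tok::Plus)))
    } else {
        Ok(None)
    }
}

pub fn match_number(text: &str, ofs: usize) -> Result<Option<(usize, Tok)>, LexError> {
    let len = text[ofs..].bytes().take_while(|b| b.is_ascii_digit()).count();
    if len == 0 {
        return Ok(None); // not a number: a mismatch, not an error
    }
    // Digits that do not fit in a u32 are a real error, not a mismatch.
    match text[ofs..ofs + len].parse::<u32>() {
        Ok(n) => Ok(Some((ofs + len, Tok::Number(n)))),
        Err(_) => Err(LexError { byte_ofs: ofs }),
    }
}

/// Apply each matcher in turn; the first `Ok(Some(..))` wins.
/// If no matcher accepts the char at `ofs`, that is reported as an error.
pub fn next_token(
    text: &str,
    ofs: usize,
    matchers: &[MatchFn],
) -> Result<Option<(usize, Tok)>, LexError> {
    if ofs >= text.len() {
        return Ok(None); // end of text
    }
    for matcher in matchers {
        if let Some(found) = matcher(text, ofs)? {
            return Ok(Some(found));
        }
    }
    Err(LexError { byte_ofs: ofs })
}
```

The order of the slice matters: where two matchers could accept the same text (for example a keyword and a general identifier), the more specific one should come first.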

Error reporting

With the file position handling used within the Lexer it is possible to display contextual error information - so if the whole text is retained by the Lexer then an error can be displayed with the text from the source with the error point/region highlighted.

Support for this is provided by the FmtContext trait, which is implemented in particular for LexerOfString.
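As an illustration of the kind of output that retaining the text makes possible, the sketch below highlights an error point given the whole text and a byte offset. It does not use FmtContext; `show_error` is a hypothetical function, the output layout is an arbitrary choice, and the caret alignment assumes every char is displayed one column wide.

```rust
/// Render the line of `text` containing `byte_ofs`, with a caret under the
/// char at that offset.
pub fn show_error(text: &str, byte_ofs: usize) -> String {
    // Clamp to the end of the text, then step back to a char boundary so that
    // slicing cannot panic.
    let mut ofs = byte_ofs.min(text.len());
    while !text.is_char_boundary(ofs) {
        ofs -= 1;
    }
    // The line runs from just after the previous newline to the next newline.
    let line_start = text[..ofs].rfind('\n').map_or(0, |i| i + 1);
    let line_end = text[ofs..].find('\n').map_or(text.len(), |i| ofs + i);
    // Line numbers start at 1; the caret indent is counted in chars, not bytes.
    let line_no = text[..ofs].matches('\n').count() + 1;
    let indent = text[line_start..ofs].chars().count();
    format!(
        "{:>4} | {}\n     | {}^",
        line_no,
        &text[line_start..line_end],
        " ".repeat(indent)
    )
}
```

Calling `show_error("1 + 2\n3 ? 4\n", 8)` yields the second line of the text with a caret under the '?'. This version rescans the text for newlines on every call; a type that records the line starts once, when the text is first stored, avoids that cost.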


Structs

  • A Lexer of a str, using an arbitrary stream position type, lexer token, and lexer error.
  • This provides a type that wraps an allocated String, and which tracks the lines within the string. It then provides a method to create a LexerOfStr that borrows the text, and which can then be used as a crate::Lexer.
  • A line and column within a text stream
  • An iterator over a Lexer presenting the parsed Tokens from it
  • A simple implementation of a type supporting LexerError
  • This provides the byte offset of a character within a stream, with an associated position that might also accurately provide line and column numbers of the position
  • This provides a span between two byte offsets within a stream; the start and end have an associated position that might also accurately provide line and column numbers

Traits

  • The CharStream trait allows a stream of char to provide extra methods
  • This trait is provided by types that wish to support context for (e.g.) error messages
  • The Lexer trait is provided by stream types that support parsing into tokens.
  • A trait required of an error within a Lexer - a char that does not match any token parser must return an error, and this trait requires that such an error be provided
  • Trait for location within a character stream
  • Trait for location within a stream

Type Aliases