Crate enso_flexer

This module exports the API for defining a simple lexer based on a deterministic finite state automaton.

Lexers defined using the Flexer are capable of lexing languages of significant complexity, and while they’re defined in a simple way (akin to regular grammars), they can work even with context-sensitive languages.

The process of defining a lexer involves the following steps:

  1. Creating a Lexer type that wraps the Flexer.
  2. Creating a State type, to hold the user-defined lexing state.
  3. Implementing State for the State type.
  4. Implementing Definition for the Lexer, including the transition rules used to lex the language.

The result of defining a lexer using the flexer is a hybrid of the code you write using this library and the code that this library generates to specialize your lexer.

Writing a Lexer

As the Flexer is a library for writing lexers, it would be remiss of us not to include a worked example of how to define one. The following example defines a lexer for a small language, and shows how to integrate the flexer's code generation step with your project's build.

The Language

We’re going to define a lexer for a very simple language, represented by the following EBNF grammar.

a-word      = 'a'+;
b-word      = 'b'+;
word        = a-word | b-word;
space       = ' ';
spaced-word = space, word;
language    = word, spaced-word*;
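
For example, "aaa" and "aa bbb b" are sentences of this language, while "ab" (mixed letters) and " aa" (a leading space) are not.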

The Lexer’s Output

Every lexer needs the ability to write a stream of tokens as its output. A flexer-based lexer can use any type that it wants as its output type, but this language is going to use a very simple Token type, wrapped into a TokenStream.

#[derive(Clone)]
pub enum Token {
    /// A word from the input, consisting of a sequence of all `a` or all `b`.
    Word(String),
    /// A token that the lexer is unable to recognise.
    Unrecognized(String)
}

#[derive(Clone,Default)]
pub struct TokenStream {
    tokens:Vec<Token>
}

impl TokenStream {
    pub fn push(&mut self,token:Token) {
        self.tokens.push(token)
    }
}

These tokens will be inserted into the token stream by our lexer as it recognises valid portions of our language.
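
For instance, pushing a recognised word into a fresh stream looks like this (using only the types defined above):

let mut stream = TokenStream::default();
stream.push(Token::Word(String::from("aaa")));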

Whatever you choose as the Output type of your lexer, it will need to implement both std::clone::Clone and std::default::Default.

The Lexer’s State

Every Flexer-based lexer operates over a state that holds all of the user-defined state information required to define the particular lexer. This state type must conform to the State trait, which defines important functionality that it must provide to the flexer.

In our language, we only want to match words with a preceding space character once we've seen an initial word that doesn't have one. To this end, we need a state in our lexer to record that we've 'seen' the first word. As required by the State trait, we also need to provide the flexer with an initial state, the state registry, and the bookmarks we use.

use enso_flexer::group;
use enso_flexer::prelude::reader::BookmarkManager;
use enso_flexer::State;


// === LexerState ===

#[derive(Debug)]
pub struct LexerState {
    /// The registry for groups in the lexer.
    lexer_states:group::Registry,
    /// The initial state of the lexer.
    initial_state:group::Identifier,
    /// The state entered when the first word has been seen.
    seen_first_word_state:group::Identifier,
    /// The bookmarks for this lexer.
    bookmarks:BookmarkManager
}

The flexer library provides useful functionality to help with defining your lexer state, such as group::Registry for containing the various states through which your lexer may transition, and prelude::reader::BookmarkManager for storing bookmarks.

Bookmarks

In order to enable arbitrary lookahead, the flexer provides a system for "bookmarking" a point in the input stream so that the lexer may return to it later. In fact, this mechanism is used internally to deal with overlapping rules, and so the prelude::reader::BookmarkManager provides some bookmarks for you by default.

As a user, however, you can define additional bookmarks as part of your state, and mark or return to them as part of your lexer’s transition functions (more on this below).
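
As a purely hypothetical sketch (the method names add_bookmark, bookmark, and rewind, and the my_bookmark field, are assumptions made for illustration; consult prelude::reader::BookmarkManager for the concrete API), a transition function using a custom bookmark might look like this:

pub fn on_speculative_rule<R:ReaderOps>(&mut self, reader:&mut R) {
    let bookmark = self.my_bookmark;          // A bookmark created in `State::new()`.
    self.bookmarks.bookmark(bookmark,reader); // Mark the current input position.
    // ... lex ahead speculatively ...
    self.bookmarks.rewind(bookmark,reader);   // Return the reader to the mark.
}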

Now that we have our state type, we need to define an implementation of State for it. This is a mostly trivial exercise, but two functions (State::new() and State::specialize()) require special attention. We'll look at both below.

use enso_flexer::generate;
use enso_flexer::generate::GenError;
use enso_flexer::prelude::AnyLogger;

impl enso_flexer::State for LexerState {
    fn new(_logger:&impl AnyLogger) -> Self {
        // Here we construct all of the elements needed for our lexer state. This function can
        // contain arbitrarily complex logic and is only called once at initialization time.
        let mut lexer_states      = group::Registry::default();
        let initial_state         = lexer_states.define_group("ROOT",None);
        let seen_first_word_state = lexer_states.define_group("SEEN FIRST WORD",None);
        let bookmarks             = BookmarkManager::new();
        Self{lexer_states,initial_state,seen_first_word_state,bookmarks}
    }

    fn initial_state(&self) -> group::Identifier {
        self.initial_state
    }

    fn groups(&self) -> &group::Registry {
        &self.lexer_states
    }

    fn groups_mut(&mut self) -> &mut group::Registry {
        &mut self.lexer_states
    }

    fn bookmarks(&self) -> &BookmarkManager {
        &self.bookmarks
    }

    fn bookmarks_mut(&mut self) -> &mut BookmarkManager {
        &mut self.bookmarks
    }

    fn specialize(&self) -> Result<String,GenError> {
        // It is very important to pass both the type name of your lexer and your output
        // correctly here. This function should always be implemented as a call to the
        // below-used function.
        generate::specialize(self,"TestLexer","Token")
    }
}

Defining the Lexer Type

With our state type defined, we now have the prerequisites for defining the lexer itself!

The notion behind the way we define lexers in the flexer is to use a chain of std::ops::Deref implementations to make the disparate parts feel like a cohesive whole. The Flexer itself already implements deref to your state type, so all that remains is to do the following:

  1. Define your lexer struct itself, containing an instance of the Flexer, parametrised by your state and output types.

use enso_flexer::Flexer;
use enso_flexer::prelude::logger::Disabled;

type Logger = Disabled;


// === Lexer ===

pub struct Lexer {
    lexer:Flexer<LexerState,TokenStream,Logger>
}

You’ll note that the Flexer also takes a logging implementation from the Enso logging library as a type parameter. This lets the client of the library configure the behaviour of logging in their lexer. We recommend aliasing the current logger type (as shown above) for ease of use.

  2. Implement a new() function for your lexer.

impl Lexer {
    pub fn new() -> Self {
        let lexer = Flexer::new(Logger::new("Lexer"));
        Lexer{lexer}
    }
}

  3. Define std::ops::Deref and std::ops::DerefMut for your lexer.

use std::ops::Deref;
use std::ops::DerefMut;

impl Deref for Lexer {
    type Target = Flexer<LexerState,TokenStream,Logger>;
    fn deref(&self) -> &Self::Target {
        &self.lexer
    }
}

impl DerefMut for Lexer {
    fn deref_mut(&mut self) -> &mut Self::Target {
        &mut self.lexer
    }
}

You’ll note that here we’ve instantiated the flexer with a Logger. This is used for providing debug information during development, and can be accessed from all scopes of your lexer. In release mode, however, logging calls at the “trace”, “debug”, and “info” levels are optimised away.

Defining the Lexing Rules

Flexer-based lexers operate by matching on a series of automata::pattern::Patterns that describe the language being lexed. The lexer combines these patterns with "transition functions" that may execute arbitrary code when a pattern matches on the lexer's input.

In order to define the lexing rules, we need to implement Definition for our lexer, particularly the Definition::define() function.

use enso_flexer::automata::pattern::Pattern;
use enso_flexer::group::Registry;
use enso_flexer;

impl enso_flexer::Definition for Lexer {
    fn define() -> Self {
        // First we instantiate our lexer. Definitions take place _directly_ on the lexer, and
        // manipulate runtime state.
        let mut lexer = Self::new();

        // Then, we define the patterns that we're going to use. For an overview of the
        // pattern APIs, please see the automata::pattern module.
        let a_word        = Pattern::char('a').many1();
        let b_word        = Pattern::char('b').many1();
        let space         = Pattern::char(' ');
        let spaced_a_word = &space >> &a_word;
        let spaced_b_word = &space >> &b_word;
        let any           = Pattern::any();
        let end           = Pattern::eof();

        // Next, we define groups of lexer rules. This uses the groups that we've defined in our
        // lexer's state, and the patterns we've defined above.
        let root_group_id = lexer.initial_state;
        let root_group    = lexer.groups_mut().group_mut(root_group_id);
        root_group.create_rule(&a_word,"self.on_first_word(reader)");
        root_group.create_rule(&b_word,"self.on_first_word(reader)");
        root_group.create_rule(&end,   "self.on_no_err_suffix_first_word(reader)");
        root_group.create_rule(&any,   "self.on_err_suffix_first_word(reader)");

        let seen_first_word_group_id = lexer.seen_first_word_state;
        let seen_first_word_group    = lexer.groups_mut().group_mut(seen_first_word_group_id);
        seen_first_word_group.create_rule(&spaced_a_word,"self.on_spaced_word(reader)");
        seen_first_word_group.create_rule(&spaced_b_word,"self.on_spaced_word(reader)");
        seen_first_word_group.create_rule(&end,          "self.on_no_err_suffix(reader)");
        seen_first_word_group.create_rule(&any,          "self.on_err_suffix(reader)");

        lexer
    }

    /// This function just returns the lexer's groups.
    fn groups(&self) -> &Registry {
        self.lexer.groups()
    }

    /// Code you want to run before lexing begins.
    fn set_up(&mut self) {}

    /// Code you want to run after lexing finishes.
    fn tear_down(&mut self) {}
}

Transition Functions

You may be wondering why the transition functions are specified as strings. This allows us to generate highly efficient, specialized code for your lexer once you define it. More on this later.

A group::Group in the lexer is like a state that operates on a stack. A transition function can arbitrarily activate or deactivate a group on the flexer’s stack, allowing you to perform context-sensitive lexing behaviour. For more information (including on how to use parent groups to inherit rules), see the relevant module.

For more information on the automata::pattern::Pattern APIs used above, please see the relevant module in this crate.
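
As a brief illustration (the >> sequencing operator and many1 appear in the definition above; treating | as the alternative combinator is an assumption based on the grammar's word = a-word | b-word rule):

let a_word = Pattern::char('a').many1();
let b_word = Pattern::char('b').many1();
let word   = a_word | b_word;    // word = a-word | b-word;
let space  = Pattern::char(' ');
let spaced = &space >> &word;    // spaced-word = space, word;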

Defining the Transition Functions

You'll have noticed that, up above, we told the rules to use a bunch of transition functions that we've not yet talked about. These functions can be defined anywhere you like, as long as they are in scope in the file in which you define your lexer. We do, however, recommend defining them on your lexer itself so that they can access and manipulate lexer state, and that's what we're going to do here.

use enso_flexer::prelude::ReaderOps;

impl Lexer {
    pub fn on_first_word<R:ReaderOps>(&mut self, _reader:&mut R) {
        let str = self.current_match.clone();
        let ast = Token::Word(str);
        self.output.push(ast);
        let id = self.seen_first_word_state;
        self.push_state(id);
    }

    pub fn on_spaced_word<R:ReaderOps>(&mut self, _reader:&mut R) {
        let str = self.current_match.clone();
        let ast = Token::Word(String::from(str.trim()));
        self.output.push(ast);
    }

    pub fn on_err_suffix_first_word<R:ReaderOps>(&mut self, _reader:&mut R) {
        let ast = Token::Unrecognized(self.current_match.clone());
        self.output.push(ast);
    }

    pub fn on_err_suffix<R:ReaderOps>(&mut self, reader:&mut R) {
        self.on_err_suffix_first_word(reader);
        self.pop_state();
    }

    pub fn on_no_err_suffix_first_word<R:ReaderOps>(&mut self, _reader:&mut R) {}

    pub fn on_no_err_suffix<R:ReaderOps>(&mut self, reader:&mut R) {
        self.on_no_err_suffix_first_word(reader);
        self.pop_state();
    }
}

Magic Transition Functions

The transition functions are the ‘secret sauce’, so to speak, of the Flexer. They are called when a rule matches, and allow arbitrary code to manipulate the lexer. This means that the flexer can be used to define very complex grammars while still keeping a simple interface and ensuring performant execution.

You’ll note that all of these functions have a couple of things in common:

  1. They have a type parameter R that conforms to the prelude::ReaderOps trait.
  2. They take an argument of type R, that is the reader over which the lexer is running as the first non-self argument to the function.
  3. Any additional arguments must be valid in the scope in which the specialisation rules are going to be generated.

All of these properties, combined, allow the transition functions to manipulate the text being read by the lexer.

Specializing the Lexer

In order to actually use the lexer that you’ve defined, you need to specialize it to the rules that you define. Unfortunately, cargo doesn’t have support for post-build hooks, and so this is a little more involved than we’d like it to be.

  1. Create a file that defines the lexer as above. The definition can span multiple files in its crate, as long as they are publicly exposed.
  2. Create a separate cargo project with a build script (build.rs).
  3. In that build.rs, you need to (a sketch follows this list):
    1. Import the lexer definition and instantiate it using ::define().
    2. Call State::specialize() on the resultant lexer. This will generate a string that contains the optimised lexer implementation.
    3. Write both the generated code and the code from the original lexer definition into an output file.
  4. Re-export this output file from your cargo project’s lib.rs.
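
A minimal sketch of such a build.rs, assuming the definition above lives in a hypothetical lexer_definition crate and using an illustrative output path:

use enso_flexer::Definition;
use enso_flexer::State;
use lexer_definition::Lexer;

use std::fs::File;
use std::io::Write;

fn main() {
    // Instantiate the lexer definition and generate the specialised code.
    let lexer = Lexer::define();
    let code  = lexer.specialize().expect("Lexer specialization failed.");
    // Write the generated implementation to the file that lib.rs re-exports.
    // The code from the original lexer definition should be written alongside
    // it (elided here).
    let mut file = File::create("src/generated/lexer.rs").expect("Cannot create output file.");
    file.write_all(code.as_bytes()).expect("Cannot write the specialized lexer.");
}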

The process of specialization will generate quite a bit of code, but most importantly it will generate pub fn run<R:LazyReader>(&mut self, mut reader:R) -> Result<Output>, where Output is your lexer's token type. All of these functions are defined on your lexer type (the one whose name is provided to specialize()).
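
Driving the specialized lexer then looks roughly as follows (the reader and decoder names here are assumptions; see prelude::reader for the concrete types):

let mut lexer = Lexer::new();
let reader    = Reader::new("aa bb".as_bytes(), DecoderUTF8());
let result    = lexer.run(reader);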

In Summary

The flexer allows its clients to define highly optimised lexer implementations for languages of significant complexity.

Re-exports

pub use enso_automata as automata;

Modules

generate

This module contains utilities for generating Rust code from lexer definitions, allowing the flexer to be specialised for a specific language.

group

This module provides an API for grouping multiple flexer rules.

prelude

Useful libraries for working with the flexer.

Macros

char

Quote a character as a character pattern.

literal

Quote a string as a literal pattern.

Structs

Flexer

The flexer is an engine for generating lexers.

LexingResult

The result of executing the lexer on a given input.

SubStateId

An identifier for a sub-state of the lexer to transition to.

Enums

ResultKind

The kind of lexer result.

StageStatus

The result of executing a single step of the DFA.

Traits

Definition

Allows for the definition of flexer-based lexers.

State

Contains the state needed by the flexer from a lexer implementation.