Crate sana
Sana is a crate that allows you to create lexers easily.
Example
use sana::{Sana, Spanned};

#[derive(Debug, Clone, Copy, PartialEq, Sana)]
enum Token {
    #[regex("[a-zA-Z_][a-zA-Z0-9_]*")]
    Ident,
    #[regex("[0-9]+")]
    Integer,

    #[token("let", priority = 1)]
    Let,
    #[token("=")]
    Equals,
    #[regex(";")]
    Semicolon,

    #[regex("[ \t\r\n]")]
    Whitespace,
    #[error]
    Error,
}

let mut lexer = Token::lexer("let answer = 42;");

assert_eq!(
    lexer.next(),
    Some(Spanned { value: Token::Let, start: 0, end: 3 })
);
assert_eq!(
    lexer.next(),
    Some(Spanned { value: Token::Whitespace, start: 3, end: 4 })
);
assert_eq!(
    lexer.next(),
    Some(Spanned { value: Token::Ident, start: 4, end: 10 })
);
assert_eq!(
    lexer.next(),
    Some(Spanned { value: Token::Whitespace, start: 10, end: 11 })
);
assert_eq!(
    lexer.next(),
    Some(Spanned { value: Token::Equals, start: 11, end: 12 })
);
assert_eq!(
    lexer.next(),
    Some(Spanned { value: Token::Whitespace, start: 12, end: 13 })
);
assert_eq!(
    lexer.next(),
    Some(Spanned { value: Token::Integer, start: 13, end: 15 })
);
assert_eq!(
    lexer.next(),
    Some(Spanned { value: Token::Semicolon, start: 15, end: 16 })
);

// No tokens left
assert_eq!(lexer.next(), None);
Lexers – how do they work?
A lexer is defined by a set of rules. A rule is a struct { regex, priority, action }, where regex is a regular expression, priority is the priority of the rule, and action is the token emitted when the rule matches.
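To make this concrete, here is a minimal sketch of a rule modeled as a plain Rust struct. Rule and the string representation of the action below are illustrative assumptions, not part of sana's public API; sana builds its rules from the #[regex], #[token], and priority attributes instead.

// Hypothetical model of a rule, for illustration only
struct Rule {
    // The regular expression of the rule, e.g. "[0-9]+"
    regex: &'static str,
    // Higher priority wins when several rules match the same string
    priority: u32,
    // The token emitted when this rule matches (here just a name)
    action: &'static str,
}

let let_rule = Rule { regex: "let", priority: 1, action: "Let" };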
We say that a string s matches a regular expression r if s belongs to the language (the set of strings) defined by r. For example, if r = [a-z], then the string "a" matches r, because "a" ∈ { "a", "b", ..., "z" }, but the string "9" does not match r.
Note that, unlike many regular expression implementations such as regex, we consider only full-string matches, not substring matches. So the string "abc" does not match [a-z].
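For illustration, full-string semantics can be emulated with the regex crate by anchoring the pattern. This is purely a demonstration of the semantics; sana compiles rules into its own DFA and does not depend on regex at runtime.

use regex::Regex;

// Check whether `s` as a whole matches `pattern` (illustrative helper)
fn full_match(pattern: &str, s: &str) -> bool {
    // \A and \z anchor the pattern to the start and end of the string
    let anchored = format!(r"\A(?:{})\z", pattern);
    Regex::new(&anchored).unwrap().is_match(s)
}

assert!(full_match("[a-z]", "a"));    // "a" ∈ { "a", ..., "z" }
assert!(!full_match("[a-z]", "9"));   // "9" is not in the language
assert!(!full_match("[a-z]", "abc")); // substring matches do not count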
A string s matches a rule set if the rule set contains at least one rule r such that r.regex matches s.
The prefix set of a string is the set of all non-empty prefixes of that string. For example, the prefix set of "let" is { "l", "le", "let" }.
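A sketch of computing the prefix set, representing it as a Vec of string slices (the prefix_set helper is a hypothetical name, not part of the crate):

// All non-empty prefixes of `s`, in order of increasing length
fn prefix_set(s: &str) -> Vec<&str> {
    s.char_indices()
        // Take the slice up to and including each character
        .map(|(i, c)| &s[..i + c.len_utf8()])
        .collect()
}

assert_eq!(prefix_set("let"), vec!["l", "le", "let"]);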
The longest match of a string s by a rule set R is the longest string s_max in the prefix set of s such that s_max matches R.

The job of the lexer is to find the longest match of the input by the rule set and emit the action of the rule that produced it. If more than one rule matches, the rule with the highest priority is selected.
Consider the example above. The longest match of "let answer = 42;" is "let". Two rules match this string:

{ regex: "[a-zA-Z_][a-zA-Z0-9_]*", priority: 0, action: Ident }
{ regex: "let", priority: 1, action: Let }

Of these two, the Let rule is selected because it has the higher priority.
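The sketch below is a naive reference implementation of this selection, reusing the anchored-regex trick from earlier. The Rule type and longest_match function are hypothetical, and the algorithm recomputes matches per prefix, so it is quadratic; the generated lexer instead walks a DFA, as described next.

use regex::Regex;

// Illustrative rule; `action` is the token name as a string for brevity
struct Rule { regex: &'static str, priority: u32, action: &'static str }

// Find the longest prefix of `input` matched by any rule, breaking
// ties by priority. Returns the winning action and the match length.
fn longest_match(rules: &[Rule], input: &str) -> Option<(&'static str, usize)> {
    let mut best: Option<(&'static str, u32, usize)> = None;
    for (i, c) in input.char_indices() {
        let prefix = &input[..i + c.len_utf8()];
        for rule in rules {
            let anchored = format!(r"\A(?:{})\z", rule.regex);
            if Regex::new(&anchored).unwrap().is_match(prefix) {
                let wins = match best {
                    None => true,
                    // A longer match always wins; equal length falls
                    // back to the rule's priority
                    Some((_, prio, len)) =>
                        prefix.len() > len
                            || (prefix.len() == len && rule.priority > prio),
                };
                if wins {
                    best = Some((rule.action, rule.priority, prefix.len()));
                }
            }
        }
    }
    best.map(|(action, _, len)| (action, len))
}

let rules = [
    Rule { regex: "[a-zA-Z_][a-zA-Z0-9_]*", priority: 0, action: "Ident" },
    Rule { regex: "let", priority: 1, action: "Let" },
];
// Both rules match "let", but Let wins on priority
assert_eq!(longest_match(&rules, "let answer = 42;"), Some(("Let", 3)));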
Implementation-wise, the lexer generator constructs a DFA to match a string by the rule set. The generated lexer consumes the input until it reaches the terminal state. While walking the DFA, the lexer remembers the last action state it has seen. When the terminal state is reached, the lexer returns the action of that state, or an error if no action state was visited.
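A minimal sketch of that walk, assuming a hand-written DFA given as a step function. This is a toy stand-in to show the "remember the last action state" idea, not the code the derive macro actually generates.

// Walk a DFA over `input`, remembering the last action state seen.
// `step` returns the next state for (state, byte), or None once the
// terminal state is reached; `action` reports the token of an action
// state, if any. Returns the token and the end offset of its match.
fn run_dfa(
    step: impl Fn(usize, u8) -> Option<usize>,
    action: impl Fn(usize) -> Option<&'static str>,
    input: &str,
) -> Result<(&'static str, usize), &'static str> {
    let mut state = 0;
    let mut last_action = None;
    for (i, b) in input.bytes().enumerate() {
        match step(state, b) {
            Some(next) => {
                state = next;
                if let Some(token) = action(state) {
                    // Remember the most recent action state we passed
                    last_action = Some((token, i + 1));
                }
            }
            // Terminal state: stop and report what we remembered
            None => break,
        }
    }
    last_action.ok_or("no action state was visited")
}

// A two-state DFA for the rule { regex: "[0-9]+", action: Integer }
let step = |state: usize, b: u8| match (state, b) {
    (0, b'0'..=b'9') | (1, b'0'..=b'9') => Some(1),
    _ => None,
};
let action = |state: usize| if state == 1 { Some("Integer") } else { None };

assert_eq!(run_dfa(step, action, "42;"), Ok(("Integer", 2)));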
Structs
Lexer | The lexer |
Spanned | A value (for example, a token) together with its range |
Traits
Sana | Trait implemented for an enum representing all tokens |
Derive Macros
Sana | Derives a lexer for the given enum |