Blex is a lightweight lexing framework written in Rust and based on the T-Lex framework. Blex is organized around a set of rules that process a sequence of tokens. Each rule is run over the input in turn, gradually transforming it into an output sequence of tokens.
## Tokens
The implementation of tokens is the idea borrowed most heavily from Rustuck. A token in Blex is a contiguous range of characters from the original string along with a set of tags, themselves also represented as borrowed strings. For example, the string `let x: int = 2;` might be represented as:
```
"let": var_dec, keyword
"x": ident
":": colon, keyword
"int": ident
"=": eq, keyword
"2": int
```
Notice the lack of whitespace: the tokens don't have to cover the entire input string.
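To make the representation concrete, here is a minimal sketch of roughly what a token might hold. The field names, lifetimes, and methods are illustrative assumptions rather than Blex's actual definition; the `has_tag` method shown is the tag test used later in this guide.
```
use std::ops::Range;

// Sketch only: Blex's real Token type may differ in names and layout.
#[derive(Clone)]
struct Token<'a> {
    source: &'a str,    // the original input string
    span: Range<usize>, // the slice of the source this token covers
    tags: Vec<&'a str>, // tags attached to this token
}

impl<'a> Token<'a> {
    // The text this token covers in the source.
    fn content(&self) -> &'a str {
        &self.source[self.span.clone()]
    }

    // True if this token carries the given tag.
    fn has_tag(&self, tag: &str) -> bool {
        self.tags.iter().any(|t| *t == tag)
    }
}
```
Because a token is just a range and a couple of pointers, cloning one is cheap, which matters for how rules are applied below.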
## Rules
A rule in Blex is any function that transforms a vector of tokens into an optional vector of tokens. In Rust, that corresponds to this trait bound: `Fn(Vec<Token>) -> Option<Vec<Token>>`. Rules may modify the input tokens without worrying about mutating the original string: tokens are cloned before rules are applied to them (luckily, cloning a token is very cheap, since it is just a range and a few pointers).
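For example, a minimal rule satisfying that bound might look like the sketch below. The `drop_ws` name and the `ws` tag are hypothetical, and `has_tag` is the tag-testing method covered later in this guide.
```
// A hypothetical one-token rule: delete any token tagged "ws" and leave
// everything else untouched.
fn drop_ws(tokens: Vec<Token>) -> Option<Vec<Token>> {
    if tokens.len() == 1 && tokens[0].has_tag("ws") {
        Some(vec![]) // replace the whitespace token with nothing
    } else {
        Some(tokens) // return the tokens unchanged
    }
}
```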
Rules are processed with the `process_rule` and `process_rules` functions. The `process_rule` function applies a rule starting at each token in a list. Processing starts with a single token, which is wrapped in a `Vec` and passed into the rule. If the rule returns `None`, the next token is appended to that `Vec` and the rule is called again. If the rule returns `Some`, the tokens that were passed in are replaced with the returned `Vec`.
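To make that concrete, here is a rough sketch of such a loop. This is not Blex's actual implementation; details like what happens when a rule runs out of tokens or returns an empty `Vec` are assumptions, and the sketch is written generically so it does not depend on Blex's `Token` type.
```
// A sketch of the processing loop described above, under the assumption that
// a growing window of tokens is handed to the rule until it returns Some.
fn process_rule_sketch<T, R>(rule: R, body: &mut Vec<T>)
where
    T: Clone,
    R: Fn(Vec<T>) -> Option<Vec<T>>,
{
    let mut start = 0;
    while start < body.len() {
        let mut end = start + 1;
        loop {
            // Clone the current window and hand it to the rule.
            let window = body[start..end].to_vec();
            match rule(window) {
                // The rule produced a replacement: splice it into the stream.
                Some(replacement) => {
                    let advance = replacement.len().max(1);
                    body.splice(start..end, replacement);
                    start += advance; // always move forward at least one slot
                    break;
                }
                // The rule wants more input and another token is available.
                None if end < body.len() => end += 1,
                // No more tokens to offer; give up on this starting point.
                None => {
                    start += 1;
                    break;
                }
            }
        }
    }
}
```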
## Rule Processing
Rules operate on sequences of tokens, so a string of characters must first be transformed into tokens using the `str_to_tokens` function. Each character of the string becomes its own token, and that token is given one tag whose content is the character itself, a holdover from Rustuck.
### An Example
Let's create a rule that takes two consecutive tokens that read `ab` and converts them into one token with the tag `c`. The rule would start out like this:
```
fn ab_rule(tokens: Vec<Token>) -> Option<Vec<Token>> {
}
```
Let's say we input the string `a b blex ab abab`. The string will be turned into these tokens:
```
"a": a;
" ": ;
"b": b;
" ": ;
"b": b;
"l": l;
"e": e;
"x": x;
" ": ;
"a": a;
"b": b;
" ": ;
"a": a;
"b": b;
"a": a;
"b": b;
"":
```
Notice the empty token at the end. We didn't type that: it was added automatically to give some rules, like those testing for words, a buffer.
Our rule will start by scanning each token individually. Remember that we are scanning for the pattern "ab". Let's chart out each possible route our rule could take.
- The rule will start by receiving a single token containing one character.
  - If the token has the tag `a`, we should continue the rule to include the next token.
  - Otherwise, we should stop the rule and return the token unchanged.
- If multiple tokens are passed in, then based on the previous case, it must be a pair of tokens whose first token is tagged `a`. Knowing that, there are two paths we can take:
  - If the second token has the tag `b`, we should combine those tokens and give the result the tag `c`.
  - Otherwise, we should stop the rule and return the tokens unchanged.
Luckily, Blex has idiomatic ways to express this. We'll start by `match`ing on `tokens_structure(&tokens)`, which returns one of two variants:
```
fn ab_rule(tokens: Vec<Token>) -> Option<Vec<Token>> {
    match tokens_structure(&tokens) {
        TokenStructure::Single(tok) => {
        },
        TokenStructure::Multiple => {
        }
    }
}
```
The `tokens_structure` function takes in a borrowed `Vec` of tokens. If it finds that the `Vec` holds only one token, it returns that token wrapped in a `Single`. Otherwise, it returns `Multiple`. This is a safe way to guarantee that exactly one token exists before working with it.
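For reference, the enum might be shaped roughly like the sketch below. This is written generically so the example stands alone; the real definition in Blex (including whether `Single` borrows or owns its token, and whether the function takes a slice or a borrowed `Vec`) may differ.
```
// Sketch only: a structure-reporting enum and helper of the kind described
// above, generic over the token type so the example is self-contained.
enum TokenStructure<'a, T> {
    Single(&'a T),
    Multiple,
}

fn tokens_structure<T>(tokens: &[T]) -> TokenStructure<'_, T> {
    match tokens {
        [only] => TokenStructure::Single(only),
        _ => TokenStructure::Multiple,
    }
}
```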
There is also an idiomatic way to test a token for a given tag: the `has_tag` method. We can use it in both arms to test for the `a` and `b` tags respectively.
```
fn ab_rule(tokens: Vec<Token>) -> Option<Vec<Token>> {
    match tokens_structure(&tokens) {
        TokenStructure::Single(tok) => {
            if tok.has_tag("a") {
                // case 1
            } else {
                // case 2
            }
        },
        TokenStructure::Multiple => {
            if tokens[1].has_tag("b") {
                // case 3
            } else {
                // case 4
            }
        }
    }
}
```
The only thing left is the return values:
- In case 1, we want to continue the rule to the next token. We do this by returning `None`.
- In cases 2 and 4, we want to end the rule and return the tokens unchanged. We simply do this by returning `Some(tokens)`.
- In case 3, we want to combine our two tokens into one and give it the tag `c`. Of course, there is an idiomatic way to do this as well: `wrap`. The `wrap` function takes a `Vec` of tokens (which are assumed to come from the same original string) and combines all of their contents into one token. The second argument of `wrap` is a `Vec<&str>` containing the tags to add to the new token.
```
fn ab_rule(tokens: Vec<Token>) -> Option<Vec<Token>> {
    match tokens_structure(&tokens) {
        TokenStructure::Single(tok) => {
            if tok.has_tag("a") {
                None
            } else {
                Some(tokens)
            }
        },
        TokenStructure::Multiple => {
            if tokens[1].has_tag("b") {
                Some(vec![wrap(tokens, vec!["c"])])
            } else {
                Some(tokens)
            }
        }
    }
}
```
That completes our rule! We can test it with a script like this:
```
#[test]
fn apply_ab() {
    let text = "a b blex ab abab";
    let mut body = str_to_tokens(text);
    process_rule(ab_rule, &mut body);
    print_tokens(body);
}
```
Which gives us:
```
"a": a;
" ": ;
"b": b;
" ": ;
"b": b;
"l": l;
"e": e;
"x": x;
" ": ;
"ab": c;
" ": ;
"ab": c;
"ab": c;
"":
```