Crate scnr

Source
Expand description

§scnr

The scnr crate is a library that provides lexical scanner for programming languages. It is designed to be used in a parser of a compiler or interpreter for a programming language or in similar tools that require lexical analysis, e.g. in a language server. It provides multiple scanner modes out of the box, which can be switched at runtime depending on the context of the input. A parser can use different modes for different parts of the input, e.g. to scan comments in one mode and code in another. The scanner is designed to be fast and efficient, and it is implemented with the help of finite state machines. To parse the given regular expressions, the crate uses the regex-syntax crate.

§Example with a simple pattern list

use scnr::ScannerBuilder;

static PATTERNS: &[&str] = &[
    r";",                    // Semicolon
    r"0|[1-9][0-9]*",        // Number
    r"//.*(\r\n|\r|\n)",     // Line comment
    r"/\*([^*]|\*[^/])*\*/", // Block comment
    r"[a-zA-Z_]\w*",         // Identifier
    r"=",                    // Assignment
];

const INPUT: &str = r#"
// This is a comment
a = 10;
b = 20;
/* This is a block comment
   that spans multiple lines */
c = a;
"#;

fn main() {
    let scanner = ScannerBuilder::new()
        .add_patterns(PATTERNS)
        .build()
        .expect("ScannerBuilder error");
    let find_iter = scanner.find_iter(INPUT);
    for ma in find_iter {
        println!("Match: {:?}: '{}'", ma, &INPUT[ma.span().range()]);
    }
}

The output of the example is:

Match: Match { token_type: 2, span: Span { start: 1, end: 22 } }: '// This is a comment
'
Match: Match { token_type: 4, span: Span { start: 22, end: 23 } }: 'a'
Match: Match { token_type: 5, span: Span { start: 24, end: 25 } }: '='
Match: Match { token_type: 1, span: Span { start: 26, end: 28 } }: '10'
Match: Match { token_type: 0, span: Span { start: 28, end: 29 } }: ';'
Match: Match { token_type: 4, span: Span { start: 30, end: 31 } }: 'b'
Match: Match { token_type: 5, span: Span { start: 32, end: 33 } }: '='
Match: Match { token_type: 1, span: Span { start: 34, end: 36 } }: '20'
Match: Match { token_type: 0, span: Span { start: 36, end: 37 } }: ';'
Match: Match { token_type: 3, span: Span { start: 38, end: 96 } }: '/* This is a block comment
   that spans multiple lines */'
Match: Match { token_type: 4, span: Span { start: 97, end: 98 } }: 'c'
Match: Match { token_type: 5, span: Span { start: 99, end: 100 } }: '='
Match: Match { token_type: 4, span: Span { start: 101, end: 102 } }: 'a'
Match: Match { token_type: 0, span: Span { start: 102, end: 103 } }: ';'

§Example with scanner modes and position information

use std::sync::LazyLock;

use scnr::{MatchExtIterator, Pattern, ScannerBuilder, ScannerMode};

static SCANNER_MODES: LazyLock<Vec<ScannerMode>> = LazyLock::new(|| {
    vec![
        ScannerMode::new(
            "INITIAL",
            vec![
                Pattern::new(r"\r\n|\r|\n".to_string(), 0),   // Newline
                Pattern::new(r"[a-zA-Z_]\w*".to_string(), 4), // Identifier
                Pattern::new(r#"""#.to_string(), 6),          // String delimiter
            ],
            vec![
                (6, 1), // Token "String delimiter" -> Mode "STRING"
            ],
        ),
        ScannerMode::new(
            "STRING",
            vec![
                Pattern::new(r#"""#.to_string(), 6),     // String delimiter
                Pattern::new(r#"[^"]+"#.to_string(), 5), // String content
            ],
            vec![
                (6, 0), // Token "String delimiter" -> Mode "INITIAL"
            ],
        ),
    ]
});

const INPUT: &str = r#"Id1 "1. String" "2. String""#;

fn main() {
    let scanner = ScannerBuilder::new()
        .add_scanner_modes(&SCANNER_MODES)
        .build()
        .expect("ScannerBuilder error");
    let find_iter = scanner.find_iter(INPUT).with_positions();
    for ma in find_iter {
        println!("{:?}: '{}'", ma, &INPUT[ma.span().range()]);
    }
}

The output of this example is:

MatchExt { token_type: 4, span: Span { start: 0, end: 3 }, start_position: Position { line: 1, column: 1 }, end_position: Position { line: 1, column: 4 } }: 'Id1'
MatchExt { token_type: 6, span: Span { start: 4, end: 5 }, start_position: Position { line: 1, column: 5 }, end_position: Position { line: 1, column: 6 } }: '"'
MatchExt { token_type: 5, span: Span { start: 5, end: 14 }, start_position: Position { line: 1, column: 6 }, end_position: Position { line: 1, column: 15 } }: '1. String'
MatchExt { token_type: 6, span: Span { start: 14, end: 15 }, start_position: Position { line: 1, column: 15 }, end_position: Position { line: 1, column: 16 } }: '"'
MatchExt { token_type: 6, span: Span { start: 16, end: 17 }, start_position: Position { line: 1, column: 17 }, end_position: Position { line: 1, column: 18 } }: '"'
MatchExt { token_type: 5, span: Span { start: 17, end: 26 }, start_position: Position { line: 1, column: 18 }, end_position: Position { line: 1, column: 27 } }: '2. String'
MatchExt { token_type: 6, span: Span { start: 26, end: 27 }, start_position: Position { line: 1, column: 27 }, end_position: Position { line: 1, column: 28 } }: '"'

§Crate features

The crate has the following features:

  • default: This is the default feature set. When it is enabled it uses the scnr crate’s own regex engine.

  • regex_automata: This feature is not enabled by default. It instructs the lib to use the crate regex_automata as regex engine.

Both features are mutually exclusive. You can enable one of them, but not both at the same time.

Enabling the default feature usually results in a slower scanner, but it is faster at compiling the regexes. The regex_automata feature is faster at scanning the input, but it is possibly slower at compiling the regexes. This depends on the size of your scanner modes, i.e. the number of regexes you use.

Structs§

FindMatches
An iterator over all non-overlapping matches.
Lookahead
A lookahead is a regular expression that restricts a match of a pattern so that it must be matched after the pattern.
Match
A match in the haystack.
MatchExt
A match with line and column information for start and end positions.
Pattern
A pattern that is used to match the input. The pattern is represented by a regular expression and a token type number. The token type number is used to identify the pattern in the scanner. The pattern also has an optional Lookahead.
Position
A position in the haystack. The position is represented by a line and column number. The line and column numbers are 1-based.
Scanner
A Scanner. It consists of multiple DFAs that are used to search for matches.
ScannerBuilder
A builder for creating a scanner.
ScannerMode
A scanner mode that can be used to scan specific parts of the input. It has a name and a set of patterns that are valid token types in this mode. The scanner mode can also have transitions to other scanner modes triggered by a token type.
ScnrError
The error type for the scrn crate.
Span
A span in an input string.
WithPositions
An iterator over all non-overlapping matches with positions.

Enums§

PeekResult
The result of a peek operation.
ScnrErrorKind
The error kind type.

Traits§

MatchExtIterator
An extension trait for iterators over matches.
PositionProvider
A trait for providing the line and column information of a given byte offset in the haystack. It also provides a method to set the offset of the char indices iterator.
ScannerModeSwitcher
A trait to switch between scanner modes.

Type Aliases§

Result
The result type for the scrn crate.