Crate perl_lexer

Context-aware Perl lexer with mode-based tokenization

This crate provides a high-performance lexer for Perl that handles the inherently context-sensitive nature of the language. The lexer uses a mode-tracking system to disambiguate context-dependent syntax such as / (division vs. regex) and to tokenize complex constructs such as heredocs, quote-like operators, and nested delimiters.

§Architecture

The lexer is organized around several key concepts:

Mode Tracking: LexerMode tracks whether the parser expects a term or an operator, enabling correct disambiguation of context-sensitive tokens.
Checkpointing: LexerCheckpoint and Checkpointable support incremental parsing by allowing the lexer state to be saved and restored.
Budget Limits: Protection against pathological input with configurable size limits for regex patterns, heredoc bodies, and delimiter nesting depth.
Position Tracking: Position maintains line/column information for error reporting and LSP integration.
Unicode Support: Full Unicode identifier support following Perl 5.14+ semantics.

§Usage

§Basic Tokenization

use perl_lexer::{PerlLexer, TokenType};

let mut lexer = PerlLexer::new("my $x = 42;");
let tokens = lexer.collect_tokens();

// First token is the keyword `my`
assert!(matches!(&tokens[0].token_type, TokenType::Keyword(k) if &**k == "my"));
// Tokens include variables, operators, literals, and EOF
assert!(matches!(&tokens.last().map(|t| &t.token_type), Some(TokenType::EOF)));

§Context-Aware Parsing

The lexer automatically tracks context to disambiguate operators:

use perl_lexer::{PerlLexer, TokenType};

// Division operator (after a term)
let mut lexer = PerlLexer::new("42 / 2");
// Regex operator (at start of expression)
let mut lexer2 = PerlLexer::new("/pattern/");
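
The term/operator distinction can be illustrated without the crate at all. The following is a minimal standalone sketch, not the crate's implementation: `SketchMode`, `SketchToken`, and `sketch_tokenize` are hypothetical names invented for this example, and the toy handles only digits, whitespace, and `/` (no escapes, no regex modifiers).

```rust
/// Hypothetical two-state mode, loosely modeled on the term/operator
/// distinction described above. This is not the crate's `LexerMode`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SketchMode {
    ExpectTerm,
    ExpectOperator,
}

/// Hypothetical token set for the sketch.
#[derive(Debug, PartialEq)]
enum SketchToken {
    Number(String),
    Divide,
    Regex(String),
}

/// Tokenize digits and `/`, choosing between division and a regex
/// literal based on the current mode.
fn sketch_tokenize(src: &str) -> Vec<SketchToken> {
    let chars: Vec<char> = src.chars().collect();
    let mut mode = SketchMode::ExpectTerm;
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c.is_ascii_digit() {
            // A number is a term, so an operator is expected next.
            let start = i;
            while i < chars.len() && chars[i].is_ascii_digit() {
                i += 1;
            }
            tokens.push(SketchToken::Number(chars[start..i].iter().collect()));
            mode = SketchMode::ExpectOperator;
        } else if c == '/' && mode == SketchMode::ExpectOperator {
            // After a term, `/` is division.
            tokens.push(SketchToken::Divide);
            mode = SketchMode::ExpectTerm;
            i += 1;
        } else if c == '/' {
            // Where a term is expected, `/` opens a regex literal.
            let start = i + 1;
            let mut end = start;
            while end < chars.len() && chars[end] != '/' {
                end += 1;
            }
            tokens.push(SketchToken::Regex(chars[start..end].iter().collect()));
            i = (end + 1).min(chars.len());
            mode = SketchMode::ExpectOperator;
        } else {
            // Whitespace and anything else is skipped in this sketch.
            i += 1;
        }
    }
    tokens
}
```

The only state carried between tokens is the mode, which is what lets the same character produce two different tokens in `42 / 2` and `/pattern/`.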

§Checkpointing for Incremental Parsing

use perl_lexer::{PerlLexer, Checkpointable};

let mut lexer = PerlLexer::new("my $x = 1;");
let checkpoint = lexer.checkpoint();

// Consume one token
let _ = lexer.next_token();

// Restore to checkpoint
lexer.restore(&checkpoint);
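
To show what saving and restoring lexer state amounts to, here is a standalone sketch under stated assumptions. `SketchLexer` and `SketchCheckpoint` are hypothetical types made up for illustration; the fields a real `LexerCheckpoint` stores are not described here, so the sketch simply captures a byte offset plus one mode flag.

```rust
/// Hypothetical snapshot of lexer state: a byte offset plus a mode flag.
#[derive(Debug, Clone, PartialEq)]
struct SketchCheckpoint {
    offset: usize,
    expect_term: bool,
}

/// Hypothetical cursor-style lexer that splits on whitespace.
struct SketchLexer<'a> {
    src: &'a str,
    offset: usize,
    expect_term: bool,
}

impl<'a> SketchLexer<'a> {
    fn new(src: &'a str) -> Self {
        SketchLexer { src, offset: 0, expect_term: true }
    }

    /// Return the next whitespace-delimited word, or `None` at end of input.
    fn next_word(&mut self) -> Option<&'a str> {
        let src: &'a str = self.src;
        let trimmed = src[self.offset..].trim_start();
        if trimmed.is_empty() {
            self.offset = src.len();
            return None;
        }
        let start = src.len() - trimmed.len();
        let len = trimmed.find(char::is_whitespace).unwrap_or(trimmed.len());
        self.offset = start + len;
        // Stand-in for real mode updates: flip the flag on every word.
        self.expect_term = !self.expect_term;
        Some(&src[start..start + len])
    }

    /// Capture everything needed to resume lexing from this point.
    fn checkpoint(&self) -> SketchCheckpoint {
        SketchCheckpoint { offset: self.offset, expect_term: self.expect_term }
    }

    /// Rewind to a previously captured state.
    fn restore(&mut self, cp: &SketchCheckpoint) {
        self.offset = cp.offset;
        self.expect_term = cp.expect_term;
    }
}
```

Because the checkpoint holds the mode as well as the offset, lexing resumed from it makes the same context-sensitive decisions as the original pass. That property is what an incremental parser relies on when it re-lexes only an edited region.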

§Configuration Options

use perl_lexer::{PerlLexer, LexerConfig};

let config = LexerConfig {
    parse_interpolation: true,  // Parse string interpolation
    track_positions: true,      // Track line/column positions
    max_lookahead: 1024,        // Maximum lookahead for disambiguation
};

let mut lexer = PerlLexer::with_config("my $x = 1;", config);

§Context Sensitivity Examples

Perl’s grammar is highly context-sensitive. The lexer handles these cases:

Division vs. Regex: / is division after terms, regex at expression start
Modulo vs. Hash Sigil: % is modulo after terms, hash sigil at expression start
Glob vs. Exponent: ** can be exponentiation or glob pattern start
Defined-or vs. Regex: // is defined-or after terms, regex at expression start
Heredoc Markers: << is left shift after terms, heredoc start at expression start
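
As one concrete case from the list above, the choice between left shift and a heredoc for `<<` can be sketched as a small classification function. This is a simplified heuristic written for illustration, not the rule the crate applies: `ShiftOrHeredoc` and `classify_double_angle` are hypothetical names, and the check looks only at the character that follows `<<` (after an optional `~` for indented heredocs).

```rust
/// Hypothetical result type for the sketch.
#[derive(Debug, PartialEq)]
enum ShiftOrHeredoc {
    LeftShift,
    HeredocStart,
}

/// Classify a `<<` given whether a term is expected and the text
/// that immediately follows it.
fn classify_double_angle(expect_term: bool, following: &str) -> ShiftOrHeredoc {
    // An optional `~` marks an indented heredoc (Perl 5.26+).
    let after_tilde = following.strip_prefix('~').unwrap_or(following);
    // A heredoc tag starts with a quote or an identifier character.
    let starts_tag = match after_tilde.chars().next() {
        Some(c) => c == '"' || c == '\'' || c == '_' || c.is_alphabetic(),
        None => false,
    };
    if expect_term && starts_tag {
        ShiftOrHeredoc::HeredocStart
    } else {
        ShiftOrHeredoc::LeftShift
    }
}
```

For example, `classify_double_angle(true, "EOF;")` yields `HeredocStart`, while `classify_double_angle(false, " 2")` yields `LeftShift`.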

§Budget Limits

To prevent hangs on pathological input, the lexer enforces these limits:

MAX_REGEX_BYTES: 64KB maximum for regex patterns
MAX_HEREDOC_BYTES: 256KB maximum for heredoc bodies
MAX_DELIM_NEST: 128 levels maximum nesting depth for delimiters
MAX_REGEX_PARSE_STEPS: 32K maximum scan iterations for regex literals

When a limit is exceeded, the lexer emits an UnknownRest token while preserving everything parsed before that point, so analysis can continue.
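
The general shape of a budgeted scan with graceful degradation might look like the sketch below. It is standalone and rests on assumptions: `SKETCH_MAX_REGEX_BYTES`, `BudgetedToken`, and `scan_regex_body` are hypothetical names, the budget is deliberately tiny so the overflow path is easy to trigger, and escapes are not handled.

```rust
/// Hypothetical budget for this sketch only. The crate's own
/// MAX_REGEX_BYTES limit is documented above as 64KB.
const SKETCH_MAX_REGEX_BYTES: usize = 16;

/// Hypothetical token set: a complete regex body, or a fallback token
/// that covers the input the scanner gave up on.
#[derive(Debug, PartialEq)]
enum BudgetedToken<'a> {
    Regex(&'a str),
    UnknownRest(&'a str),
}

/// Scan a regex body that starts just after the opening `/`.
/// Falls back to `UnknownRest` if no closing `/` appears within budget.
fn scan_regex_body(body: &str) -> BudgetedToken<'_> {
    for (idx, byte) in body.bytes().enumerate() {
        if idx > SKETCH_MAX_REGEX_BYTES {
            // The pattern is longer than the budget allows; stop scanning.
            break;
        }
        if byte == b'/' {
            // `/` is ASCII, so `idx` is always a valid char boundary.
            return BudgetedToken::Regex(&body[..idx]);
        }
    }
    BudgetedToken::UnknownRest(body)
}
```

The scan does bounded work regardless of input length, and a caller that has already collected earlier tokens can keep them and append the fallback token instead of aborting.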

§Integration with perl-parser

The lexer is designed to work with perl_parser_core::Parser. You rarely need to use the lexer directly, because the parser creates and manages a PerlLexer instance internally:

use perl_parser_core::Parser;

let code = r#"sub hello { print "Hello, world!\n"; }"#;
let mut parser = Parser::new(code);
let ast = parser.parse().expect("should parse");

Re-exports§

pub use checkpoint::CheckpointCache;
pub use checkpoint::Checkpointable;
pub use checkpoint::LexerCheckpoint;
pub use config::LexerConfig;
pub use error::LexerError;
pub use error::Result;
pub use limits::MAX_REGEX_PARSE_STEPS;
pub use mode::LexerMode;
pub use token::StringPart;
pub use token::Token;
pub use token::TokenType;
pub use api::*;

Modules§

api: Public API re-exports for perl-lexer post-collapse.
builtins: Builtin function signatures and metadata for Perl.
checkpoint: Lexer checkpointing for incremental parsing.
config
error: Error types for the Perl lexer
keywords: Canonical Perl keyword inventories and allocation-free lookup helpers.
limits: Lexer parse budgets and limits used for graceful degradation on pathological input.
mode: Lexer modes for context-sensitive parsing
token: Token types and structures for the Perl lexer.
tokenizer: Token utilities bridging raw lexer output to parser consumption.

Structs§

PerlLexer
Position: A position in a source file with byte offset, line, and column
