Context-aware Perl lexer with mode-based tokenization
This crate provides a high-performance lexer for Perl that handles the inherently
context-sensitive nature of the language. The lexer uses a mode-tracking system to
correctly disambiguate ambiguous syntax like / (division vs. regex) and properly
parse complex constructs like heredocs, quote-like operators, and nested delimiters.
§Architecture
The lexer is organized around several key concepts:
- Mode Tracking: LexerMode tracks whether the parser expects a term or an operator, enabling correct disambiguation of context-sensitive tokens.
- Checkpointing: LexerCheckpoint and Checkpointable support incremental parsing by allowing the lexer state to be saved and restored.
- Budget Limits: Protection against pathological input with configurable size limits for regex patterns, heredoc bodies, and delimiter nesting depth.
- Position Tracking: Position maintains line/column information for error reporting and LSP integration.
- Unicode Support: Full Unicode identifier support following Perl 5.14+ semantics.
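The checkpointing and position-tracking ideas in the list above can be illustrated with a small self-contained sketch. It does not use this crate: MiniLexer, MiniCheckpoint, and next_word are hypothetical names invented for illustration, and the real LexerCheckpoint may well store more state (for example the current mode). The sketch only shows the general save/restore pattern, assuming the lexer state is a byte offset plus a line/column pair.

```rust
// Hypothetical miniature lexer; none of these names come from perl_lexer.

/// Snapshot of everything this toy lexer needs in order to resume.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct MiniCheckpoint {
    offset: usize,
    line: usize,
    column: usize,
}

struct MiniLexer<'a> {
    input: &'a str,
    offset: usize,
    line: usize,
    column: usize,
}

impl<'a> MiniLexer<'a> {
    fn new(input: &'a str) -> Self {
        MiniLexer { input, offset: 0, line: 1, column: 1 }
    }

    /// Capture the current state so lexing can later resume from here.
    fn checkpoint(&self) -> MiniCheckpoint {
        MiniCheckpoint { offset: self.offset, line: self.line, column: self.column }
    }

    /// Reset the lexer to a previously captured state.
    fn restore(&mut self, cp: &MiniCheckpoint) {
        self.offset = cp.offset;
        self.line = cp.line;
        self.column = cp.column;
    }

    /// Advance past one character, keeping line/column in sync.
    fn advance(&mut self, c: char) {
        self.offset += c.len_utf8();
        if c == '\n' {
            self.line += 1;
            self.column = 1;
        } else {
            self.column += 1;
        }
    }

    /// Return the next whitespace-delimited word, or None at end of input.
    fn next_word(&mut self) -> Option<&'a str> {
        let input = self.input;
        // Skip leading whitespace.
        while let Some(c) = input[self.offset..].chars().next() {
            if !c.is_whitespace() {
                break;
            }
            self.advance(c);
        }
        let start = self.offset;
        // Consume characters up to the next whitespace.
        while let Some(c) = input[self.offset..].chars().next() {
            if c.is_whitespace() {
                break;
            }
            self.advance(c);
        }
        if start == self.offset { None } else { Some(&input[start..self.offset]) }
    }
}
```

Taking a checkpoint, reading a few words, and then calling restore puts the toy lexer back at the saved offset and line/column, so the same words are produced again. An incremental parser can use the same pattern to re-lex only the region after an edit.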
§Usage
§Basic Tokenization
use perl_lexer::{PerlLexer, TokenType};
let mut lexer = PerlLexer::new("my $x = 42;");
let tokens = lexer.collect_tokens();
// First token is the keyword `my`
assert!(matches!(&tokens[0].token_type, TokenType::Keyword(k) if &**k == "my"));
// Tokens include variables, operators, literals, and EOF
assert!(matches!(&tokens.last().map(|t| &t.token_type), Some(TokenType::EOF)));
§Context-Aware Parsing
The lexer automatically tracks context to disambiguate operators:
use perl_lexer::{PerlLexer, TokenType};
// Division operator (after a term)
let mut lexer = PerlLexer::new("42 / 2");
// Regex operator (at start of expression)
let mut lexer2 = PerlLexer::new("/pattern/");
§Checkpointing for Incremental Parsing
use perl_lexer::{PerlLexer, Checkpointable};
let mut lexer = PerlLexer::new("my $x = 1;");
let checkpoint = lexer.checkpoint();
// Parse some tokens
let _ = lexer.next_token();
// Restore to checkpoint
lexer.restore(&checkpoint);
§Configuration Options
use perl_lexer::{PerlLexer, LexerConfig};
let config = LexerConfig {
parse_interpolation: true, // Parse string interpolation
track_positions: true, // Track line/column positions
max_lookahead: 1024, // Maximum lookahead for disambiguation
};
let mut lexer = PerlLexer::with_config("my $x = 1;", config);
§Context Sensitivity Examples
Perl’s grammar is highly context-sensitive. The lexer handles these cases:
- Division vs. Regex: / is division after terms, regex at expression start
- Modulo vs. Hash Sigil: % is modulo after terms, hash sigil at expression start
- Glob vs. Exponent: ** can be exponentiation or glob pattern start
- Defined-or vs. Regex: // is defined-or after terms, regex at expression start
- Heredoc Markers: << can be left shift, here-doc, or numeric less-than-less-than
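The term/operator distinction behind the first and fourth cases above can be sketched in a few lines. This is a self-contained toy, not the crate's implementation: Mode, Tok, and lex are hypothetical names, only digits and slashes are recognized, and the real LexerMode may carry more states than the two assumed here.

```rust
// Hypothetical two-state mode tracker; not the crate's LexerMode.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    ExpectTerm,     // at expression start or after an operator
    ExpectOperator, // right after a complete term
}

#[derive(Debug, Clone, PartialEq, Eq)]
enum Tok {
    Number(String),
    Divide,
    DefinedOr,
    Regex(String),
}

/// Tokenize digits and slashes only; every other character is skipped.
fn lex(src: &str) -> Vec<Tok> {
    let chars: Vec<char> = src.chars().collect();
    let mut i = 0;
    let mut mode = Mode::ExpectTerm;
    let mut out = Vec::new();
    while i < chars.len() {
        let c = chars[i];
        if c.is_ascii_digit() {
            let start = i;
            while i < chars.len() && chars[i].is_ascii_digit() {
                i += 1;
            }
            out.push(Tok::Number(chars[start..i].iter().collect()));
            mode = Mode::ExpectOperator; // a term was just completed
        } else if c == '/' && mode == Mode::ExpectOperator {
            // After a term, a slash is an operator: `//` or `/`.
            if i + 1 < chars.len() && chars[i + 1] == '/' {
                out.push(Tok::DefinedOr);
                i += 2;
            } else {
                out.push(Tok::Divide);
                i += 1;
            }
            mode = Mode::ExpectTerm;
        } else if c == '/' {
            // Where a term is expected, a slash opens a regex literal.
            let start = i + 1;
            let mut j = start;
            while j < chars.len() && chars[j] != '/' {
                if chars[j] == '\\' {
                    j += 1; // skip the escaped character
                }
                j += 1;
            }
            let end = j.min(chars.len());
            out.push(Tok::Regex(chars[start..end].iter().collect()));
            i = end + 1; // step past the closing slash, if any
            mode = Mode::ExpectOperator; // a regex literal is a term
        } else {
            i += 1; // whitespace and anything else is ignored in this toy
        }
    }
    out
}
```

With this sketch, "42 / 2" yields a Divide token between two numbers, while "/pattern/" yields a single Regex token, because the mode at the first slash differs. The same single bit of state decides between defined-or and an empty regex for //.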
§Budget Limits
To prevent hangs on pathological input, the lexer enforces these limits:
- MAX_REGEX_BYTES: 64KB maximum for regex patterns
- MAX_HEREDOC_BYTES: 256KB maximum for heredoc bodies
- MAX_DELIM_NEST: 128 levels maximum nesting depth for delimiters
- MAX_REGEX_PARSE_STEPS: 32K maximum scan iterations for regex literals
When limits are exceeded, the lexer emits an UnknownRest token preserving
all previously parsed symbols, allowing continued analysis.
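A byte budget of this kind can be sketched as follows. The code is a self-contained illustration rather than the crate's scanner: Scan, scan_regex, and DEMO_MAX_REGEX_BYTES are hypothetical names, the demo limit is deliberately tiny, and returning the whole unscanned remainder as UnknownRest is an assumption about how the fallback token might look.

```rust
// Hypothetical demo limit; the documented MAX_REGEX_BYTES is 64KB.
const DEMO_MAX_REGEX_BYTES: usize = 16;

#[derive(Debug, PartialEq, Eq)]
enum Scan<'a> {
    /// Pattern text between the slashes, within budget.
    Regex(&'a str),
    /// Budget exceeded or literal unterminated: hand back the rest unparsed.
    UnknownRest(&'a str),
}

/// Scan a `/.../` literal at the start of `src`, giving up once the
/// pattern would exceed `max_bytes`.
fn scan_regex(src: &str, max_bytes: usize) -> Scan<'_> {
    let body = match src.strip_prefix('/') {
        Some(rest) => rest,
        None => return Scan::UnknownRest(src),
    };
    let mut escaped = false;
    for (idx, ch) in body.char_indices() {
        // An unescaped slash closes the literal.
        if !escaped && ch == '/' {
            return Scan::Regex(&body[..idx]);
        }
        // Stop as soon as the pattern would grow past the budget.
        if idx + ch.len_utf8() > max_bytes {
            return Scan::UnknownRest(src);
        }
        // A backslash escapes exactly the next character.
        escaped = !escaped && ch == '\\';
    }
    // Ran off the end without finding a closing slash.
    Scan::UnknownRest(src)
}
```

Because the scan stops at the budget instead of continuing to the end of the input, the work done per literal is bounded, and tokens produced before the oversized literal remain valid for the caller.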
§Integration with perl-parser
The lexer is designed to work seamlessly with perl_parser_core::Parser.
You rarely need to use the lexer directly – the parser creates and manages
a PerlLexer instance internally:
use perl_parser_core::Parser;
let code = r#"sub hello { print "Hello, world!\n"; }"#;
let mut parser = Parser::new(code);
let ast = parser.parse().expect("should parse");
Re-exports§
pub use checkpoint::CheckpointCache;
pub use checkpoint::Checkpointable;
pub use checkpoint::LexerCheckpoint;
pub use config::LexerConfig;
pub use error::LexerError;
pub use error::Result;
pub use limits::MAX_REGEX_PARSE_STEPS;
pub use mode::LexerMode;
pub use token::StringPart;
pub use token::Token;
pub use token::TokenType;
pub use api::*;
Modules§
- api
- Public API re-exports for perl-lexer post-collapse.
- builtins
- Builtin function signatures and metadata for Perl.
- checkpoint
- Lexer checkpointing for incremental parsing.
- config
- error
- Error types for the Perl lexer
- keywords
- Canonical Perl keyword inventories and allocation-free lookup helpers.
- limits
- Lexer parse budgets and limits used for graceful degradation on pathological input.
- mode
- Lexer modes for context-sensitive parsing
- token
- Token types and structures for the Perl lexer.
- tokenizer
- Token utilities bridging raw lexer output to parser consumption.