Context-aware Perl lexer with mode-based tokenization
This crate provides a high-performance lexer for Perl that handles the inherently
context-sensitive nature of the language. The lexer uses a mode-tracking system to
disambiguate syntax such as / (division vs. regex) and to parse complex constructs
such as heredocs, quote-like operators, and nested delimiters.
§Architecture
The lexer is organized around several key concepts:
- Mode Tracking: LexerMode tracks whether the parser expects a term or an operator, enabling correct disambiguation of context-sensitive tokens.
- Checkpointing: LexerCheckpoint and Checkpointable support incremental parsing by allowing the lexer state to be saved and restored.
- Budget Limits: Protection against pathological input with configurable size limits for regex patterns, heredoc bodies, and delimiter nesting depth.
- Position Tracking: Position maintains line/column information for error reporting and LSP integration.
- Unicode Support: Full Unicode identifier support following Perl 5.14+ semantics.
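The checkpointing and position-tracking ideas above can be illustrated with a toy lexer that uses only the standard library. This is a minimal sketch and not the crate's implementation: ToyLexer, ToyCheckpoint, and next_word are hypothetical names invented for illustration, and the real LexerCheckpoint may carry more state (for example the current mode).

```rust
/// Hypothetical snapshot of lexer state (illustration only, not `LexerCheckpoint`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ToyCheckpoint {
    offset: usize, // byte offset into the source
    line: u32,     // 1-based line number
    column: u32,   // 1-based column, counted in characters
}

/// Hypothetical whitespace-splitting lexer that tracks its position.
struct ToyLexer<'a> {
    src: &'a str,
    offset: usize,
    line: u32,
    column: u32,
}

impl<'a> ToyLexer<'a> {
    fn new(src: &'a str) -> Self {
        ToyLexer { src, offset: 0, line: 1, column: 1 }
    }

    /// Save the current state so lexing can resume from here later.
    fn checkpoint(&self) -> ToyCheckpoint {
        ToyCheckpoint { offset: self.offset, line: self.line, column: self.column }
    }

    /// Rewind (or fast-forward) to a previously saved state.
    fn restore(&mut self, cp: &ToyCheckpoint) {
        self.offset = cp.offset;
        self.line = cp.line;
        self.column = cp.column;
    }

    fn peek(&self) -> Option<char> {
        self.src[self.offset..].chars().next()
    }

    /// Consume one character and update line/column bookkeeping.
    fn bump(&mut self) {
        if let Some(ch) = self.peek() {
            self.offset += ch.len_utf8();
            if ch == '\n' {
                self.line += 1;
                self.column = 1;
            } else {
                self.column += 1;
            }
        }
    }

    /// Return the next run of non-whitespace characters, if any.
    fn next_word(&mut self) -> Option<&'a str> {
        while matches!(self.peek(), Some(c) if c.is_whitespace()) {
            self.bump();
        }
        let start = self.offset;
        while matches!(self.peek(), Some(c) if !c.is_whitespace()) {
            self.bump();
        }
        let src: &'a str = self.src;
        if start == self.offset { None } else { Some(&src[start..self.offset]) }
    }
}
```

Because the snapshot is a small Copy value, saving one before each statement is cheap, which is the property an incremental parser relies on when it re-lexes only the edited region.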
§Usage
§Basic Tokenization
use perl_lexer::{PerlLexer, TokenType};
let mut lexer = PerlLexer::new("my $x = 42;");
let tokens = lexer.collect_tokens();
// First token is the keyword `my`
assert!(matches!(&tokens[0].token_type, TokenType::Keyword(k) if &**k == "my"));
// Tokens include variables, operators, literals, and EOF
assert!(matches!(&tokens.last().map(|t| &t.token_type), Some(TokenType::EOF)));
§Context-Aware Parsing
The lexer automatically tracks context to disambiguate operators:
use perl_lexer::{PerlLexer, TokenType};
// Division operator (after a term)
let mut lexer = PerlLexer::new("42 / 2");
// Regex operator (at start of expression)
let mut lexer2 = PerlLexer::new("/pattern/");
§Checkpointing for Incremental Parsing
use perl_lexer::{PerlLexer, Checkpointable};
let mut lexer = PerlLexer::new("my $x = 1;");
let checkpoint = lexer.checkpoint();
// Parse some tokens
let _ = lexer.next_token();
// Restore to checkpoint
lexer.restore(&checkpoint);
§Configuration Options
use perl_lexer::{PerlLexer, LexerConfig};
let config = LexerConfig {
    parse_interpolation: true, // Parse string interpolation
    track_positions: true,     // Track line/column positions
    max_lookahead: 1024,       // Maximum lookahead for disambiguation
};
let mut lexer = PerlLexer::with_config("my $x = 1;", config);
§Context Sensitivity Examples
Perl’s grammar is highly context-sensitive. The lexer handles these cases:
- Division vs. Regex: / is division after terms, regex at expression start
- Modulo vs. Hash Sigil: % is modulo after terms, hash sigil at expression start
- Glob vs. Exponent: ** can be exponentiation or glob pattern start
- Defined-or vs. Regex: // is defined-or after terms, regex at expression start
- Heredoc Markers: << can be left shift, here-doc, or numeric less-than-less-than
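The first case in the list, division vs. regex, can be sketched with a two-state mode flag. The code below is a standalone illustration using only the standard library. Mode, Tok, and tokenize are hypothetical names, the regex scan ignores escapes and modifiers, and the crate's actual LexerMode logic is assumed to be considerably more involved.

```rust
/// Hypothetical two-state mode standing in for the crate's `LexerMode`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    ExpectTerm,     // start of an expression, or just after an operator
    ExpectOperator, // just after a complete term
}

/// Hypothetical token type for this sketch only.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Tok {
    Number(String),
    Divide,
    Regex(String),
}

/// Tokenize a tiny subset of Perl: integers, `/` division, and `/.../` literals.
fn tokenize(src: &str) -> Vec<Tok> {
    let chars: Vec<char> = src.chars().collect();
    let mut mode = Mode::ExpectTerm;
    let mut out = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c.is_ascii_digit() {
            let start = i;
            while i < chars.len() && chars[i].is_ascii_digit() {
                i += 1;
            }
            out.push(Tok::Number(chars[start..i].iter().collect()));
            mode = Mode::ExpectOperator; // a term was just completed
        } else if c == '/' && mode == Mode::ExpectOperator {
            out.push(Tok::Divide);
            mode = Mode::ExpectTerm; // an operator must be followed by a term
            i += 1;
        } else if c == '/' {
            // Term position: read up to the next `/` (escapes are ignored here).
            let start = i + 1;
            let mut j = start;
            while j < chars.len() && chars[j] != '/' {
                j += 1;
            }
            out.push(Tok::Regex(chars[start..j].iter().collect()));
            mode = Mode::ExpectOperator; // a regex literal is itself a term
            i = (j + 1).min(chars.len());
        } else {
            i += 1; // whitespace and anything outside the subset is skipped
        }
    }
    out
}
```

With this sketch, tokenize("42 / 2") yields a number, a Divide token, and a number, while tokenize("/pattern/") yields a single Regex token. The same character produces different tokens purely because of the mode left behind by the previous token.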
§Budget Limits
To prevent hangs on pathological input, the lexer enforces these limits:
- MAX_REGEX_BYTES: 64KB maximum for regex patterns
- MAX_HEREDOC_BYTES: 256KB maximum for heredoc bodies
- MAX_DELIM_NEST: 128 levels maximum nesting depth for delimiters
When limits are exceeded, the lexer emits an UnknownRest token preserving
all previously parsed symbols, allowing continued analysis.
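The fallback behavior described above can be sketched as a bounded scan. This is a toy under stated assumptions rather than the crate's code: MAX_PATTERN_BYTES is deliberately tiny (the documented regex limit is 64KB), and BudgetTok, find_close, and lex are hypothetical names. Only the idea of an UnknownRest token that keeps earlier tokens intact comes from the text.

```rust
/// Hypothetical budget, kept tiny so the fallback is easy to trigger.
const MAX_PATTERN_BYTES: usize = 8;

/// Hypothetical token type for this sketch only.
#[derive(Debug, PartialEq, Eq)]
enum BudgetTok<'a> {
    Word(&'a str),
    Regex(&'a str),
    /// Everything from the point where the budget was exceeded.
    UnknownRest(&'a str),
}

/// Find the closing `/`, giving up once the scan passes the budget.
fn find_close(body: &str) -> Option<usize> {
    for (offset, ch) in body.char_indices() {
        if offset > MAX_PATTERN_BYTES {
            return None; // budget exceeded: stop scanning instead of reading on
        }
        if ch == '/' {
            return Some(offset);
        }
    }
    None // unterminated pattern
}

/// Lex whitespace-separated words and `/.../` literals.
fn lex(src: &str) -> Vec<BudgetTok<'_>> {
    let mut out = Vec::new();
    let mut rest = src.trim_start();
    while !rest.is_empty() {
        if let Some(body) = rest.strip_prefix('/') {
            match find_close(body) {
                Some(end) => {
                    out.push(BudgetTok::Regex(&body[..end]));
                    rest = &body[end + 1..];
                }
                None => {
                    // Keep the tokens collected so far and hand back the remainder.
                    out.push(BudgetTok::UnknownRest(rest));
                    return out;
                }
            }
        } else {
            let end = rest.find(char::is_whitespace).unwrap_or(rest.len());
            out.push(BudgetTok::Word(&rest[..end]));
            rest = &rest[end..];
        }
        rest = rest.trim_start();
    }
    out
}
```

In this sketch an over-long pattern ends the token stream with UnknownRest, but the tokens before it remain available, so a caller can still report on the portion of the file that lexed cleanly.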
§Integration with perl-parser
The lexer is designed to work seamlessly with perl_parser::Parser:
use perl_parser::Parser;
let code = "sub hello { print qq{Hello, world!\\n}; }";
let mut parser = Parser::new(code);
let ast = parser.parse()?;
The parser automatically creates and manages a PerlLexer instance internally.
Re-exports§
pub use checkpoint::CheckpointCache;
pub use checkpoint::Checkpointable;
pub use checkpoint::LexerCheckpoint;
pub use error::LexerError;
pub use error::Result;
pub use mode::LexerMode;
pub use token::StringPart;
pub use token::Token;
pub use token::TokenType;
Modules§
- checkpoint
- Lexer checkpointing for incremental parsing
- error
- Error types for the Perl lexer
- mode
- Lexer modes for context-sensitive parsing
- token
- Token types and structures for the Perl lexer
Structs§
- LexerConfig
- Configuration for the lexer
- PerlLexer
- Mode-aware Perl lexer
- Position
- A position in a source file with byte offset, line, and column