pub struct LexerState<'s, S: Source + ?Sized, L: Language> { /* private fields */ }
State information for incremental lexical analysis.
This struct maintains the current position and context during tokenization, enabling incremental and resumable lexing operations.
Implementations
impl<'s, S: Source + ?Sized, L: Language> LexerState<'s, S, L>
pub fn new_with_cache(source: &'s S, relex_from: usize, cache: &impl LexerCache<L>) -> Self
Creates a new lexer state with the given source text and incremental cache.
§Arguments
source - The source text to lex
relex_from - The minimum byte offset that may have been affected by edits (use source.length() to indicate no edits)
cache - The incremental cache containing previous lexing results
§Returns
A new LexerState initialized at the beginning of the source with cache support
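§Examples
A minimal sketch, assuming the SimpleLanguage and ParseSession helpers used in the other examples on this page:
let source = SourceText::new("let x = 1;");
let cache = ParseSession::<SimpleLanguage>::new(16);
// Passing source.length() as relex_from signals that no edits occurred.
let state = LexerState::new_with_cache(&source, source.length(), &cache);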
pub fn rest(&mut self) -> &str
Gets the remaining text from the current position to the end of the source.
§Returns
A string slice containing the remaining text
pub fn rest_bytes(&mut self) -> &[u8]
Gets the remaining text as a byte slice.
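§Examples
A minimal sketch covering both rest and rest_bytes, assuming the SimpleLanguage helper used in the other examples on this page:
let source = SourceText::new("abc def");
let mut state = LexerState::<_, SimpleLanguage>::new(&source);
state.advance(4); // move past "abc "
assert_eq!(state.rest(), "def");
assert_eq!(state.rest_bytes(), b"def");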
pub fn fully_reused(&self) -> bool
Checks if the lexer has consumed all input from the source.
pub fn get_position(&self) -> usize
Gets the current byte offset position in the source text.
§Returns
The current byte offset from the start of the source text
pub fn get_length(&self) -> usize
Gets the total length of the source text in bytes.
pub fn get_char_at(&self, offset: usize) -> Option<char>
Gets a single character at the specified byte offset.
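§Examples
A minimal sketch, assuming the SimpleLanguage helper used in the other examples on this page:
let source = SourceText::new("ab");
let state = LexerState::<_, SimpleLanguage>::new(&source);
assert_eq!(state.get_char_at(1), Some('b'));
assert_eq!(state.get_char_at(2), None); // offset past the end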
pub fn advance_byte(&mut self) -> Option<u8>
Advances the cursor by one byte and returns that byte, or None if at the end of the source.
pub fn take_while_byte(&mut self, pred: impl FnMut(u8) -> bool) -> Range<usize>
Advances the cursor while the byte predicate is true.
pub fn skip_ascii_whitespace(&mut self) -> Range<usize>
Skips common ASCII whitespace using SIMD if possible.
pub fn skip_ascii_digits(&mut self) -> Range<usize>
Skips all ASCII digits at the current position.
pub fn skip_ascii_ident_continue(&mut self) -> Range<usize>
Skips all characters that can continue an ASCII identifier.
pub fn skip_until(&mut self, target: u8) -> Range<usize>
Skips all characters until the target byte is encountered.
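§Examples
A minimal sketch chaining the byte-level skip helpers, assuming the SimpleLanguage helper used in the other examples on this page:
#![feature(new_range_api)]
let source = SourceText::new("  42;");
let mut state = LexerState::<_, SimpleLanguage>::new(&source);
assert_eq!(state.skip_ascii_whitespace(), Range { start: 0, end: 2 });
assert_eq!(state.skip_ascii_digits(), Range { start: 2, end: 4 });
assert_eq!(state.skip_until(b';'), Range { start: 4, end: 4 }); // already at ';', nothing skipped
assert_eq!(state.current(), Some(';'));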
pub fn scan_ascii_identifier(&mut self, kind: L::TokenType) -> bool
Scans an ASCII identifier (starts with an ASCII letter or _, continues with ASCII alphanumerics or _).
pub fn scan_line_comment(&mut self, kind: L::TokenType, prefix: &str) -> bool
Scans a line comment starting with the given prefix.
pub fn scan_block_comment(&mut self, kind: L::TokenType, start_seq: &str, end_seq: &str) -> bool
Scans a block comment with given start and end sequences.
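§Examples
A minimal sketch, assuming these scanners record the token and advance past it on success (as the kind parameter suggests), with the SimpleLanguage and SimpleToken helpers used in the other examples on this page:
let source = SourceText::new("foo_bar = 1");
let mut state = LexerState::<_, SimpleLanguage>::new(&source);
assert!(state.scan_ascii_identifier(SimpleToken::Identifier));
assert_eq!(state.get_position(), 7); // past "foo_bar"
assert_eq!(state.tokens().len(), 1);
scan_line_comment and scan_block_comment follow the same pattern, taking the comment prefix or the start/end delimiter sequences as arguments.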
pub fn tokens(&self) -> &[Token<L::TokenType>]
Gets a reference to the tokens collected so far.
§Returns
A slice of tokens collected during the lexing process
pub fn set_position(&mut self, offset: usize) -> usize
Sets the current position to the specified byte offset.
pub fn get_text_in(&self, range: Range<usize>) -> Cow<'_, str>
Returns the text in the specified range.
pub fn get_text_from(&self, offset: usize) -> Cow<'_, str>
Returns the text from the specified offset to the end.
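§Examples
A minimal sketch, assuming the SimpleLanguage helper used in the other examples on this page:
#![feature(new_range_api)]
let source = SourceText::new("hello world");
let state = LexerState::<_, SimpleLanguage>::new(&source);
assert_eq!(state.get_text_in(Range { start: 0, end: 5 }), "hello");
assert_eq!(state.get_text_from(6), "world");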
pub fn starts_with(&mut self, pattern: &str) -> bool
Checks if the source starts with the given pattern at the current position.
pub fn consume_if_starts_with(&mut self, pattern: &str) -> bool
Consumes the pattern if it exists at the current position.
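§Examples
A minimal sketch contrasting the two lookahead methods, assuming the SimpleLanguage helper used in the other examples on this page:
let source = SourceText::new("=> rest");
let mut state = LexerState::<_, SimpleLanguage>::new(&source);
assert!(state.starts_with("=>")); // look ahead only
assert_eq!(state.get_position(), 0);
assert!(state.consume_if_starts_with("=>")); // match and advance
assert_eq!(state.get_position(), 2);
assert!(!state.consume_if_starts_with("=>")); // no match, no movement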
pub fn get_tokens(&self) -> &[Token<L::TokenType>]
Gets the tokens collected so far in the lexer state.
§Returns
A slice of tokens collected during lexing
pub fn add_token(&mut self, kind: L::TokenType, start: usize, end: usize)
Adds a token to the lexer state.
§Arguments
kind - The kind of the token
start - The starting byte offset of the token
end - The ending byte offset of the token
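§Examples
A minimal sketch, assuming the SimpleLanguage and SimpleToken helpers used in the other examples on this page. Note that add_token only records the token; use advance_with to record and advance in one step.
let source = SourceText::new("test");
let mut state = LexerState::<_, SimpleLanguage>::new(&source);
state.add_token(SimpleToken::Identifier, 0, 4);
assert_eq!(state.tokens().len(), 1);
assert_eq!(state.get_position(), 0); // the cursor did not move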
pub fn add_eof(&mut self)
Adds an end-of-file token to the lexer state.
This method creates and adds an END_OF_STREAM token at the current position. It’s typically called when the lexer reaches the end of the source text to mark the termination of the token stream.
§Examples
#![feature(new_range_api)]
let source = SourceText::new("test");
let mut state = LexerState::<_, SimpleLanguage>::new(&source);
state.take_while(|_| true); // Advance to end
state.add_eof();
assert_eq!(state.tokens().len(), 1);
assert_eq!(state.tokens()[0].span, Range { start: 4, end: 4 });
pub fn current(&mut self) -> Option<char>
Gets the character at the current position.
§Returns
The current character, or None if at the end of the source
pub fn peek(&mut self) -> Option<char>
Peeks at the next character without advancing the position.
§Returns
The next character, or None if at the end of the source
pub fn peek_next_n(&mut self, n: usize) -> Option<char>
Peeks at the character n positions ahead without advancing the position.
pub fn advance(&mut self, length: usize) -> usize
Advances the position by the specified number of bytes.
This method moves the lexer’s current position forward by the specified number of bytes. It’s commonly used after recognizing a token to move past the token’s characters.
§Arguments
length- The number of bytes to advance
§Returns
The new byte offset position after advancing
pub fn bump(&mut self) -> Option<char>
Advances the position by the current character’s length.
§Returns
The character that was skipped, or None if at the end of the source
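§Examples
A minimal sketch, assuming peek looks one character past current, with the SimpleLanguage helper used in the other examples on this page:
let source = SourceText::new("ab");
let mut state = LexerState::<_, SimpleLanguage>::new(&source);
assert_eq!(state.current(), Some('a'));
assert_eq!(state.peek(), Some('b'));
assert_eq!(state.bump(), Some('a')); // consume 'a'
assert_eq!(state.current(), Some('b'));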
pub fn advance_with(&mut self, token: Token<L::TokenType>) -> usize
Advances the position by the token’s length and adds the token to the lexer state.
This method combines two common operations: advancing the lexer position and adding a token to the token list. It calculates the advance distance from the token’s span, ensuring consistent positioning.
§Arguments
token- The token to add to the lexer state
§Returns
The new byte offset position after advancing
§Examples
#![feature(new_range_api)]
let source = SourceText::new("hello world");
let mut state = LexerState::<_, SimpleLanguage>::new(&source);
// Create a token for "hello"
let token = Token { kind: SimpleToken::Identifier, span: Range { start: 0, end: 5 } };
// Initially at position 0
assert_eq!(state.get_position(), 0);
// Advance and add the token
let new_pos = state.advance_with(token);
// Now at position 5 and token is added
assert_eq!(new_pos, 5);
assert_eq!(state.get_position(), 5);
assert_eq!(state.get_tokens().len(), 1);
assert_eq!(state.get_tokens()[0].kind, SimpleToken::Identifier);
§Note
The caller must ensure that the token’s span is valid and that the advance does not split multi-byte UTF-8 characters. The token should be created with proper character boundaries.
pub fn take_while(&mut self, pred: impl FnMut(char) -> bool) -> Range<usize>
Consumes characters while the predicate returns true, returning the consumed range.
This method iterates through the source text from the current position, consuming characters as long as the predicate function returns true. It’s commonly used for recognizing patterns like identifiers, numbers, or whitespace sequences.
§Arguments
pred- A closure that takes a character and returns true if the character should be consumed, false otherwise
§Returns
A byte range representing the span of consumed characters
§Examples
#![feature(new_range_api)]
let source = SourceText::new("hello123world");
let mut state = LexerState::<_, SimpleLanguage>::new(&source);
// Consume alphabetic characters
let range = state.take_while(|c| c.is_alphabetic());
// Should have consumed "hello"
assert_eq!(range, Range { start: 0, end: 5 });
assert_eq!(state.get_position(), 5);
// Consume numeric characters
let range = state.take_while(|c| c.is_numeric());
// Should have consumed "123"
assert_eq!(range, Range { start: 5, end: 8 });
assert_eq!(state.get_position(), 8);
§Performance Note
This method operates on a character-by-character basis, which means it correctly handles multi-byte UTF-8 characters. For performance-critical code, consider using byte-based methods when working with ASCII-only text.
pub fn not_at_end(&self) -> bool
Checks if the lexer has not reached the end of the source text.
§Returns
true if not at the end of the source, false otherwise
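§Examples
A minimal sketch, assuming the SimpleLanguage helper used in the other examples on this page:
let source = SourceText::new("ab");
let mut state = LexerState::<_, SimpleLanguage>::new(&source);
assert!(state.not_at_end());
state.advance(2);
assert!(!state.not_at_end());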
pub fn advance_if_dead_lock(&mut self, safe_point: usize)
Performs a safety check to prevent infinite loops during lexing.
This method ensures that the lexer always makes progress by forcing advancement when stuck at the same position. It’s used as a safeguard against infinite loops in lexer implementations.
The method compares the current position with a previously saved “safe point” position. If they’re the same, it means the lexer hasn’t made progress since that safe point, potentially indicating an infinite loop. In this case, the method forces advancement by at least one character.
§Arguments
safe_point- The position to check against for potential deadlock
§Examples
#![feature(new_range_api)]
let source = SourceText::new("test");
let mut state = LexerState::<_, SimpleLanguage>::new(&source);
// Save the current position as a safe point
let safe_point = state.get_position();
// In a real lexer, you would do some processing here
// If something went wrong and we didn't advance, this would prevent infinite loop
state.advance_if_dead_lock(safe_point);
// If we were stuck, we would have advanced by at least 1
assert!(state.get_position() >= safe_point);
§Usage in Lexer Implementations
This method is typically used at the beginning or end of lexing loops:
loop {
let safe_point = state.get_position();
// Try to recognize a token
if let Some(token) = try_recognize_token(&mut state) {
// Success, continue loop
continue;
}
// If we get here, we didn't recognize anything
// This prevents infinite loops if recognition fails
state.advance_if_dead_lock(safe_point);
if state.not_at_end() {
// Continue trying to recognize tokens
continue;
} else {
// Reached end of source
break;
}
}
pub fn finish(self, result: Result<(), OakError>) -> LexOutput<L>
Finishes lexing and returns the final output with tokens and diagnostics.
This method concludes the lexing process by converting the collected tokens
and errors into a LexOutput result. It takes a Result parameter that
represents the overall success or failure of the lexing operation.
If the result is Ok, the tokens are returned as the successful result.
If the result is Err, the error is returned as the failure result.
In both cases, any collected diagnostic errors are included in the output.
§Arguments
result- The result of the lexing operation (Ok for success, Err for failure)
§Returns
A LexOutput containing the tokens (if successful) and any diagnostic errors
§Examples
#![feature(new_range_api)]
let source = SourceText::new("test");
let mut state = LexerState::<_, SimpleLanguage>::new(&source);
// Add some tokens during lexing
state.add_token(SimpleToken::Identifier, 0, 4);
// Finish with successful result
let output = state.finish(Ok(()));
// Check the results
assert!(output.result.is_ok());
assert_eq!(output.result.unwrap().len(), 1);
assert_eq!(output.diagnostics.len(), 0);
// Example with error
let source2 = SourceText::new("test");
let mut state2 = LexerState::<_, SimpleLanguage>::new(&source2);
state2.add_error(OakError::custom_error("Test error"));
let output2 = state2.finish(Err(OakError::custom_error("Fatal error")));
// Check the results
assert!(output2.result.is_err());
assert_eq!(output2.diagnostics.len(), 1); // The added error
pub fn finish_with_cache(self, result: Result<(), OakError>, cache: &mut impl LexerCache<L>) -> LexOutput<L>
Finishes lexing and returns the final output with tokens, diagnostics, and updated cache.
This method is similar to finish but additionally updates the incremental cache
with the new tokens. It’s used for incremental lexing where the results need to
be cached for future reuse when the source text changes.
The method first creates the output in the same way as finish, then updates
the cache’s last_lex field with the new tokens. This enables the next call
to new_with_cache to reuse these tokens if the source text hasn’t changed.
§Arguments
result - The result of the lexing operation (Ok for success, Err for failure)
cache - The incremental cache to update with the new tokens
§Returns
A LexOutput containing the tokens (if successful) and any diagnostic errors
§Examples
#![feature(new_range_api)]
let source = SourceText::new("test");
let mut state = LexerState::<_, SimpleLanguage>::new(&source);
// Create a cache for incremental lexing
let mut cache = ParseSession::<SimpleLanguage>::new(16);
// Add some tokens during lexing
state.add_token(SimpleToken::Identifier, 0, 4);
// Finish with cache update
let output = state.finish_with_cache(Ok(()), &mut cache);
// Check the results
assert!(output.result.is_ok());
assert_eq!(output.result.unwrap().len(), 1);
§Incremental Lexing Workflow
This method is typically used as part of an incremental lexing workflow:
// First lexing
let mut state = LexerState::new_with_cache(source, source.length(), cache);
// ... lexing logic ...
let output = state.finish_with_cache(Ok(()), cache);
// Later, when source changes
let relex_from = calculate_min_affected_offset(old_source, new_source);
let mut state = LexerState::new_with_cache(new_source, relex_from, cache);
// ... lexing logic (reusing unchanged tokens) ...
let output = state.finish_with_cache(Ok(()), cache);