§Performance & Incremental Parsing Guide
oak-core is designed with the philosophy that explicit, manual optimization by the implementer outperforms generic, automated strategies. While frameworks like Tree-sitter provide automatic incrementality, oak-core gives you the surgical tools to achieve much higher peak performance by leveraging domain knowledge of your specific grammar.
§Why oak is Fast
Oak achieves industry-leading performance through several key architectural decisions:
- Surgical Incrementalism: Unlike “all-or-nothing” incremental parsers, Oak allows you to define specific “Reuse Anchors” (`try_reuse`). This means you only pay for the complexity you actually use, and reuse hit rates can be optimized manually for your specific language structure.
- Specialized `SyntaxArena`: We use a custom bump allocator that outperforms general-purpose allocators (like `bumpalo`) by enforcing Zero-Drop for POD AST nodes, using Fixed 8-byte Alignment, and maintaining a Thread-Local Chunk Pool to minimize global allocator pressure.
- SIMD-Accelerated Lexing: Our lexer uses `portable_simd` to scan source text 32 bytes at a time. Operations like skipping whitespace or finding delimiters are effectively $O(N/32)$.
- Separated Green/Red Trees: By separating structural data (Green Nodes) from identity and position (Red Nodes), Oak maximizes memory sharing and cache locality during incremental updates.
- Compiler-Level Hints: We use `core_intrinsics` like `likely`/`unlikely` and `trusted_len` to guide the Rust compiler in optimizing the most critical hot paths in the parser loop.
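The chunked-scan idea behind the $O(N/32)$ claim can be sketched in stable Rust. This is a scalar approximation, not the crate's actual lexer: the real implementation uses nightly `portable_simd` 32-lane vectors, while here the "all lanes are whitespace" mask check is modeled with a plain iterator over each 32-byte chunk.

```rust
// Scalar sketch of SIMD-style chunked scanning: examine the input 32
// bytes at a time, skipping a whole chunk in one step when every byte
// is whitespace, and fall back to byte-by-byte scanning otherwise.
const CHUNK: usize = 32;

fn skip_whitespace(src: &[u8], mut pos: usize) -> usize {
    while pos + CHUNK <= src.len() {
        let chunk = &src[pos..pos + CHUNK];
        // A SIMD lexer compares all 32 lanes at once; `all` models the
        // "every lane matched" movemask check.
        if chunk.iter().all(|b| b.is_ascii_whitespace()) {
            pos += CHUNK; // whole chunk skipped in one step
        } else {
            break; // non-whitespace inside this chunk: finish per byte
        }
    }
    while pos < src.len() && src[pos].is_ascii_whitespace() {
        pos += 1;
    }
    pos
}

fn main() {
    // 35 leading spaces: one 32-byte chunk skip plus 3 scalar steps.
    let src = [vec![b' '; 35], b"let x = 1;".to_vec()].concat();
    let start = skip_whitespace(&src, 0);
    assert_eq!(&src[start..], &b"let x = 1;"[..]);
}
```

A real SIMD version replaces the inner `all` with a vector comparison and a mask test, but the control flow is the same.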
§Common Performance Pitfalls
If you are not seeing the expected performance, check for these common traps:
- Missing Reuse Anchors: The most common mistake is not calling `try_reuse` at high-level nodes (e.g., functions, classes, blocks). Without these anchors, Oak defaults to a full re-parse.
- Dirty Range Bloat: If you merge multiple small, disconnected edits into a single large `TextEdit` span, you invalidate large sections of the tree unnecessarily. Keep your dirty ranges as tight as possible.
- Lexer Bottleneck: If your lexer re-scans the entire file on every edit, it becomes an $O(N)$ bottleneck that incremental parsing cannot solve. Use an anchor-based or stateful lexer that only re-lexes the changed portion.
- Long-Distance Lookahead: Using `try_reuse` on nodes that depend on context far outside their own span (e.g., a node whose meaning changes based on a token 1000 lines away) will cause frequent reuse failures or incorrect trees.
- Arena Fragmentation: Repeatedly re-parsing without successful reuse can lead to high memory churn. Ensure your grammar is structured to maximize `try_reuse` success at the highest possible levels.
- Excessive Checkpointing: While checkpoints are necessary, creating them for every single tiny node can add overhead. Focus checkpoints on meaningful nodes that are candidates for reuse.
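The dirty-range-bloat pitfall is easy to quantify. In this sketch a plain `Range<usize>` stands in for the crate's `TextEdit` span (the real type's fields may differ): two tiny edits far apart dirty only a few bytes on their own, but one merged span invalidates everything between them.

```rust
use std::ops::Range;

/// Cover both edits with a single span (the bloated approach).
fn merge(a: &Range<usize>, b: &Range<usize>) -> Range<usize> {
    a.start.min(b.start)..a.end.max(b.end)
}

fn main() {
    let edit_a = 10..12;       // small edit near the top of the file
    let edit_b = 5_000..5_004; // small edit thousands of bytes later
    let merged = merge(&edit_a, &edit_b);

    // The edits themselves dirty 6 bytes; the merged span dirties ~5k,
    // invalidating every node in between.
    let actual_dirty = (edit_a.end - edit_a.start) + (edit_b.end - edit_b.start);
    let merged_dirty = merged.end - merged.start;
    assert_eq!(actual_dirty, 6);
    assert_eq!(merged_dirty, 4_994);
}
```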
§Requirements for Implementers
To achieve maximum performance and efficient incremental reuse, implementers must follow these requirements:
- Explicit Reuse Anchors: You must manually call `state.try_reuse(Kind)` at high-level grammar nodes that are likely to remain unchanged (e.g., Function definitions, Struct items, Statement blocks).
- Stateless Lookahead: Ensure that the nodes you attempt to reuse do not depend on long-distance context (lookahead) that exceeds the node’s boundaries. If a node’s meaning changes based on tokens far outside of it, incremental reuse may lead to incorrect trees.
- Strict Checkpointing: Always use `state.checkpoint()` and `state.finish_at()` to wrap nodes. This ensures the internal tree sink maintains a clean stack, which is critical for the `try_reuse` logic to correctly skip tokens.
- Token Synchronization: Your lexer should produce tokens with accurate spans. Incremental parsing relies on mapping the new source offsets back to the old tree’s coordinate system.
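The anchor pattern these requirements describe might look like the following. `ParserState` here is a minimal mock, not oak-core's real type, and the method signatures are assumptions; the point is only the calling shape: attempt reuse first, and otherwise wrap the full parse in a checkpoint/finish pair.

```rust
// Mock of the reuse-anchor pattern: try_reuse first, then fall back to
// checkpoint + full parse + finish_at. The real oak-core API may differ.
enum Kind { Function }

struct ParserState {
    reusable: bool, // stands in for "old tree has an unchanged node here"
    events: Vec<String>,
}

impl ParserState {
    fn checkpoint(&mut self) -> usize {
        self.events.push("checkpoint".into());
        self.events.len() - 1
    }
    fn try_reuse(&mut self, _kind: Kind) -> bool {
        if self.reusable {
            self.events.push("reused".into());
        }
        self.reusable
    }
    fn finish_at(&mut self, _cp: usize, _kind: Kind) {
        self.events.push("finish".into());
    }
    fn parse_body(&mut self) {
        self.events.push("full-parse".into());
    }
}

fn parse_function(state: &mut ParserState) {
    // Reuse anchor: if the old tree has an identical Function node
    // covering this span, skip parsing its body entirely.
    if state.try_reuse(Kind::Function) {
        return;
    }
    let cp = state.checkpoint();
    state.parse_body(); // fall back to a normal parse
    state.finish_at(cp, Kind::Function);
}

fn main() {
    let mut fresh = ParserState { reusable: false, events: vec![] };
    parse_function(&mut fresh);
    assert_eq!(fresh.events, ["checkpoint", "full-parse", "finish"]);

    let mut cached = ParserState { reusable: true, events: vec![] };
    parse_function(&mut cached);
    assert_eq!(cached.events, ["reused"]);
}
```

Note how the checkpoint strictly wraps the node: this is the "clean stack" discipline the Strict Checkpointing requirement asks for.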
§Optimization Strategies
§1. Granular Reuse
Instead of trying to reuse the entire source file, break your grammar into independent components. For example, in Rust:
- Try reuse at the Item level (Functions, Structs).
- Try reuse at the Statement level inside blocks.
- Try reuse at the Expression level for large literals or complex sub-trees.
§2. Anchor-based Lexing
Implement a lexer that can “resync.” When an edit occurs, find the nearest stable anchor (like a newline or a specific keyword) and only re-lex from that point until the token stream aligns with the previous generation.
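A toy version of this resync strategy, using newlines as anchors and a trivial whitespace-separated lexer (the `Token`, `lex`, and `anchor_before` names are illustrative, not oak-core API): rewind to the newline before the edit, lex forward, and find where the fresh tokens realign with the old stream shifted by the edit delta.

```rust
// Anchor-based re-lexing sketch: after an edit, re-lex only from the
// nearest newline anchor until the token stream aligns with the old one.
#[derive(PartialEq, Debug)]
struct Token { text: String, start: usize }

fn lex(src: &str, from: usize) -> Vec<Token> {
    let bytes = src.as_bytes();
    let mut out = Vec::new();
    let mut i = from;
    while i < bytes.len() {
        while i < bytes.len() && bytes[i].is_ascii_whitespace() { i += 1; }
        let start = i;
        while i < bytes.len() && !bytes[i].is_ascii_whitespace() { i += 1; }
        if i > start {
            out.push(Token { text: src[start..i].into(), start });
        }
    }
    out
}

/// Nearest stable anchor: the newline preceding the edit (or file start).
fn anchor_before(src: &str, edit_start: usize) -> usize {
    src[..edit_start].rfind('\n').map_or(0, |p| p + 1)
}

fn main() {
    let old_src = "let a = 1;\nlet b = 2;\nlet c = 3;\n";
    let new_src = "let a = 1;\nlet bb = 2;\nlet c = 3;\n";
    let edit_start = 15; // the edited identifier on line two

    // Rewind to the anchor; tokens before it are untouched.
    let anchor = anchor_before(new_src, edit_start);
    assert_eq!(anchor, 11); // start of the edited line

    let fresh = lex(new_src, anchor);
    let old_tail = lex(old_src, anchor);

    // Resync: the first fresh token equal to an old token shifted by the
    // edit delta marks where the streams align again.
    let delta = new_src.len() as isize - old_src.len() as isize;
    let resync = fresh.iter().position(|t| old_tail.iter().any(|o| {
        o.text == t.text && t.start as isize == o.start as isize + delta
    }));
    assert_eq!(resync, Some(2)); // aligns at "=", right after "bb"
}
```

A production lexer would also carry lexer state (string/comment modes) at each anchor, since a newline inside a block comment is not a safe resync point.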
§3. Multi-segment Dirty Tracking
Currently, a single dirty_range is used. For advanced scenarios, implement a multi-segment tracker that allows the parser to “jump” over multiple edited areas while reusing the stable code in between.
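A multi-segment tracker could be sketched as below. The `DirtyTracker` type is hypothetical, not part of oak-core's current API: it keeps each edit as its own segment, coalescing only genuine overlaps, so a node whose span avoids every segment remains a reuse candidate.

```rust
use std::ops::Range;

// Hypothetical multi-segment dirty tracker: distant edits stay separate,
// so the parser can "jump" over them and reuse the stable code between.
#[derive(Default)]
struct DirtyTracker { segments: Vec<Range<usize>> }

impl DirtyTracker {
    fn mark(&mut self, edit: Range<usize>) {
        self.segments.push(edit);
        self.segments.sort_by_key(|r| r.start);
        // Coalesce only touching/overlapping segments, never distant ones.
        let mut merged: Vec<Range<usize>> = Vec::new();
        for seg in self.segments.drain(..) {
            match merged.last_mut() {
                Some(last) if seg.start <= last.end => last.end = last.end.max(seg.end),
                _ => merged.push(seg),
            }
        }
        self.segments = merged;
    }

    /// A node whose span avoids every dirty segment is a reuse candidate.
    fn is_clean(&self, span: &Range<usize>) -> bool {
        self.segments.iter().all(|d| d.end <= span.start || span.end <= d.start)
    }
}

fn main() {
    let mut dirty = DirtyTracker::default();
    dirty.mark(10..12);
    dirty.mark(5_000..5_004);
    assert_eq!(dirty.segments.len(), 2); // distant edits stay separate

    // The stable code between the edits remains reusable...
    assert!(dirty.is_clean(&(100..4_000)));
    // ...while spans overlapping an edit are not.
    assert!(!dirty.is_clean(&(5_002..5_010)));
}
```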
§4. Memory Locality (Node Promotion)
When a node is reused from the Old Arena, it stays there. If you have many generations of edits, your tree might be scattered across multiple arenas. Periodically performing a “Deep Clone” of reused nodes into the Active Arena can improve cache locality for subsequent traversals.
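The promotion idea reduces to copying a node out of an old generation into the current one. In this sketch, `Vec`-backed arenas stand in for `SyntaxArena` and `Node`/`promote` are illustrative names; the point is that after promotion, traversals read from one contiguous allocation instead of chasing old generations.

```rust
// Sketch of "node promotion": deep-clone a reused node from an old arena
// generation into the active arena to restore cache locality.
#[derive(Clone)]
struct Node { kind: u16, len: u32 }

struct Arena { nodes: Vec<Node> }

impl Arena {
    fn new() -> Self { Arena { nodes: Vec::new() } }
    fn alloc(&mut self, node: Node) -> usize {
        self.nodes.push(node);
        self.nodes.len() - 1
    }
}

/// Clone a node (here a single POD node; a real version would clone the
/// whole subtree) from an old arena into the active one.
fn promote(old: &Arena, idx: usize, active: &mut Arena) -> usize {
    active.alloc(old.nodes[idx].clone())
}

fn main() {
    let mut gen0 = Arena::new();
    let id = gen0.alloc(Node { kind: 7, len: 128 });

    let mut active = Arena::new();
    let new_id = promote(&gen0, id, &mut active);
    assert_eq!(active.nodes[new_id].kind, 7);
    assert_eq!(active.nodes[new_id].len, 128);
    // gen0 can now be dropped once nothing references it.
}
```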
§5. Bypass Validation (Unsafe Reuse)
For specific, high-frequency nodes that you know are strictly isolated, you can implement a “Fast Path” that reuses nodes based purely on their Kind and Length, skipping the expensive token-by-token validation.
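Such a fast path might reduce to a single comparison. `GreenNode` here is a simplified stand-in for the crate's type, and `fast_path_reusable` is a hypothetical helper; the trade-off is explicit: two different token sequences with the same kind and length would be wrongly treated as identical, so this is only safe for nodes you know are strictly isolated.

```rust
// "Unsafe reuse" fast path: compare only kind and length, skipping
// token-by-token validation. Safe only for strictly isolated nodes.
#[derive(Clone, Copy, PartialEq)]
enum Kind { StringLiteral }

struct GreenNode { kind: Kind, len: u32 }

/// O(1) reuse check on kind + byte length only.
fn fast_path_reusable(old: &GreenNode, kind: Kind, new_len: u32) -> bool {
    old.kind == kind && old.len == new_len
}

fn main() {
    let cached = GreenNode { kind: Kind::StringLiteral, len: 42 };
    // Same kind and byte length: reuse without walking the tokens.
    assert!(fast_path_reusable(&cached, Kind::StringLiteral, 42));
    // Length changed: must fall back to full validation or re-parse.
    assert!(!fast_path_reusable(&cached, Kind::StringLiteral, 40));
}
```

Adding a cheap content hash to the length check is a common middle ground that closes most of the false-positive window without restoring the full token walk.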
Re-exports§
pub use crate::builder::Builder;
pub use crate::builder::BuilderCache;
pub use crate::errors::OakDiagnostics;
pub use crate::errors::OakError;
pub use crate::errors::OakErrorKind;
pub use crate::language::ElementRole;
pub use crate::language::ElementType;
pub use crate::language::Language;
pub use crate::language::LanguageCategory;
pub use crate::language::TokenRole;
pub use crate::language::TokenType;
pub use crate::language::UniversalElementRole;
pub use crate::language::UniversalTokenRole;
pub use crate::lexer::LexOutput;
pub use crate::lexer::Lexer;
pub use crate::lexer::LexerCache;
pub use crate::lexer::LexerState;
pub use crate::lexer::Token;
pub use crate::lexer::Tokens;
pub use crate::memory::arena::SyntaxArena;
pub use crate::parser::Associativity;
pub use crate::parser::OperatorInfo;
pub use crate::parser::ParseCache;
pub use crate::parser::ParseOutput;
pub use crate::parser::ParseSession;
pub use crate::parser::Parser;
pub use crate::parser::ParserState;
pub use crate::parser::Pratt;
pub use crate::parser::PrattParser;
pub use crate::parser::binary;
pub use crate::parser::postfix;
pub use crate::parser::state::TreeSink;
pub use crate::parser::unary;
pub use crate::source::Source;
pub use crate::source::SourceText;
pub use crate::source::TextEdit;
pub use crate::tree::GreenNode;
pub use crate::tree::GreenTree;
pub use crate::tree::RedNode;
pub use crate::tree::RedTree;
Modules§
- builder
- Incremental tree builder and cache management.
- errors
- Error handling and diagnostic reporting for the parsing system.
- helpers
- Helper utilities for common operations in the Oak Core parsing framework.
- language
- Language definition trait for coordinating language-specific components.
- lexer
- Lexical analysis and tokenization functionality.
- memory
- Memory management utilities (Arena, Bump).
- parser
- Parsing functionality for converting tokens to kind trees.
- serde_range
- Serde support for `core::range::Range`.
- source
- Source text management and location tracking for incremental parsing.
- tree
- Red-green tree implementation for efficient kind tree representation.
- visitor
- Tree traversal and transformation utilities.