# ShiftKit
A generic LALR(1) parser generator with a flat, index-based AST representation.
## Design Philosophy
Unlike traditional parser generators that produce recursive AST structures with `Box<AstNode>` references, ShiftKit takes a different approach: **the parser returns a flat `Vec` of AST nodes**, where nodes reference their children by index rather than by pointer.
This design offers several advantages:
- **Memory efficiency**: No pointer overhead, better cache locality
- **Predictable memory layout**: All nodes stored contiguously
- **Easy traversal**: Iterate through the entire AST with a simple loop
## Key Concepts
### Token Types
Tokens require unique IDs for grammar definitions via **`TokenType(u32)`**.
Your custom token type must implement:
```rust
pub trait HasTokenType {
fn token_type(&self) -> TokenType;
}
```
This lets you define tokens however you want (enums, structs, etc.) while the parser uses `token_type()` to identify them during parsing.
### AST Node Types - Grammar Only
**`AstNodeType(u32)`** is used **only for defining grammar rules**, not for the AST nodes themselves.
**Key insight:** The parser tracks which `AstNodeType` each node has internally based on which grammar rule created it.
Your AST nodes don't need to store or know their type - that's the parser's job.
This means your AST can be extremely simple:
```rust
#[derive(Debug, Clone)]
enum AstNode {
Number(i64),
BinOp(AstNodeId, BinOpType, AstNodeId),
}
```
You control precedence through grammar structure (e.g., separate `VALUE`, `PRODUCT`, `SUM` node types in the grammar), but all binary operations map to the same simple `BinOp` variant in your AST.
### Reduction Results: New Nodes vs Pass-Through
Reduction functions receive indices corresponding to the grammar rule symbols and return a `ReductionResult`:
```rust
// Reduction function signature
type ReductionFn<T, A> = fn(indices: &[Index], tokens: &[T], nodes: &[A]) -> ReductionResult<A>;
// Index is either a token position or AST node position
pub enum Index {
Token(TokenId),
AstNode(AstNodeId),
}
// Result is either a new node or a forwarded node
pub enum ReductionResult<A> {
/// Create a new AST node and append it to the Vec
NewNode(A),
/// Reuse an existing AST node (for pass-through rules like `Sum -> Product`)
Forward(AstNodeId),
}
```
This avoids duplicating nodes in the Vec for pass-through grammar rules.
### Putting It Together
The key design insights:
1. **`AstNodeType` for precedence**: The grammar uses types like `NODE_VALUE`, `NODE_PRODUCT`, `NODE_SUM` to encode precedence, but your actual AST is simple (just `Number` and `BinOp` in this example).
2. **Parser tracks types internally**: When a reduction fires, the parser knows which `AstNodeType` it produced (from the grammar rule's left-hand side).
You never need to store or compute this in your AST.
3. **`ReductionResult` enables pass-through**: Rules like `Sum -> Product` just forward the existing node instead of cloning it, keeping the Vec compact.
4. **Simple AST, complex grammar**: Control precedence and parsing behavior through grammar structure, while keeping your AST focused on semantics.
### Index-Based AST Structure
When the parser processes input, it:
1. Takes a slice of tokens as input
2. Returns a `Vec` of AST nodes
3. The last element in the Vec is the root/outermost node
4. Child nodes appear earlier in the Vec
AST nodes reference their children using **`AstNodeId`** (index into the AST Vec) and can optionally reference **`TokenId`** (index into the token slice).
## Usage
### 1. Define Your Token and AST Node Types
**Tokens** need `HasTokenType` implemented.
```rust
use shiftkit::{TokenType, HasTokenType, AstNodeId};
// Token type IDs (for grammar definition)
const TOKEN_NUMBER: TokenType = TokenType(0);
const TOKEN_PLUS: TokenType = TokenType(1);
const TOKEN_MINUS: TokenType = TokenType(2);
const TOKEN_STAR: TokenType = TokenType(3);
const TOKEN_SLASH: TokenType = TokenType(4);
const TOKEN_LPAREN: TokenType = TokenType(5);
const TOKEN_RPAREN: TokenType = TokenType(6);
// Your custom token type
#[derive(Debug, Clone)]
enum Token {
Number(i64),
Plus,
Minus,
Star,
Slash,
LParen,
RParen,
}
impl HasTokenType for Token {
fn token_type(&self) -> TokenType {
match self {
Self::Number(_) => TOKEN_NUMBER,
Self::Plus => TOKEN_PLUS,
Self::Minus => TOKEN_MINUS,
Self::Star => TOKEN_STAR,
Self::Slash => TOKEN_SLASH,
Self::LParen => TOKEN_LPAREN,
Self::RParen => TOKEN_RPAREN,
}
}
}
// Your custom AST node types
#[derive(Debug, Clone)]
enum BinOpType {
Add,
Sub,
Mul,
Div,
}
#[derive(Debug, Clone)]
enum AstNode {
Number(i64),
BinOp(AstNodeId, BinOpType, AstNodeId),
}
```
### 2. Define Production Rules
Control precedence through grammar structure, not AST structure:
```rust
use shiftkit::{Grammar, AstNodeType, ReductionResult, Index};
// AST node types (for grammar only - not stored in actual AST!)
const NODE_VALUE: AstNodeType = AstNodeType(0); // Atoms: numbers, parenthesized exprs
const NODE_PRODUCT: AstNodeType = AstNodeType(1); // *, / (higher precedence)
const NODE_SUM: AstNodeType = AstNodeType(2); // +, - (lower precedence)
let mut grammar = Grammar::new();
// Value -> NUMBER
grammar.add_rule(
NODE_VALUE,
&[TOKEN_NUMBER.into()],
|indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
let tok_id = indices[0].as_token_id();
let Token::Number(num) = tokens[tok_id] else { unreachable!() };
ReductionResult::NewNode(AstNode::Number(num))
}
);
// Value -> ( Sum )
grammar.add_rule(
NODE_VALUE,
&[TOKEN_LPAREN.into(), NODE_SUM.into(), TOKEN_RPAREN.into()],
|indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
// Just forward the inner expression - don't duplicate it!
ReductionResult::Forward(indices[1].as_ast_node_id())
}
);
// Product -> Value (pass-through)
grammar.add_rule(
NODE_PRODUCT,
&[NODE_VALUE.into()],
|indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
ReductionResult::Forward(indices[0].as_ast_node_id())
}
);
// Product -> Product * Value
grammar.add_rule(
NODE_PRODUCT,
&[NODE_PRODUCT.into(), TOKEN_STAR.into(), NODE_VALUE.into()],
|indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
let lhs = indices[0].as_ast_node_id();
let rhs = indices[2].as_ast_node_id();
ReductionResult::NewNode(AstNode::BinOp(lhs, BinOpType::Mul, rhs))
}
);
// Sum -> Product (pass-through)
grammar.add_rule(
NODE_SUM,
&[NODE_PRODUCT.into()],
|indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
ReductionResult::Forward(indices[0].as_ast_node_id())
}
);
// Sum -> Sum + Product
grammar.add_rule(
NODE_SUM,
&[NODE_SUM.into(), TOKEN_PLUS.into(), NODE_PRODUCT.into()],
|indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
let lhs = indices[0].as_ast_node_id();
let rhs = indices[2].as_ast_node_id();
ReductionResult::NewNode(AstNode::BinOp(lhs, BinOpType::Add, rhs))
}
);
```
**Key points:**
- `AstNodeType` variants (`NODE_VALUE`, `NODE_PRODUCT`, `NODE_SUM`) encode precedence in the grammar
- Your actual `AstNode` enum is simple - just `Number` and `BinOp`
- Use `ReductionResult::Forward()` for pass-through rules to avoid duplicating nodes
- Use `ReductionResult::NewNode()` to create new AST nodes
### 3. Parse Input
Build the parser from your grammar and use it to parse token slices:
```rust
use shiftkit::Parser;
// Build the LALR(1) parser from the grammar
let parser: Parser<Token, AstNode> = Parser::from_grammar(grammar)?;
// Parse your input tokens
let tokens = tokenize("1 + 2 * 3");
let ast_nodes = parser.parse(&tokens)?;
// The root node is at the end of the Vec
let root_id = ast_nodes.len() - 1;
// Traverse the AST using indices
fn evaluate(node_id: AstNodeId, nodes: &[AstNode]) -> i64 {
match &nodes[node_id] {
AstNode::Number(n) => *n,
AstNode::BinOp(lhs, op, rhs) => {
let left = evaluate(*lhs, nodes);
let right = evaluate(*rhs, nodes);
match op {
BinOpType::Add => left + right,
BinOpType::Sub => left - right,
BinOpType::Mul => left * right,
BinOpType::Div => left / right,
}
}
}
}
let result = evaluate(root_id, &ast_nodes);
```
The parser is generic over **tokens** (requiring `HasTokenType`) and **any AST node type**:
```rust
pub struct Parser<T: HasTokenType, A> {
// ... parsing tables
}
impl<T: HasTokenType, A> Parser<T, A> {
pub fn parse(&self, tokens: &[T]) -> Result<Vec<A>, ParseError> {
// Uses `token.token_type()` for parsing decisions
// Tracks `AstNodeType` internally based on which grammar rules fire
}
}
```
## Benefits
### Memory Efficiency
- No `Box` allocations for each node
- Contiguous memory layout improves cache performance
- Smaller memory footprint for large ASTs
### Simplicity
- No lifetime management for AST references
- Easy to implement `Copy` semantics for indices
- Straightforward serialization (just write the Vec)
### Flexibility
- Easy to implement multiple passes over the AST
- Can maintain separate metadata vectors indexed by node ID
- Simple to implement AST transformations (just modify the Vec)
### Debugging
- Print the entire AST as a simple Vec
- Node indices provide stable identifiers across transformations
- Easy to visualize the parsing order (nodes are added in reduction order)
## Example: Complete Calculator Grammar
A full example showing precedence control with a simple AST:
```rust
use shiftkit::{Grammar, TokenType, AstNodeType, ReductionResult, Index, AstNodeId};
// Token type IDs (for grammar)
const TOKEN_NUMBER: TokenType = TokenType(0);
const TOKEN_PLUS: TokenType = TokenType(1);
const TOKEN_MINUS: TokenType = TokenType(2);
const TOKEN_STAR: TokenType = TokenType(3);
const TOKEN_SLASH: TokenType = TokenType(4);
const TOKEN_LPAREN: TokenType = TokenType(5);
const TOKEN_RPAREN: TokenType = TokenType(6);
// AST node type IDs (for grammar only - control precedence)
const NODE_VALUE: AstNodeType = AstNodeType(0); // Atoms
const NODE_PRODUCT: AstNodeType = AstNodeType(1); // *, /
const NODE_SUM: AstNodeType = AstNodeType(2); // +, -
// Your actual AST types (no precedence info needed!)
#[derive(Debug, Clone)]
enum BinOpType { Add, Sub, Mul, Div }
#[derive(Debug, Clone)]
enum AstNode {
Number(i64),
BinOp(AstNodeId, BinOpType, AstNodeId),
}
let mut grammar = Grammar::new();
// Value -> NUMBER
grammar.add_rule(NODE_VALUE, &[TOKEN_NUMBER.into()],
|indices, tokens, nodes| {
let Token::Number(n) = tokens[indices[0].as_token_id()] else { unreachable!() };
ReductionResult::NewNode(AstNode::Number(n))
}
);
// Value -> ( Sum )
grammar.add_rule(NODE_VALUE,
&[TOKEN_LPAREN.into(), NODE_SUM.into(), TOKEN_RPAREN.into()],
|indices, tokens, nodes| {
ReductionResult::Forward(indices[1].as_ast_node_id())
}
);
// Product -> Value
grammar.add_rule(NODE_PRODUCT, &[NODE_VALUE.into()],
|indices, tokens, nodes| {
ReductionResult::Forward(indices[0].as_ast_node_id())
}
);
// Product -> Product * Value
grammar.add_rule(NODE_PRODUCT,
&[NODE_PRODUCT.into(), TOKEN_STAR.into(), NODE_VALUE.into()],
|indices, tokens, nodes| {
ReductionResult::NewNode(AstNode::BinOp(
indices[0].as_ast_node_id(),
BinOpType::Mul,
indices[2].as_ast_node_id()
))
}
);
// Product -> Product / Value
grammar.add_rule(NODE_PRODUCT,
&[NODE_PRODUCT.into(), TOKEN_SLASH.into(), NODE_VALUE.into()],
|indices, tokens, nodes| {
ReductionResult::NewNode(AstNode::BinOp(
indices[0].as_ast_node_id(),
BinOpType::Div,
indices[2].as_ast_node_id()
))
}
);
// Sum -> Product
grammar.add_rule(NODE_SUM, &[NODE_PRODUCT.into()],
|indices, tokens, nodes| {
ReductionResult::Forward(indices[0].as_ast_node_id())
}
);
// Sum -> Sum + Product
grammar.add_rule(NODE_SUM,
&[NODE_SUM.into(), TOKEN_PLUS.into(), NODE_PRODUCT.into()],
|indices, tokens, nodes| {
ReductionResult::NewNode(AstNode::BinOp(
indices[0].as_ast_node_id(),
BinOpType::Add,
indices[2].as_ast_node_id()
))
}
);
// Sum -> Sum - Product
grammar.add_rule(NODE_SUM,
&[NODE_SUM.into(), TOKEN_MINUS.into(), NODE_PRODUCT.into()],
|indices, tokens, nodes| {
ReductionResult::NewNode(AstNode::BinOp(
indices[0].as_ast_node_id(),
BinOpType::Sub,
indices[2].as_ast_node_id()
))
}
);
```
**Result:** Full precedence control (parentheses, `*`/`/`, `+`/`-`) with a simple two-variant AST!
## License
MIT