shiftkit 0.2.0 - Docs.rs

# ShiftKit

A generic LALR(1) parser generator with a flat, index-based AST representation.

## Design Philosophy

Unlike traditional parser generators that produce recursive AST structures with `Box<AstNode>` references, ShiftKit takes a different approach: **the parser returns a flat `Vec` of AST nodes**, where nodes reference their children by index rather than by pointer.

This design offers several advantages:
- **Memory efficiency**: No pointer overhead, better cache locality
- **Predictable memory layout**: All nodes stored contiguously
- **Easy traversal**: Iterate through the entire AST with a simple loop

## Key Concepts

### Token Types

Tokens require unique IDs for grammar definitions via **`TokenType(u32)`**.
Your custom token type must implement:

```rust
pub trait HasTokenType {
    fn token_type(&self) -> TokenType;
}
```

This lets you define tokens however you want (enums, structs, etc.) while the parser uses `token_type()` to identify them during parsing.

### AST Node Types - Grammar Only

**`AstNodeType(u32)`** is used **only for defining grammar rules**, not for the AST nodes themselves.

**Key insight:** The parser tracks which `AstNodeType` each node has internally based on which grammar rule created it.
Your AST nodes don't need to store or know their type - that's the parser's job.

This means your AST can be extremely simple:

```rust
#[derive(Debug, Clone)]
enum AstNode {
    Number(i64),
    BinOp(AstNodeId, BinOpType, AstNodeId),
}
```

You control precedence through grammar structure (e.g., separate `VALUE`, `PRODUCT`, `SUM` node types in the grammar), but all binary operations map to the same simple `BinOp` variant in your AST.

### Reduction Results: New Nodes vs Pass-Through

Reduction functions receive indices corresponding to the grammar rule symbols and return a `ReductionResult`:

```rust
// Reduction function signature
type ReductionFn<T, A> = fn(indices: &[Index], tokens: &[T], nodes: &[A]) -> ReductionResult<A>;

// Index is either a token position or AST node position
pub enum Index {
    Token(TokenId),
    AstNode(AstNodeId),
}

// Result is either a new node or a forwarded node
pub enum ReductionResult<A> {
    /// Create a new AST node and append it to the Vec
    NewNode(A),
    /// Reuse an existing AST node (for pass-through rules like `Sum -> Product`)
    Forward(AstNodeId),
}
```

This avoids duplicating nodes in the Vec for pass-through grammar rules.

### Putting It Together

The key design insights:

1. **`AstNodeType` for precedence**: The grammar uses types like `NODE_VALUE`, `NODE_PRODUCT`, `NODE_SUM` to encode precedence, but your actual AST is simple (just `Number` and `BinOp` in this example).

2. **Parser tracks types internally**: When a reduction fires, the parser knows which `AstNodeType` it produced (from the grammar rule's left-hand side).
You never need to store or compute this in your AST.

3. **`ReductionResult` enables pass-through**: Rules like `Sum -> Product` just forward the existing node instead of cloning it, keeping the Vec compact.

4. **Simple AST, complex grammar**: Control precedence and parsing behavior through grammar structure, while keeping your AST focused on semantics.

### Index-Based AST Structure

When the parser processes input, it:
1. Takes a slice of tokens as input
2. Returns a `Vec` of AST nodes
3. The last element in the Vec is the root/outermost node
4. Child nodes appear earlier in the Vec

AST nodes reference their children using **`AstNodeId`** (index into the AST Vec) and can optionally reference **`TokenId`** (index into the token slice).

## Usage

### 1. Define Your Token and AST Node Types

**Tokens** need `HasTokenType` implemented.

```rust
use shiftkit::{TokenType, HasTokenType, AstNodeId};

// Token type IDs (for grammar definition)
const TOKEN_NUMBER: TokenType = TokenType(0);
const TOKEN_PLUS: TokenType = TokenType(1);
const TOKEN_MINUS: TokenType = TokenType(2);
const TOKEN_STAR: TokenType = TokenType(3);
const TOKEN_SLASH: TokenType = TokenType(4);
const TOKEN_LPAREN: TokenType = TokenType(5);
const TOKEN_RPAREN: TokenType = TokenType(6);

// Your custom token type
#[derive(Debug, Clone)]
enum Token {
    Number(i64),
    Plus,
    Minus,
    Star,
    Slash,
    LParen,
    RParen,
}

impl HasTokenType for Token {
    fn token_type(&self) -> TokenType {
        match self {
            Self::Number(_) => TOKEN_NUMBER,
            Self::Plus => TOKEN_PLUS,
            Self::Minus => TOKEN_MINUS,
            Self::Star => TOKEN_STAR,
            Self::Slash => TOKEN_SLASH,
            Self::LParen => TOKEN_LPAREN,
            Self::RParen => TOKEN_RPAREN,
        }
    }
}

// Your custom AST node types
#[derive(Debug, Clone)]
enum BinOpType {
    Add,
    Sub,
    Mul,
    Div,
}

#[derive(Debug, Clone)]
enum AstNode {
    Number(i64),
    BinOp(AstNodeId, BinOpType, AstNodeId),
}
```

### 2. Define Production Rules

Control precedence through grammar structure, not AST structure:

```rust
use shiftkit::{Grammar, AstNodeType, ReductionResult, Index};

// AST node types (for grammar only - not stored in actual AST!)
const NODE_VALUE: AstNodeType = AstNodeType(0);    // Atoms: numbers, parenthesized exprs
const NODE_PRODUCT: AstNodeType = AstNodeType(1);  // *, / (higher precedence)
const NODE_SUM: AstNodeType = AstNodeType(2);      // +, - (lower precedence)

let mut grammar = Grammar::new();

// Value -> NUMBER
grammar.add_rule(
    NODE_VALUE,
    &[TOKEN_NUMBER.into()],
    |indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
        let tok_id = indices[0].as_token_id();
        let Token::Number(num) = tokens[tok_id] else { unreachable!() };
        ReductionResult::NewNode(AstNode::Number(num))
    }
);

// Value -> ( Sum )
grammar.add_rule(
    NODE_VALUE,
    &[TOKEN_LPAREN.into(), NODE_SUM.into(), TOKEN_RPAREN.into()],
    |indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
        // Just forward the inner expression - don't duplicate it!
        ReductionResult::Forward(indices[1].as_ast_node_id())
    }
);

// Product -> Value (pass-through)
grammar.add_rule(
    NODE_PRODUCT,
    &[NODE_VALUE.into()],
    |indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
        ReductionResult::Forward(indices[0].as_ast_node_id())
    }
);

// Product -> Product * Value
grammar.add_rule(
    NODE_PRODUCT,
    &[NODE_PRODUCT.into(), TOKEN_STAR.into(), NODE_VALUE.into()],
    |indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
        let lhs = indices[0].as_ast_node_id();
        let rhs = indices[2].as_ast_node_id();
        ReductionResult::NewNode(AstNode::BinOp(lhs, BinOpType::Mul, rhs))
    }
);

// Sum -> Product (pass-through)
grammar.add_rule(
    NODE_SUM,
    &[NODE_PRODUCT.into()],
    |indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
        ReductionResult::Forward(indices[0].as_ast_node_id())
    }
);

// Sum -> Sum + Product
grammar.add_rule(
    NODE_SUM,
    &[NODE_SUM.into(), TOKEN_PLUS.into(), NODE_PRODUCT.into()],
    |indices: &[Index], tokens: &[Token], nodes: &[AstNode]| {
        let lhs = indices[0].as_ast_node_id();
        let rhs = indices[2].as_ast_node_id();
        ReductionResult::NewNode(AstNode::BinOp(lhs, BinOpType::Add, rhs))
    }
);
```

**Key points:**
- `AstNodeType` variants (`NODE_VALUE`, `NODE_PRODUCT`, `NODE_SUM`) encode precedence in the grammar
- Your actual `AstNode` enum is simple - just `Number` and `BinOp`
- Use `ReductionResult::Forward()` for pass-through rules to avoid duplicating nodes
- Use `ReductionResult::NewNode()` to create new AST nodes

### 3. Parse Input

Build the parser from your grammar and use it to parse token slices:

```rust
use shiftkit::Parser;

// Build the LALR(1) parser from the grammar
let parser: Parser<Token, AstNode> = Parser::from_grammar(grammar)?;

// Parse your input tokens
let tokens = tokenize("1 + 2 * 3");
let ast_nodes = parser.parse(&tokens)?;

// The root node is at the end of the Vec
let root_id = ast_nodes.len() - 1;

// Traverse the AST using indices
fn evaluate(node_id: AstNodeId, nodes: &[AstNode]) -> i64 {
    match &nodes[node_id] {
        AstNode::Number(n) => *n,
        AstNode::BinOp(lhs, op, rhs) => {
            let left = evaluate(*lhs, nodes);
            let right = evaluate(*rhs, nodes);
            match op {
                BinOpType::Add => left + right,
                BinOpType::Sub => left - right,
                BinOpType::Mul => left * right,
                BinOpType::Div => left / right,
            }
        }
    }
}

let result = evaluate(root_id, &ast_nodes);
```

The parser is generic over **tokens** (requiring `HasTokenType`) and **any AST node type**:

```rust
pub struct Parser<T: HasTokenType, A> {
    // ... parsing tables
}

impl<T: HasTokenType, A> Parser<T, A> {
    pub fn parse(&self, tokens: &[T]) -> Result<Vec<A>, ParseError> {
        // Uses `token.token_type()` for parsing decisions
        // Tracks `AstNodeType` internally based on which grammar rules fire
    }
}
```

## Benefits

### Memory Efficiency
- No `Box` allocations for each node
- Contiguous memory layout improves cache performance
- Smaller memory footprint for large ASTs

### Simplicity
- No lifetime management for AST references
- Easy to implement `Copy` semantics for indices
- Straightforward serialization (just write the Vec)

### Flexibility
- Easy to implement multiple passes over the AST
- Can maintain separate metadata vectors indexed by node ID
- Simple to implement AST transformations (just modify the Vec)

### Debugging
- Print the entire AST as a simple Vec
- Node indices provide stable identifiers across transformations
- Easy to visualize the parsing order (nodes are added in reduction order)

## Example: Complete Calculator Grammar

A full example showing precedence control with a simple AST:

```rust
use shiftkit::{Grammar, TokenType, AstNodeType, ReductionResult, Index, AstNodeId};

// Token type IDs (for grammar)
const TOKEN_NUMBER: TokenType = TokenType(0);
const TOKEN_PLUS: TokenType = TokenType(1);
const TOKEN_MINUS: TokenType = TokenType(2);
const TOKEN_STAR: TokenType = TokenType(3);
const TOKEN_SLASH: TokenType = TokenType(4);
const TOKEN_LPAREN: TokenType = TokenType(5);
const TOKEN_RPAREN: TokenType = TokenType(6);

// AST node type IDs (for grammar only - control precedence)
const NODE_VALUE: AstNodeType = AstNodeType(0);    // Atoms
const NODE_PRODUCT: AstNodeType = AstNodeType(1);  // *, /
const NODE_SUM: AstNodeType = AstNodeType(2);      // +, -

// Your actual AST types (no precedence info needed!)
#[derive(Debug, Clone)]
enum BinOpType { Add, Sub, Mul, Div }

#[derive(Debug, Clone)]
enum AstNode {
    Number(i64),
    BinOp(AstNodeId, BinOpType, AstNodeId),
}

let mut grammar = Grammar::new();

// Value -> NUMBER
grammar.add_rule(NODE_VALUE, &[TOKEN_NUMBER.into()],
    |indices, tokens, nodes| {
        let Token::Number(n) = tokens[indices[0].as_token_id()] else { unreachable!() };
        ReductionResult::NewNode(AstNode::Number(n))
    }
);

// Value -> ( Sum )
grammar.add_rule(NODE_VALUE,
    &[TOKEN_LPAREN.into(), NODE_SUM.into(), TOKEN_RPAREN.into()],
    |indices, tokens, nodes| {
        ReductionResult::Forward(indices[1].as_ast_node_id())
    }
);

// Product -> Value
grammar.add_rule(NODE_PRODUCT, &[NODE_VALUE.into()],
    |indices, tokens, nodes| {
        ReductionResult::Forward(indices[0].as_ast_node_id())
    }
);

// Product -> Product * Value
grammar.add_rule(NODE_PRODUCT,
    &[NODE_PRODUCT.into(), TOKEN_STAR.into(), NODE_VALUE.into()],
    |indices, tokens, nodes| {
        ReductionResult::NewNode(AstNode::BinOp(
            indices[0].as_ast_node_id(),
            BinOpType::Mul,
            indices[2].as_ast_node_id()
        ))
    }
);

// Product -> Product / Value
grammar.add_rule(NODE_PRODUCT,
    &[NODE_PRODUCT.into(), TOKEN_SLASH.into(), NODE_VALUE.into()],
    |indices, tokens, nodes| {
        ReductionResult::NewNode(AstNode::BinOp(
            indices[0].as_ast_node_id(),
            BinOpType::Div,
            indices[2].as_ast_node_id()
        ))
    }
);

// Sum -> Product
grammar.add_rule(NODE_SUM, &[NODE_PRODUCT.into()],
    |indices, tokens, nodes| {
        ReductionResult::Forward(indices[0].as_ast_node_id())
    }
);

// Sum -> Sum + Product
grammar.add_rule(NODE_SUM,
    &[NODE_SUM.into(), TOKEN_PLUS.into(), NODE_PRODUCT.into()],
    |indices, tokens, nodes| {
        ReductionResult::NewNode(AstNode::BinOp(
            indices[0].as_ast_node_id(),
            BinOpType::Add,
            indices[2].as_ast_node_id()
        ))
    }
);

// Sum -> Sum - Product
grammar.add_rule(NODE_SUM,
    &[NODE_SUM.into(), TOKEN_MINUS.into(), NODE_PRODUCT.into()],
    |indices, tokens, nodes| {
        ReductionResult::NewNode(AstNode::BinOp(
            indices[0].as_ast_node_id(),
            BinOpType::Sub,
            indices[2].as_ast_node_id()
        ))
    }
);
```

**Result:** Full precedence control (parentheses, `*`/`/`, `+`/`-`) with a simple two-variant AST!

## License

MIT