php-parser 0.1.3

# PHP Parser Architecture & Design Specification (v2.0)

## 1. Overview

**Objective:** specific Build a production-grade, fault-tolerant, zero-copy PHP parser in Rust.
**Target Compliance:** PHP 8.x Grammar.
**Key Architectural Principles:**

1. **Strict Lifetime Separation:** distinct lifetimes for Source Code (`'src`) and AST/Arena (`'ast`).
2. **Pure Arena Allocation:** The AST contains *no* heap allocations (`Vec`, `String`, `Box`). All data lives in the `Bump` arena.
3. **Resilience:** The parser never panics or aborts. It produces Error Nodes and synchronizes to recover context.
4. **Byte-Oriented:** Input is processed as `&[u8]` to handle mixed encodings safely, with Spans representing byte offsets.

---

## 2. Core Data Structures

### 2.1. Spans (Source Mapping)

Spans represent byte offsets. We do not assume UTF-8 validity at the `Span` level, allowing the parser to handle binary strings or legacy encodings if needed.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

impl Span {
    pub fn new(start: usize, end: usize) -> Self { Self { start, end } }
    
    pub fn len(&self) -> usize { self.end - self.start }
    
    /// Safely slice the source. Returns None if indices are out of bounds.
    pub fn as_str<'src>(&self, source: &'src [u8]) -> &'src [u8] {
        &source[self.start..self.end]
    }
}
```

### 2.2. Tokens (The Lexeme)

Tokens are lightweight. Complex data (identifiers, literals) are not stored in the token; only their `Span` is.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Token {
    pub kind: TokenKind,
    pub span: Span,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TokenKind {
    // Keywords (Hard)
    Function, Class, If, Return,
    // Keywords (Soft/Contextual - e.g. 'readonly', 'match')
    Identifier, 
    // Literals
    LNumber, // Integer
    DNumber, // Float
    StringLiteral, // '...' or "..."
    Variable,      // $var
    // Symbols
    Arrow, // ->
    Plus,
    OpenTag, // <?php
    Eof,
}
```

---

## 3. The AST (Abstract Syntax Tree)

### 3.1. Memory Layout Strategy

To ensure high cache locality and zero heap fragmentation:

1. **No `Box<T>`:** Use references `&'ast T`.
2. **No `Vec<T>`:** Use slice references `&'ast [T]`.
3. **Handle Types:** Use type aliases to make signatures readable.

```rust
use bumpalo::Bump;

/// Lifetime 'ast: The duration the Arena exists.
pub type ExprId<'ast> = &'ast Expr<'ast>;
pub type StmtId<'ast> = &'ast Stmt<'ast>;
```

### 3.2. AST Definitions

All AST nodes include a `span` covering the entire construct.

```rust
#[derive(Debug)]
pub struct Program<'ast> {
    pub statements: &'ast [StmtId<'ast>],
    pub span: Span,
}

#[derive(Debug)]
pub enum Stmt<'ast> {
    Echo {
        exprs: &'ast [ExprId<'ast>], // Arena-backed slice
        span: Span,
    },
    Function {
        name: &'ast Token, // Reference to the identifier token
        params: &'ast [Param<'ast>],
        body: &'ast [StmtId<'ast>],
        span: Span,
    },
    /// Represents a parsing failure at the Statement level
    Error {
        span: Span,
    },
    // ...
}

#[derive(Debug)]
pub enum Expr<'ast> {
    Binary {
        left: ExprId<'ast>,
        op: BinaryOp,
        right: ExprId<'ast>,
        span: Span,
    },
    Variable {
        name: Span,
        span: Span,
    },
    /// Represents a parsing failure at the Expression level
    Error {
        span: Span,
    },
    // ...
}
```

---

## 4. The Lexer (Context & State)

The Lexer is a **state machine** that operates on `&[u8]`. It accepts hints from the Parser to handle "Soft Keywords" (e.g., treating `match` as an identifier when following `->`).

### 4.1. Lexer Modes

The parser controls the lexer's sensitivity to keywords.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LexerMode {
    Standard,           // Normal PHP parsing
    LookingForProperty, // After '->' or '::'. Keywords become identifiers.
    LookingForVarName,  // After '$'. 
}
```

### 4.2. Lexer Implementation

```rust
pub struct Lexer<'src> {
    input: &'src [u8],
    cursor: usize,
    /// Stack for interpolation (Scripting, DoubleQuote, Heredoc)
    state_stack: Vec<LexerState>, 
    /// Current internal state (e.g. InScripting)
    internal_state: LexerState,
    /// Mode hint from Parser
    mode: LexerMode,
}

impl<'src> Lexer<'src> {
    pub fn new(input: &'src [u8]) -> Self { /* ... */ }

    /// Called by the Parser/TokenSource to change context
    pub fn set_mode(&mut self, mode: LexerMode) {
        self.mode = mode;
    }
}

impl<'src> Iterator for Lexer<'src> {
    type Item = Token;
    fn next(&mut self) -> Option<Self::Item> {
        // Logic combining internal_state + self.mode
    }
}
```

---

## 5. The Parser (Recursive Descent + Pratt)

The parser orchestrates the Lexer, the Arena, and Error Recovery.

### 5.1. Token Source Abstraction

Decouples the parser from the raw lexer, enabling lookahead (LL(k)).

```rust
pub trait TokenSource<'src> {
    fn current(&self) -> &Token;
    fn lookahead(&self, n: usize) -> &Token;
    fn bump(&mut self);
    fn set_mode(&mut self, mode: LexerMode);
}
```

### 5.2. Parser Struct

Separates input lifetime (`'src`) from output lifetime (`'ast`).

```rust
pub struct Parser<'src, 'ast, T: TokenSource<'src>> {
    tokens: T,
    arena: &'ast Bump,
    errors: Vec<ParseError>,
    /// Marker to use 'src
    _marker: std::marker::PhantomData<&'src ()>, 
}
```

### 5.3. Error Recovery Strategy

The parser uses **Error Nodes** and **Synchronization**.

1. **Expected Token Missing:** Record error, insert synthetic node/token if trivial, or return `Expr::Error`.
2. **Unexpected Token:** Record error, advance tokens until a "Synchronization Point" (`;`, `}`, `)`).

```rust
impl<'src, 'ast, T: TokenSource<'src>> Parser<'src, 'ast, T> {
    
    /// Main entry point for expressions
    fn parse_expr(&mut self, min_bp: u8) -> ExprId<'ast> {
        // Check binding power, recurse...
        // If syntax is invalid, do NOT panic.
        // self.errors.push(...);
        // return self.arena.alloc(Expr::Error { span })
    }

    /// Synchronization helper
    fn sync_to_stmt_boundary(&mut self) {
        while self.tokens.current().kind != TokenKind::Eof {
            match self.tokens.current().kind {
                TokenKind::SemiColon | TokenKind::CloseBrace => {
                    self.tokens.bump();
                    return;
                }
                _ => self.tokens.bump(),
            }
        }
    }
}
```

---

## 6. Public API

This defines the library boundary.

```rust
pub struct ParseResult<'ast> {
    pub program: Program<'ast>,
    pub errors: Vec<ParseError>,
}

/// The main entry point.
/// 
/// - `source`: Raw bytes of the PHP file.
/// - `arena`: The Bump arena where AST nodes will be allocated.
pub fn parse<'src, 'ast>(
    source: &'src [u8], 
    arena: &'ast Bump
) -> ParseResult<'ast> {
    let lexer = Lexer::new(source);
    let mut parser = Parser::new(lexer, arena);
    parser.parse_program()
}
```

---

## 7. Development Phases

### Phase 1: Infrastructure & Basics

1. **Lexer MVP:** Implement `TokenSource`, `Lexer` with basic states (`Initial`, `Scripting`).
2. **Arena Setup:** Integrate `bumpalo`.
3. **AST Skeleton:** Define basic `Stmt` and `Expr` structs.
4. **Test Harness:** Setup `insta` for snapshot testing.

### Phase 2: Expression Engine (Pratt)

1. **Precedence Table:** Map PHP precedence to Binding Powers.
2. **Operators:** Implement Binary, Unary, Ternary, and `instanceof`.
3. **Error Nodes:** Ensure malformed math (e.g., `1 + * 2`) produces `Expr::Error`.

### Phase 3: Statements & Control Flow

1. **Block Parsing:** Handle `{ ... }` and scopes.
2. **Control Structures:** `if`, `while`, `return`.
3. **Synchronization:** Implement `sync_to_stmt_boundary` to recover from missing semicolons.

### Phase 4: Advanced Lexing

1. **Interpolation Stack:** `DoubleQuotes`, `Heredoc`, `Backticks`.
2. **Complex Identifiers:** Support `LexerMode::LookingForProperty` for `$obj->class`.

---

## 8. Testing Strategy

1. **Unit Tests:** For individual Lexer state transitions.
2. **Snapshot Tests (Insta):**
    * Input: `test.php`
    * Output: Textual representation of the AST (Debug fmt).
    * Purpose: Catch regressions in tree structure.
3. **Recovery Tests:**
    * Input: `<?php echo 1 + ; echo "done";`
    * Assert: `program.statements[0]` is `Echo(Expr::Error)`.
    * Assert: `program.statements[1]` is `Echo("done")`.
    * The parser must *not* stop at the first semicolon error.
4. **Corpus Tests:** Parse large open-source PHP projects (WordPress, Laravel) to ensure no panics on valid code.