quill-sql 0.2.1

An educational Rust relational database (RDBMS) inspired by CMU 15445
Documentation
# QuillSQL Architecture

This chapter gives a tour of the major subsystems, the flow of a SQL statement, and the contract between execution, transactions, and storage. It uses Mermaid diagrams so you can render the visuals directly inside mdBook or any compatible viewer.

---

## 1. End-to-End Pipeline

```mermaid
flowchart LR
    subgraph Frontend
        CLI["bin/client"] --> Parser["sql::parser"]
        HTTP["bin/server"] --> Parser
    end
    Parser --> LPlan["plan::LogicalPlanner"]
    LPlan --> Optimizer["optimizer::LogicalOptimizer"]
    Optimizer --> PPlan["plan::PhysicalPlanner"]
    PPlan --> Exec["execution::ExecutionEngine (Volcano)"]

    subgraph Txn["transaction::*"]
        Session["session::SessionContext"]
        TM["TransactionManager"]
        LM["LockManager"]
        Session --> TM --> LM
    end

    Exec <-->|snapshot, locks| Txn
    Exec --> Binding

    subgraph Storage["storage::* & buffer::*"]
        Binding["storage::engine::TableBinding"]
        Heap["storage::table_heap::TableHeap"]
        MVCC["storage::heap::MvccHeap"]
        Index["storage::index::BPlusTree"]
        Buffer["buffer::BufferManager"]
        Disk["storage::disk_scheduler (io_uring)"]
        WAL["recovery::WalManager"]
        Binding --> Heap
        Binding --> Index
        Heap <--> Buffer
        Index <--> Buffer
        Buffer <--> Disk
        WAL --> Disk
    end

    Background["background::workers\n(checkpoint, WAL writer, MVCC vacuum)"] --> {Buffer, WAL, TM}
```

**Key takeaways**
- The frontend (CLI/HTTP) only knows how to parse SQL and drive the planning stages; all shared state lives below.
- The `ExecutionEngine` drives a Volcano iterator. Each physical operator calls into the transaction runtime for visibility checks and locking, but touches storage exclusively through a `TableBinding`. This makes the executor easy to reason about in a classroom setting.
- Buffering, WAL, and disk I/O are centralized so that durability/ordering guarantees stay in one module.

---

## 2. Transactions, MVCC, and the Executor

```mermaid
sequenceDiagram
    autonumber
    participant Session
    participant TM as TransactionManager
    participant Runtime as TxnRuntime
    participant Exec
    participant Binding
    participant Heap as TableHeap

    Session->>TM: begin_txn(isolation, access_mode)
    TM-->>Session: Transaction handle
    Exec->>Runtime: TxnRuntime::new(&TM, &mut txn)
    Runtime-->>Exec: {snapshot, command_id}

    loop per tuple
        Exec->>Binding: scan.next()
        Binding->>Heap: TableIterator::next()
        Heap-->>Binding: (rid, meta, tuple)
        Binding-->>Exec: (rid, meta, tuple)
        Exec->>Runtime: is_visible(meta)?
        Runtime-->>Exec: yes/no
        alt visible & needs lock
            Exec->>Runtime: lock_row(table, rid, mode)
            Runtime-->>Exec: guard
        end
        alt mutation
            Exec->>Binding: prepare_row_for_write(...)
            Binding->>Heap: mvcc.full_tuple(...)
            Heap-->>Binding: (current_meta, tuple)
            Binding->>TM: push_undo + append WAL(HeapRecord)
        end
    end

    Session->>TM: commit(txn)
    TM->>WAL: Transaction{Commit}
    TM->>TM: flush_until(lsn) if synchronous
    TM-->>Session: release locks, clear snapshot
```

**Snapshot policy**
- `Read Uncommitted / Read Committed`: every command obtains a fresh snapshot, so repeated statements can see committed updates.
- `Repeatable Read / Serializable`: cache the first snapshot on the `Transaction` handle; subsequent commands reuse it for consistent reads. The lock manager releases S-locks at the end of each command for RR, but Serializable keeps them until commit.

**Undo & WAL**
- Each mutation produces a logical `HeapRecordPayload` or `IndexRecordPayload`. The heap payload already carries redo (new bytes) and undo (old bytes), so recovery can replay forward or backward without re-reading heap pages.
- On abort, the manager walks the undo stack, emits CLRs, and re-applies the inverse heap/index operations.

---

## 3. Storage Layering

```mermaid
graph TD
    subgraph Planner View
        Catalog
    end
    Catalog -->|Arc<TableHeap>| TableBinding
    TableBinding --> Heap["TableHeap"]
    TableBinding --> Mvcc["MvccHeap"]
    TableBinding --> Index["BPlusTreeIndex"]
    Heap -->|pages| BufferMgr
    Index -->|pages| BufferMgr
    BufferMgr --> DiskMgr
    BufferMgr --> WalMgr
```

| Layer | Responsibility | Notes |
| ----- | -------------- | ----- |
| `TableHeap` | Physical slotted pages, tuple encoding, WAL page image helpers | Exposes insert/update/delete that emit heap-specific WAL payloads before dirtying frames. |
| `MvccHeap` | Version chain management, delete-marking, undo metadata | Creates new versions, rewires forward/back pointers, and delegates actual I/O to `TableHeap`. |
| `TableBinding` | Safe façade for the executor | Provides `scan`, `index_scan`, `insert`, `delete`, `update`, and `prepare_row_for_write`, always pairing heap/index changes so operators stay small. |
| `BufferManager` + `DiskScheduler` | Unified cache + async I/O | Uses LRU-K (+ optional TinyLFU admission) and io_uring to keep the hot set resident. |
| `WalManager` | ARIES-compatible log | Supports logical heap/index records, page write/delta fallback, CLRs, checkpoints, and background flush threads. |

---

## 4. `TableBinding` Contract

```mermaid
classDiagram
    class TableBinding {
        +scan() -> TupleStream
        +index_scan(name, IndexScanRequest) -> TupleStream
        +insert(&mut TxnContext, &Tuple)
        +delete(&mut TxnContext, RecordId, TupleMeta, Tuple)
        +update(&mut TxnContext, RecordId, Tuple, TupleMeta, Tuple) -> RecordId
        +prepare_row_for_write(&mut TxnContext, RecordId, &TupleMeta)
        +table_heap() -> Arc<TableHeap>
    }

    class PhysicalOperator {
        +init(&mut ExecutionContext)
        +next(&mut ExecutionContext) -> Option<Tuple>
    }

    PhysicalOperator --> TableBinding : uses
```

This binding hides every MVCC/WAL detail from the operators:
- No more ad-hoc `catalog.table_indexes()` calls.
- No direct references to `MvccHeap` or `TableHeap`.
- Executor code reads like pseudo SQL: “lock table”, “scan binding”, “update tuple”.

---

## 5. WAL & Recovery Details

```mermaid
flowchart LR
    subgraph WAL Record Types
        HI["HeapInsert"] --> redo
        HD["HeapDelete (with before-image)"] --> redo & undo
        LI["IndexLeaf{Insert,Delete}"]
        IS["IndexSplit/Merge/Redistribute/Parent*"]
        IR["IndexRoot Install/Adopt/Reset"]
        PI["PageWrite (FPW)"]
        CI["Checkpoint"]
        CL["CLR"]
    end
    Exec -->|Heap/Index payloads| WalMgr
    BufferMgr -->|first-touch FPW| WalMgr
    WalMgr --> DiskScheduler --> log files
    Recovery -->|analysis/redo/undo| BufferMgr
```

- **Logical logging first**: heap/index mutations emit redo + undo at the logical level. This keeps the WAL stream compact and human-readable for teaching.
- **Physical fallbacks**: metadata-heavy pages (meta, freelist, header) still leverage PageWrite FPWs to guarantee a consistent base image, especially on the first page flush after a crash.
- **Restart**: `RecoveryManager` performs the classical ARIES sequence. It uses the `dirty_page_table` and `active_txn_table` captured by checkpoints to limit redo and undo work.

---

## 6. Background Workers

| Worker | Purpose | Configuration |
| ------ | ------- | ------------- |
| WAL writer | Periodically flushes WAL buffers, coalesces adjacent writes | `WalOptions::writer_interval_ms`, `buffer_capacity` |
| Checkpoint | Records ATT + DPT, gently pushes dirty buffers | `WalOptions::checkpoint_interval_ms` |
| Buffer writer | Flushes frames when the dirty set grows too large | `background::BufferWriterConfig` |
| MVCC vacuum | Reclaims obsolete tuple versions based on `safe_xmin` | `MvccVacuumConfig` |
| Index vacuum | Cleans up deleted index entries using B-link traversal | `IndexVacuumConfig` |

Workers register with `background::BackgroundWorkers`, so the database can stop and join them cleanly on shutdown. Each worker publishes metrics (intervals, batches processed) for observability.

---

## 7. Example: Repeatable Read UPDATE

```mermaid
sequenceDiagram
    participant T1 as Txn 1 (RR)
    participant T2 as Txn 2 (RC)
    participant Heap as TableHeap

    T1->>Heap: SELECT (snapshot S0)
    Heap-->>T1: version V0
    T2->>Heap: UPDATE -> insert V1, delete V0
    Heap-->>T2: ack
    T2->>T2: COMMIT (WAL + locks release)
    T1->>Heap: SELECT again
    Heap-->>T1: V0 still visible via snapshot S0
    T1->>T1: COMMIT
    Note over T1,Heap: Vacuum later removes V0 when safe_xmin advances
```

This timeline demonstrates:
- Snapshots shield Repeatable Read statements from concurrent writers even if row-level locks are released early.
- The MVCC chain (`V1.prev_version = V0`, `V0.next_version = V1`) lets future readers reach the newest committed version while the vacuum worker reclaims obsolete ones lazily.

---

## 8. Observability & Configuration Cheat Sheet

- **Logging**: `RUST_LOG=trace` surfaces lock acquisitions, MVCC vacuums, and io_uring completions.
- **Runtime knobs**: `WalOptions` (segment size, sync policy), `BufferPoolConfig` (capacity, TinyLFU toggle), `MvccVacuumConfig`, and `IndexVacuumConfig` can all be adjusted via `DatabaseOptions`.
- **Metrics**: `background::BackgroundWorkers::snapshot()` reports worker health; WAL exposes current LSN and flush LSN; the buffer manager can dump the dirty page table for diagnostics.

---

With these layers in place, QuillSQL remains faithful to production-grade engines (MVCC + WAL + buffer pool) while keeping its code and documentation approachable for coursework and research prototypes.