# graphitesql roadmap
This document is the plan for building **graphitesql**: a single-crate, pure,
safe, `no_std` Rust implementation of SQLite with byte-for-byte compatibility
with the SQLite 3 file format.
It is meant to be read top-to-bottom once, then used as a checklist. Each phase
lists its **deliverable**, its **done criterion**, and the **upstream SQLite
files** that serve as the specification for that phase (fetch them with
`reference/fetch.sh`).
---
## 1. Architecture
SQLite has a famously clean layered design. We mirror it, because the layering is
what makes the file format and the SQL semantics tractable to re-implement
independently. Data flows top-to-bottom on writes and bottom-to-top on reads:
```
┌──────────────────────────────────────────────┐
SQL text │ api Connection / Statement / Row │ public API
├──────────────────────────────────────────────┤
│ sql::token tokenizer │
│ sql::parser parser ──► sql::ast │ front end
├──────────────────────────────────────────────┤
│ planner query planning (join/index) │
│ codegen AST ──► VDBE bytecode │ compiler
├──────────────────────────────────────────────┤
│ vdbe register virtual machine │ execution
│ func collate built-in functions, collations │
├──────────────────────────────────────────────┤
│ btree table & index B-trees, cursors │ data model
├──────────────────────────────────────────────┤
│ pager page cache, transactions, │ storage
│ rollback journal, WAL, locking │
├──────────────────────────────────────────────┤
│ format on-disk byte layout (the spec) │ format
├──────────────────────────────────────────────┤
│ vfs Vfs / File traits (mem, std, …) │ OS boundary
└──────────────────────────────────────────────┘
```
### Module ↔ upstream map
graphitesql is one crate; these are its modules and the SQLite files that
specify each. Reading the C file is the fastest way to get a phase right.
| `vfs` | OS abstraction: open/read/write/sync/lock | `os_unix.c`, `os_win.c`, `os.c` |
| `format` | byte layout of header, pages, cells, records, freelist | `fileformat2.html`, `btree.c` (comments), `btreeInt.h` |
| `pager` | page cache, atomic commit, journal, WAL, locking | `pager.c`, `wal.c`, `pcache.c`, `pcache1.c` |
| `btree` | table/index B-trees, cursors, balancing | `btree.c`, `btreeInt.h` |
| `value` / record | storage classes, serial types, affinity | `vdbemem.c`, `vdbeaux.c` |
| `sql::token` | tokenizer | `tokenize.c`, `keywordhash.h` |
| `sql::parser`/`ast`| grammar → parse tree | `parse.y`, `expr.c`, `resolve.c` |
| `codegen` | parse tree → bytecode | `build.c`, `insert.c`, `update.c`, `delete.c`, `select.c` |
| `planner` | join order, index selection, query flattening | `where.c`, `wherecode.c`, `whereexpr.c`, `analyze.c` |
| `vdbe` | the register VM and its opcodes | `vdbe.c`, `vdbeaux.c`, `vdbeapi.c`, `opcodes.h` |
| `func` / `collate` | scalar/aggregate funcs, collations | `func.c`, `date.c`, `callback.c` |
| `schema` | parse `sqlite_schema`, build the catalog | `build.c`, `prepare.c` |
| `api` | `Connection`/`Statement` and (later) C-API shim | `main.c`, `vdbeapi.c`, `legacy.c` |
### Why a VDBE (and not a tree-walker)
The VDBE bytecode is **not** stored in the database file, so file compatibility
does not require it. We adopt it anyway because:
1. SQLite's observable semantics (evaluation order, type coercion, `NULL`
handling, the exact rows a query returns) are defined operationally by what
the VDBE does. Matching the model is the surest path to matching behavior.
2. It cleanly separates "what to compute" (codegen/planner) from "how to run it"
(vdbe), which keeps each piece independently testable.
3. `EXPLAIN` output becomes directly comparable to SQLite's for differential
testing.
We will not be bug-for-bug identical in bytecode, but we will be result-identical.
---
## 2. Design principles
- **`#![forbid(unsafe_code)]`, no exceptions.** Enforced in `Cargo.toml` lints.
If a hot path seems to need `unsafe`, redesign or accept the safe cost.
- **`no_std` + `alloc` is the baseline.** `std` is an additive feature (real
files, `std::error::Error`). Nothing core may depend on `std`.
- **Zero dependencies.** No crates in the default build. Optional dev/test-only
dependencies (e.g. for property testing) are acceptable behind `cfg(test)`.
- **The VFS is the only I/O boundary.** All file access goes through the `Vfs`
and `File` traits. This is what makes `:memory:`, std files, and wasm uniform.
- **Compatibility is verified, not assumed.** Every format phase ends with a
differential test against the real `sqlite3` library/CLI (see §4).
- **Fail loud while young.** Unimplemented paths return `Error::Unsupported`
rather than silently producing wrong results.
- **Document with the spec.** Each format module carries the relevant
file-format table in its doc comment, so the code and the spec live together.
---
## 3. Phases
Phases are ordered so that each builds only on completed lower layers, and so
that something useful and testable exists early (read-only access to real
databases lands well before write support).
### Phase 0 — Foundation ✅ *(done)*
- **Deliverable:** crate scaffold; varints; value/serial-type model; database
header parse/serialize; error type; CI-ready `cargo test`/`clippy`.
- **Done:** `no_std` builds; header round-trips a real `sqlite3` file
byte-for-byte; all unit tests green.
- **Files:** `src/util/varint.rs`, `src/value.rs`, `src/format/header.rs`,
`src/error.rs`.
### Phase 1 — VFS & raw paging *(read side)* ✅ *(done)*
- **Deliverable:** `Vfs`/`File` traits; in-memory VFS; std-file VFS (feature
`std`); a read-only pager that maps page numbers → byte slices and validates
page size against the header.
- **Done:** opens the committed real-SQLite fixtures (`tests/fixtures/*.db`)
through the std VFS + pager, re-derives the header, and reads every page;
page 1's 100-byte header offset is handled via `Page::body_offset`.
- **Files:** `src/vfs/{mod,memory,std_file}.rs`, `src/pager/mod.rs`,
`tests/read_fixtures.rs`.
- **Reference:** `os_unix.c` (the VFS contract), `pager.c` (page lifecycle).
### Phase 2 — B-tree reader ✅ *(done)*
- **Deliverable:** parse table & index, interior & leaf b-tree pages; cell
parsing including overflow-page chains; a forward/seek cursor over a table
b-tree keyed by rowid, and over an index b-tree keyed by record.
- **Done:** table scan of `nums` (2000 rows, interior pages) sums correctly;
20 KB blob row reassembles across overflow pages; rowid `seek` (exact / past-
end / before-first) works; index scan yields every entry. All verified
against real fixtures. *(Index seek-by-key awaits record comparison in
Phase 3.)*
- **Files:** `src/btree/{mod,page,cursor}.rs`; tests in `tests/read_fixtures.rs`.
- **Reference:** `btree.c`, `btreeInt.h`; `fileformat2.html` (B-tree pages,
cell format, overflow).
### Phase 3 — Record decoding & the schema catalog ✅ *(done)*
- **Deliverable:** decode records (header of serial types + bodies) into
`Value`s; read and parse `sqlite_schema` (table 1) into an in-memory catalog
of tables/indexes with their root pages and SQL.
- **Done:** full record decode incl. all integer widths (sign-extended),
REAL, TEXT (UTF-8/16), BLOB; `Schema::read` enumerates `sqlite_schema`,
resolves a table name → root page; end-to-end test decodes every row of
`basic.db` to typed values (incl. INTEGER-PRIMARY-KEY → NULL aliasing).
- **Files:** `src/format/record.rs`, `src/schema/mod.rs`.
- **Reference:** `fileformat2.html` (record format, schema table), `build.c`.
### Phase 4 — SQL front end (tokenizer + parser + AST) ✅ *(done)*
- **Deliverable:** tokenizer matching SQLite's lexical rules and keyword set; a
recursive-descent parser producing an AST for the core grammar: `SELECT`
(with `WHERE`/`GROUP BY`/`HAVING`/`ORDER BY`/`LIMIT`/joins), `INSERT`,
`UPDATE`, `DELETE`, `CREATE TABLE/INDEX`, `DROP`, transactions, and `PRAGMA`.
- **Done:** tokenizer covers strings/blobs/quoted-idents/comments/params/numbers
(UTF-8 safe); Pratt expression parser with SQLite precedence incl.
`IS [NOT] NULL`, `IN`, `BETWEEN`, `LIKE`/`GLOB`, `CASE`, `CAST`; reserved-word
checks reject `SELECT FROM`; ~20 parser/tokenizer tests. *(Subqueries & CTEs
deferred to Phase 9.)*
- **Files:** `src/sql/{mod,token,ast,parser}.rs`.
- **Reference:** `tokenize.c`, `parse.y`, `expr.c`. (We hand-write the parser
rather than port the Lemon grammar, but `parse.y` is the source of truth for
precedence and accepted forms.)
### Phase 5 — Read-query execution engine ✅ *(done)*
- **Deliverable:** a `Connection` that parses, resolves names against the schema,
scans b-trees, decodes records, and evaluates expressions to produce result
rows; `NULL` three-valued logic, SQLite comparison order & numeric coercion,
`LIKE`/`GLOB`, `CASE`/`CAST`, a core of scalar functions, and the aggregates
`count`/`sum`/`avg`/`min`/`max`/`total`/`group_concat` with `GROUP BY`/`HAVING`,
plus `WHERE`, `ORDER BY` (by position/alias/expr), `LIMIT`/`OFFSET`, `DISTINCT`.
- **Done:** `Connection::open(path).query(sql)` returns correct rows on the real
fixtures, verified differentially against `sqlite3` (e.g. `GROUP BY id%3`
counts 666/667/667). INTEGER-PRIMARY-KEY rowid aliasing handled.
- **Files:** `src/exec/{mod,eval,func}.rs`, `src/util/float.rs`,
`tests/query.rs`.
- **Implementation note — executor vs. bytecode:** the roadmap originally
specified a VDBE bytecode VM. graphitesql instead ships an **operational,
iterator-style executor** with the *same observable semantics*. This was the
pragmatic path to a correct, testable read engine; adopting a VDBE bytecode IR
is now an internal refactor (it changes how queries are represented, not their
results) and is tracked for a later pass. *(Single-table only; joins/subqueries
in Phase 9.)*
- **Reference:** `vdbe.c`, `select.c`, `where.c`, `func.c`.
### Phase 6 — B-tree writer + pager transactions *(write side)*
- **Deliverable:** insert/delete/update cells with page splitting and balancing;
freelist management; the **rollback journal** with correct fsync ordering and
crash-safe atomic commit; locking state machine; `BEGIN`/`COMMIT`/`ROLLBACK`.
- **Done:** create a database from scratch and `INSERT`/`UPDATE`/`DELETE`; the
resulting file opens in `sqlite3` and `PRAGMA integrity_check` passes; a
simulated crash mid-commit recovers to a consistent state.
- **Reference:** `btree.c` (balance), `pager.c` (journal, commit), `fileformat2.html`
(freelist, the rollback journal format).
### Phase 7 — DDL & DML codegen
- **Deliverable:** codegen for `CREATE TABLE/INDEX`, `INSERT`, `UPDATE`,
`DELETE`, `ALTER TABLE`, `DROP`; schema mutation writes a correct
`sqlite_schema` and bumps the schema cookie; `ROWID`/`INTEGER PRIMARY KEY`
and `WITHOUT ROWID` tables; `UNIQUE`/`NOT NULL`/`CHECK`/`DEFAULT` constraints.
- **Done:** the full `gtest.db` creation script run through graphitesql produces
a file byte-comparable (modulo documented nondeterminism) to `sqlite3`'s, and
the reverse opens cleanly.
- **Reference:** `build.c`, `insert.c`, `update.c`, `delete.c`, `alter.c`.
### Phase 8 — WAL mode
- **Deliverable:** write-ahead log read & write; `-wal` and `-shm` handling;
checkpointing; the WAL locking protocol; `PRAGMA journal_mode=wal`.
- **Done:** a WAL-mode database is interoperable with `sqlite3` in both
directions, including reading a database another writer left mid-WAL.
- **Reference:** `wal.c`, `fileformat2.html` (the WAL format).
### Phase 9 — Compatibility hardening & breadth
- **Deliverable:** foreign keys & triggers; views; `WITH`/CTEs & recursive
queries; window functions; the rest of the built-in functions (`date/time`,
`printf`, math); `EXPLAIN`; the bulk of `PRAGMA`s; `VACUUM`; collation &
encoding edge cases; the SQLite type-affinity rules in full.
- **Done:** pass a curated slice of SQLite's own test assertions and a large
differential corpus.
- **Reference:** `fkey.c`, `trigger.c`, `window.c`, `date.c`, `pragma.c`,
`vacuum.c`, plus SQLite's `test/` TCL suite as an oracle.
### Phase 10 — Ecosystem & extensions *(post-1.0, behind features)*
- C-API shim crate (`libsqlite3`-compatible surface), virtual tables, FTS5,
R-Tree, `sqlite3_session`, user-defined functions from Rust, an async VFS for
wasm. Each is opt-in and out of the core compatibility promise.
---
## 4. File-format compatibility strategy
This is the project's whole reason to exist, so it gets first-class testing.
- **Golden-file tests.** Check small, deterministic databases produced by a
pinned `sqlite3` into the test suite (as byte arrays or fixtures) and assert
graphitesql parses them exactly. Phase 0 already does this for the header.
- **Round-trip tests.** Parse → re-serialize → assert identical bytes for every
format structure (header done; pages, cells, records, journal, WAL to come).
- **Differential tests.** For each phase from 5 on, run the same SQL through
both `sqlite3` (the C library) and graphitesql and diff the results, and diff
the resulting database files where determinism allows.
- **`integrity_check` as a gate.** Any database graphitesql writes must pass
`sqlite3`'s `PRAGMA integrity_check` and `PRAGMA foreign_key_check`.
- **Fuzzing.** Fuzz the readers with malformed pages (must return
`Error::Corrupt`, never panic) and fuzz SQL parsing.
- **Crash-recovery tests.** A VFS that can inject failures/truncation at chosen
fsync points, asserting recovery to a consistent state.
### Known sources of legitimate file divergence
Two SQLite-compatible writers can produce different bytes for the same logical
content. We document and accept these rather than chase them:
- free-page reuse order and exact balancing splits,
- `change_counter` / `version_valid_for` values,
- the embedded `SQLITE_VERSION_NUMBER` of the writer,
- unused/reserved bytes left from deletions.
Compatibility means *both engines can read each other's files and agree on
contents*, not that bytes are identical for independently-built databases.
---
## 5. Open decisions
These are deliberately unresolved; we'll settle them with input rather than by
default. Tracked here so they don't get lost.
1. ~~**License.**~~ *Resolved:* **public domain**, mirroring SQLite, with the
SQLite blessing in place of a legal notice (SPDX `blessing`). See `LICENSE`.
2. **Parser: hand-written vs. generated.** Plan is a hand-written
recursive-descent parser (no build-time codegen, friendlier errors). The
alternative is porting the Lemon grammar. → *Leaning hand-written.*
3. **Concurrency model.** Single-connection first. Whether to support multiple
connections / threads sharing a pager (and how locking maps onto the `Vfs`
trait) is a Phase 6+ question. → *Deferred.*
4. **Big integers / decimal.** SQLite stores reals as f64; we match that. No
extended numeric types planned.
5. **MSRV policy.** Pinned at 1.95 today; revisit before 1.0.
---
## 6. Immediate next steps (Phase 1 kickoff)
1. Define `Vfs` and `File` traits in `src/vfs/mod.rs` (open flags, read-at,
write-at, truncate, sync, file size, lock/unlock).
2. Implement `src/vfs/memory.rs` (a `Vec<u8>`-backed file) — works on wasm and
`no_std`.
3. Implement `src/vfs/std_file.rs` behind `feature = "std"`.
4. Implement a read-only `Pager` that, given a `File`, validates the header and
hands out page byte-slices, accounting for page 1's 100-byte header offset.
5. Add the first end-to-end test: open `gtest.db` through the std VFS + pager and
reconstruct the header — the same `DatabaseHeader` Phase 0 parses directly.