# Concurrent writes plan β MVCC + `BEGIN CONCURRENT`
> π **Looking for the user-facing reference?** This is the original
> design proposal, kept as the historical record of the decisions
> that shaped Phase 11. For the shipped surface β SQL, embedding API,
> SDK error mapping, REPL meta-commands, durability story,
> limitations β read [**`concurrent-writes.md`**](concurrent-writes.md)
> first; come back here when you want the *why* and the
> sequencing discussion.
**Status:** shipped end-to-end through Phase 11.11a (May 2026); a small set of follow-ups remain explicitly parked β see the [roadmap](roadmap.md#phase-11--concurrent-writes-via-mvcc--begin-concurrent-sqlr-22-in-flight--see-concurrent-writes-planmd). Drafted 2026-05-07.
**Inspiration:** [Turso](https://turso.tech) β a SQLite-compatible engine, written in Rust, that implements multi-version concurrency control to lift SQLite's single-writer ceiling. See [`turso/core/mvcc/`](https://github.com/tursodatabase/turso/tree/main/core/mvcc) and the [Turso concurrent-writes docs](https://docs.turso.tech/tursodb/concurrent-writes).
**Tracks:** [SQLR-22](https://app.marvinapp.io/) (Marvin).
This document proposes adding **multi-version concurrency control (MVCC)** and a **`BEGIN CONCURRENT`** transaction mode to SQLRite, enabling multiple writers in the same process to make progress in parallel under snapshot isolation, with row-level write-write conflict detection at commit. It is intentionally a *plan* β there is no code yet.
The output of this work, when shipped, would put SQLRite in the same conceptual bucket as Turso's `--experimental-mvcc`: the file format stays compatible with the existing v5 layout, the legacy serialized path keeps working, and apps opt in by setting a journal mode and using a new `BEGIN CONCURRENT` statement.
---
## 1. Why bother
SQLite's single-writer rule is the design choice users hit first when they scale. It's not a bug β it's the consequence of a single global file lock (`PENDING`/`EXCLUSIVE`) that serializes commit ordering β but in workloads where two concurrent writers touch *disjoint* rows it costs throughput unnecessarily.
Turso's measurements report **up to 4Γ the write throughput of SQLite** when the workload is dominated by non-conflicting writes ([Turso v0.5.0 blog post](https://turso.tech/blog/turso-0.5.0); [Beyond the single-writer limitation](https://turso.tech/blog/beyond-the-single-writer-limitation-with-tursos-concurrent-writes)). For SQLRite β which today inherits SQLite's single-writer model end-to-end (advisory `flock(LOCK_EX)` plus a single `Pager` per process; see [pager.md](pager.md) and [Decision 11 in design-decisions.md](design-decisions.md)) β the same uplift is theoretically available, *if* the engine learns to track row versions and detect write-write conflicts at commit instead of holding an exclusive write lock for the duration.
This is also a natural fit for SQLRite's stated remit (a hands-on study of database internals). MVCC, snapshot isolation, optimistic concurrency control, and version-chain garbage collection are textbook database-systems topics; implementing them turns the engine into a credible reference for those topics rather than a SQLite clone with extra storage tricks.
---
## 2. What Turso does
Turso's MVCC implementation is the most directly relevant prior art β same target SQL, same general layout, same Rust language, and they've already navigated most of the design questions we'd face. A summary of the moving parts (cross-checked against [`core/mvcc/mod.rs`](https://github.com/tursodatabase/turso/blob/main/core/mvcc/mod.rs) and [`core/mvcc/database/mod.rs`](https://github.com/tursodatabase/turso/blob/main/core/mvcc/database/mod.rs)):
### 2.1 Activation
```sql
PRAGMA journal_mode = mvcc; -- per-connection switch
BEGIN CONCURRENT;
-- writes
COMMIT; -- may return SQLITE_BUSY β caller retries
```
Compiled in behind `--experimental-mvcc`. `BEGIN CONCURRENT` doesn't acquire any locks; `BEGIN IMMEDIATE` / `BEGIN DEFERRED` retain SQLite's exclusive-write semantics.
### 2.2 In-memory version index
The MVCC store keeps an in-memory map keyed by `RowID { table_id, row_key }` whose value is a chain of `RowVersion` records. Each version carries:
- `begin: TxTimestampOrID` β the commit timestamp (or transaction ID, while uncommitted) at which this version becomes visible
- `end: Option<TxTimestampOrID>` β when it stops being visible (a later UPDATE/DELETE pushes a new head onto the chain)
- the row payload itself
Visibility for a reader transaction with begin-timestamp `T` is the textbook snapshot-isolation rule: pick the version whose `begin <= T < end`.
The Hekaton paper (Larson et al., *VLDB 2011*) is the explicit reference. The same paper also covers the validation phase used at commit and the garbage-collection rules sketched in Β§7 below.
### 2.3 Read path
- If the row is present in the MVCC index β use it (latest-wins; all writes funnel through the index, so the in-memory copy is the source of truth).
- Otherwise β fall through to the existing pager β WAL β main-file read path. Result is materialized into the index on first touch ("eager loaded" today, which is the source of Turso's memory-pressure caveats).
### 2.4 Write path
Writes go entirely to the MVCC index, tagged with the transaction ID. They're invisible to other transactions until the writer commits. The transaction also tracks a **read set** and a **write set** for validation.
### 2.5 Commit / conflict detection
Three phases:
1. Take a commit timestamp from a logical clock.
2. Walk the write set; for each row, check whether *any* version newer than the transaction's begin timestamp exists in the index. If yes β write-write conflict. The transaction is aborted and the caller sees `SQLITE_BUSY` (or, in libSQL terms, `Busy` / `BusySnapshot`).
3. Otherwise, stamp the new versions with the commit timestamp, append a WAL log record describing the row deltas, and fsync. The pager's WAL machinery is reused unchanged.
Conflict detection is **row-level**, not page-level. This is the critical contrast with the upstream SQLite `BEGIN CONCURRENT` patch, which only resolves at page granularity and forces aborts when two transactions touch unrelated rows that happen to share a page.
### 2.6 Checkpointing
The MVCC log is "eventually checkpointed into the SQLite database file via the WAL." Specifically: the writer that checkpoints walks the committed-but-unmaterialized portion of the version chain, materializes the latest visible version of each row into the page cache, and lets the regular pager flush it to the WAL/main file. Recovery on reopen relies on the WAL plus the persistent on-disk state β the in-memory MVCC log is rebuilt empty.
### 2.7 Garbage collection
Versions whose `end` timestamp is older than the oldest active reader's begin-timestamp are dead and may be reclaimed. Turso's implementation today uses a contiguous `Vec` per row guarded by an `RwLock`, with a scan-based GC pass that runs opportunistically. The known issue ([turso#3499](https://github.com/tursodatabase/turso/issues/3499)) is that the lock is the bottleneck under heavy contention; a wait-free chain is on their roadmap.
### 2.8 Limitations (as of writing)
- `CREATE INDEX` is not yet supported in MVCC mode β neither maintaining secondary indexes through the version chain nor reading them under snapshot isolation.
- `AUTOINCREMENT` is not supported (the global counter is not yet versioned).
- `wal_checkpoint(TRUNCATE)` is the only checkpoint variant; it blocks readers and writers.
- "Eager loading" β first access to a table loads it fully into the version index. Memory cost scales with the table, not the working set.
- Row deltas are full row copies, not column-level diffs ([turso#3498](https://github.com/tursodatabase/turso/issues/3498)).
- I/O is synchronous; `io_uring` adoption is on the roadmap ([turso#1848](https://github.com/tursodatabase/turso/issues/1848)).
These are honest limits, not gotchas. They suggest where SQLRite's plan should explicitly punt, and where it might do better by virtue of being smaller.
---
## 3. SQLRite's current concurrency model
A clear-eyed inventory of what we have today, since the gap is what the plan has to bridge.
### 3.1 Process-level locking ([`src/sql/pager/pager.rs`](../src/sql/pager/pager.rs))
`Pager::open` takes `flock(LOCK_EX | LOCK_NB)` on both the main file and the WAL sidecar via `fs2`. `Pager::open_read_only` takes `flock(LOCK_SH | LOCK_NB)`. Reads and writes are mutually exclusive *across* processes β POSIX flock semantics, no shared-memory coordination file. **Within** a process, the engine runs one writer because the public surface only exposes one `Connection` per `Database`/`Pager`.
### 3.2 Transactions ([`src/sql/db/database.rs:149-195`](../src/sql/db/database.rs))
`BEGIN` deep-clones the entire `tables: HashMap<String, Table>` into a `TxnSnapshot` stashed on `Database::txn`. `ROLLBACK` swaps it back. `COMMIT` clears the snapshot and the next `save_database` call appends one WAL commit frame. Auto-save is suppressed while a transaction is open. Nested begins are rejected. There is no concept of read-set / write-set tracking, no logical clock, no row-version metadata.
### 3.3 Storage ([`src/sql/pager/`](../src/sql/pager/) + [Decision 6 in design-decisions.md](design-decisions.md))
- 4 KiB pages, cell-encoded B-trees per table and per secondary index.
- **`save_database` rebuilds every B-tree bottom-up on every commit** from sorted in-memory rows. This is correct-by-construction but assumes a single committer β concurrent committers would each want to rebuild the same tree from a different in-memory state.
- The pager owns three page maps (`on_disk` / `wal_cache` / `staged`), all `HashMap<u32, Box<[u8; PAGE_SIZE]>>`. There is no per-page locking or versioning; reads inside the engine are sequential.
### 3.4 Connection model ([`src/connection.rs`](../src/connection.rs))
`Connection` owns its `Database` by value. It is `Send`, **not `Sync`**, and has no `clone()`. The engine's documented embedding pattern when callers want multi-threaded access is to wrap the connection in `Mutex<Connection>` β i.e. serialize access externally.
### 3.5 What this means
To get from "one writer, one process" to "many writers, one process, conflict-detected at commit" we need to change every layer:
1. **Connection layer**: turn `Connection` into a thin handle on a shared `Database`/`Pager` so multiple connections can coexist within a process. Today there's only ever one `Connection` per `Database`.
2. **Transaction layer**: replace deep-clone snapshots with per-transaction read/write sets and a logical clock. Snapshots scale O(database size) on every BEGIN, which is unworkable with many concurrent transactions.
3. **Storage layer**: introduce an MVCC version index sitting in front of the existing pager. Writes land in the index and only become pager-visible at checkpoint time; reads consult the index first, the pager second.
4. **Commit path**: validate write sets against committed-after-begin versions. On conflict, return a typed `Busy` error and let the caller retry. On success, append a WAL transaction record covering the row deltas.
5. **Checkpoint**: drain the version index into the existing bottom-up B-tree rebuild path. The legacy commit path doesn't go away β `BEGIN` and `BEGIN IMMEDIATE` keep their current single-writer semantics β but `BEGIN CONCURRENT` short-circuits past it until checkpoint time.
The good news: the existing WAL and the existing on-disk format are reusable as-is. Turso's checkpoint design ("MVCC log β page cache β WAL β main file") works because the WAL is the same WAL with or without MVCC. The same is true for SQLRite's WAL.
---
## 4. Proposed design
A high-level sketch of the layered architecture this proposal targets.
```
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Connection (handle) ββ Connection (handle) ββ ... β <- many per process
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β Shared Database (Arc<DatabaseInner>) β
β β
β Schema cache β MVCC store β Logical clock β
ββββββββββββββββββββββ΄βββββββββββββββββ¬ββ΄βββββββββββββββββββ€
β β β
β Pager (existing) β WAL β main fileβ GC worker β
βββββββββββββββββββββββββββββββββββββββ΄βββββββββββββββββββββ
```
### 4.1 Logical clock
A monotonic `u64` counter, per-`Database`. Hands out `begin_ts` at `BEGIN CONCURRENT` and `commit_ts` at the start of validation. Wrapped in `AtomicU64`; no contention because each transaction calls it twice. Persisted to the WAL header on each commit so reopens resume past the highest committed timestamp. (Turso's [`MvccClock`](https://github.com/tursodatabase/turso/blob/main/core/mvcc/clock.rs) is the same shape.)
### 4.2 Version index
```rust
pub struct MvStore {
// (table_id, rowid) -> head of version chain
versions: DashMap<RowID, RowVersionChain>,
clock: MvccClock,
active: ActiveTxRegistry, // for GC: min active begin_ts
}
pub struct RowVersion {
begin: TxTsOrId, // commit_ts once committed; tx_id while in-flight
end: Option<TxTsOrId>, // None == latest visible version
payload: Row, // for v0; column deltas later (turso#3498-style)
next: Option<Arc<RowVersion>>,
}
```
A few intentional simplifications versus Turso for v0:
- One chain per row, behind `RwLock` (or `parking_lot::RwLock`). The wait-free chain is a known follow-up; it's not on the v0 critical path.
- Full-row payloads (defer column-level deltas).
- Lazy materialization from the pager on first touch. Deviates from Turso's "load whole table eagerly", which they themselves treat as debt; we'd be starting from the better default.
### 4.3 Transaction record
```rust
pub struct ConcurrentTx {
id: TxId,
begin_ts: u64,
state: TxState, // Active | Committed | Aborted
read_set: HashSet<RowID>, // for serializable-isolation upgrade later
write_set: HashMap<RowID, Arc<RowVersion>>,
schema_snapshot: Arc<SchemaSnapshot>, // pinned at BEGIN
}
```
Read-set tracking is optional for snapshot isolation but cheap to keep β and unlocks serializable later if we want it.
### 4.4 Read protocol
```
read(row_id, begin_ts):
if row_id in MvStore.versions:
walk chain, return version with begin <= begin_ts < end
else:
bytes = pager.read_via_existing_path(row_id)
materialize into MvStore.versions as a single committed-at-zero version
return it
```
### 4.5 Commit protocol
```
commit(tx):
commit_ts = clock.tick()
for row_id in tx.write_set:
chain = MvStore.versions[row_id]
head = chain.read()
if head.begin > tx.begin_ts: # someone committed under us
return Err(Busy)
for row_id, new_version in tx.write_set:
new_version.begin = commit_ts
MvStore.versions[row_id].push(new_version)
pager.append_wal_log_record(tx.write_set, commit_ts) # one fsync
return Ok
```
The validation pass walks the write set only β typically tiny. The fsync is one barrier per transaction, same cost as today's `commit_transaction`.
### 4.6 Checkpoint
Existing checkpoint logic ([Phase 4d](roadmap.md)) folds WAL frames into the main file when frame count crosses `AUTO_CHECKPOINT_THRESHOLD_FRAMES`. The MVCC variant adds a step before that: walk the version index, pick the latest committed version per row visible at `min(active begin_ts) - 1`, hand it to `save_database` (which rebuilds the B-tree). Older versions get GC'd at the same time. Schema changes still require an exclusive lock.
### 4.7 Garbage collection
A separate worker (or piggy-backed on commit) walks `MvStore.versions` and prunes versions whose `end` timestamp is below the oldest active reader's `begin_ts`. v0 can run this synchronously inside `commit` β sweep the rows touched in the write set only β and add an explicit background sweep later.
### 4.8 Compatibility with existing transactions
`BEGIN` and `BEGIN IMMEDIATE` keep their current semantics β full deep-clone snapshot, exclusive write, no row-version tracking. They remain the right choice for DDL. A `BEGIN IMMEDIATE` running on the same connection blocks new `BEGIN CONCURRENT` commits (they get `Busy`) and waits for in-flight ones to finish. This matches Turso's "exclusive vs concurrent" coexistence rule.
---
## 5. Phased plan
Sequenced so each sub-phase is independently shippable, follows SQLRite's existing roadmap discipline (every phase ends with a working build + full test suite + a commit on `main`), and the engine never regresses while in flight. Tentatively numbered as **Phase 10** (Phase 9 = benchmarks; Phase 8 = FTS, shipped). Slot may shift depending on roadmap rebalancing.
### Phase 10.1 β Multi-connection foundation
Goal: more than one `Connection` can target the same `Database` within a process, with no behavior change otherwise.
- Split `Database` into `DatabaseInner` (state) + `Connection` (handle that holds `Arc<DatabaseInner>`).
- Move the `Pager` behind a `Mutex` or `RwLock` inside `DatabaseInner`.
- New `Database::connect() -> Connection`. The existing `Connection::open` path opens the database and returns a freshly-connected handle.
- Make the engine `Send + Sync` end-to-end. No new SQL surface yet.
- Tests: spawn N threads, each running independent `BEGIN; INSERT; COMMIT` against the same DB, prove the existing single-writer locking still serializes them correctly with no panics.
### Phase 10.2 β Logical clock + transaction registry
- Add `MvccClock` with `tick() -> u64` + `now() -> u64`.
- Add `ActiveTxRegistry` exposing `min_active_begin_ts()` for the GC.
- Persist the clock high-water-mark in the WAL header (extend `WalHeader` with a 16-byte field; bump `WAL_FORMAT_VERSION` from 1 β 2 β backward-compatible because the WAL is regenerated on checkpoint).
- Tests: round-trip the clock through reopen, reject WAL files where the persisted timestamp is non-monotonic.
### Phase 10.3 β `MvStore` skeleton + reads
- Implement `MvStore` and `RowVersion`. v0: one chain per row behind `RwLock<Vec<RowVersion>>`.
- Wire reads through the store: lazy-load on first touch, return the visible version for the caller's `begin_ts`.
- New SQL gate: `PRAGMA journal_mode = mvcc` switches the connection into MVCC mode for reads. Writes still go through the legacy path.
- Tests: snapshot isolation reads β one connection inserts and commits, a second connection that began before the commit doesn't see the new row; a third connection that began after does.
### Phase 10.4 β `BEGIN CONCURRENT` writes
- Extend the parser to map `BEGIN CONCURRENT` (sqlparser already recognizes the modifier) to a new `TxKind::Concurrent`.
- Route writes from `TxKind::Concurrent` transactions into the `MvStore`'s write set rather than the live `tables` map.
- Implement the commit-time validation loop. New `SQLRiteError::Busy` and `SQLRiteError::BusySnapshot` variants ([decision: use both, mirroring Turso, so SDKs can map cleanly](https://github.com/tursodatabase/turso/blob/main/core/error.rs)).
- WAL log record format: a new frame kind carrying `(table_id, rowid, op, payload)` tuples. Distinct from the existing per-page commit frame; the checkpointer flattens log records into page-level updates.
- Tests:
- two concurrent inserts on disjoint rowids both commit
- two concurrent updates on the same rowid: one commits, one aborts with `Busy`
- aborted transaction's writes never become visible
- retry-after-`Busy` succeeds
### Phase 10.5 β Checkpoint + crash recovery
> **Status (roadmap 11.9 β May 2026):** The crash-recovery half landed in roadmap Phase 11.9. WAL format is bumped to v3; commits append a typed `MvccCommitBatch` frame before the legacy save's fsync; reopen replays those frames into `MvStore` and seeds `MvccClock` past the highest `commit_ts`. The checkpoint-drain half β folding MVCC log records into pager-level updates and re-enabling the `Mvcc β Wal` journal-mode downgrade β is the remaining slice and stays parked for a follow-up.
- ~~Extend the checkpointer to drain MVCC log records into pager-level updates before folding the WAL into the main file.~~ *Deferred β see status note above.*
- Crash recovery: on open, replay WAL log records into `MvStore`, then replay pager-level commit frames as today. **(Shipped β 11.9.)**
- Tests: kill the process mid-MVCC-commit (between log-record append and version-chain push), reopen, verify the committed transaction is visible and the half-written one is not. **(Shipped β 11.9 covers the clean-drop case which exercises the same recovery codepath; a real OS-kill test is parked with the checkpoint-drain follow-up.)**
### Phase 10.6 β Garbage collection
- Per-commit sweep over the write set's chains.
- Background sweep behind a new `PRAGMA mvcc_gc_interval_ms` (default 1000).
- Memory-pressure trigger: if `MvStore` size > `mvcc_max_memory_bytes`, force a sweep at the next safe point.
- Tests: chain length stays bounded under heavy single-row update pressure.
### Phase 10.7 β Indexes under MVCC (deferred-by-design β separate later phase)
Index maintenance under MVCC is hard enough that Turso explicitly punted on it. SQLRite should ship the v0 with `CREATE INDEX` rejected when `journal_mode = mvcc`, and tackle indexes as a follow-up phase once the v0 is stable. Sketched solution: each secondary-index entry becomes a `RowVersion` itself, keyed by `(index_id, key, rowid)`. The cost is one version chain per indexed (column, row) pair.
### Phase 10.8 β Public API + SDK propagation
- Surface `Connection::begin_concurrent() -> Result<()>` as a typed convenience wrapper.
- Surface `SQLRiteError::Busy` and `SQLRiteError::BusySnapshot` through the FFI shim ([`sqlrite-ffi/`](../sqlrite-ffi/)) and each SDK ([Python](../sdk/python/), [Node](../sdk/nodejs/), [Go](../sdk/go/), [WASM](../sdk/wasm/)). The right level of abstraction: an exception per SDK plus a typed retry helper.
- New REPL meta-command `.spawn` that opens an additional connection to the same DB so users can demo `BEGIN CONCURRENT` interactively.
- New benchmark workload in [`benchmarks/`](../benchmarks/) that pits SQLRite-MVCC against SQLite + SQLRite-default + Turso on a "N writers, mostly disjoint rows" scenario. Slots into the existing SQLR-16 harness as a Group D differentiator workload.
### Phase 10.9 β Docs
> **Status (roadmap 11.12 β May 2026):** Shipped. The canonical user-facing reference at [`docs/concurrent-writes.md`](concurrent-writes.md) covers the SQL surface, embedding API, SDK error mapping, REPL meta-commands, durability story, and limitations as of Phase 11.11a. This plan-doc is now the historical record. Cross-references in `_index.md`, `supported-sql.md`, `embedding.md`, and `design-decisions.md` point at the canonical doc; a runnable example lives at [`examples/rust/concurrent_writers.rs`](../examples/rust/concurrent_writers.rs).
- Promote this plan to `docs/concurrent-writes.md` (the canonical user-facing reference), keeping `concurrent-writes-plan.md` as the historical design document. **(Shipped β 11.12.)**
- Update [roadmap.md](roadmap.md), [`docs/_index.md`](_index.md), [supported-sql.md](supported-sql.md), [embedding.md](embedding.md), [design-decisions.md](design-decisions.md). **(Shipped β 11.12.)**
- Add a worked example under `examples/rust/concurrent_writers.rs`. **(Shipped β 11.12.)**
---
## 6. SQL & API surface
```sql
-- Opt in (per connection, per session)
PRAGMA journal_mode = mvcc;
-- Snapshot-isolation transaction with row-level conflict detection at commit
BEGIN CONCURRENT;
UPDATE accounts SET balance = balance - 50 WHERE id = 1;
UPDATE accounts SET balance = balance + 50 WHERE id = 2;
COMMIT; -- may return SQLRITE_BUSY β caller retries
-- Existing transactions still work (single-writer, exclusive)
BEGIN; -- aka BEGIN DEFERRED, identical to today
BEGIN IMMEDIATE; -- new alias of BEGIN; documented for Turso parity
```
```rust
// Rust embedding API
let conn = db.connect()?; // new β multi-connection
conn.execute("PRAGMA journal_mode = mvcc")?;
loop {
conn.execute("BEGIN CONCURRENT")?;
conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")?;
conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")?;
match conn.execute("COMMIT") {
Ok(_) => break,
Err(SQLRiteError::Busy | SQLRiteError::BusySnapshot) => {
conn.execute("ROLLBACK").ok();
continue;
}
Err(e) => return Err(e),
}
}
```
A typed `Connection::with_concurrent_tx<F>(f: F)` helper that hides the retry loop is worth considering once the bare API is shaken out.
---
## 7. File-format implications
- The on-disk `.sqlrite` file format (currently v5) **does not change**. MVCC state lives in memory and in the WAL.
- The WAL header gains a `clock_high_water: u64` field. Bump `WAL_FORMAT_VERSION` from 1 β 2. Pre-v2 WALs open as v1 (clock starts from 0); the next checkpoint rewrites the header at v2.
- A new WAL frame kind for MVCC log records. Existing per-page commit frames keep their format. Both kinds can coexist in the same WAL β the checkpointer drains log records first, then per-page commits, then truncates.
This keeps `sqlrite-engine` v0.x readers compatible with v1.x files and vice versa as long as MVCC isn't enabled at write time.
---
## 8. Risks, open questions, and explicit non-goals
### Risks
- **Bottom-up B-tree rebuild ([Decision 6](design-decisions.md))** is the biggest architectural mismatch. Today every commit rebuilds the whole tree. With MVCC the rebuild only runs at checkpoint, so the cost is amortized β but tables that grow large between checkpoints can produce huge checkpoint stalls. Mitigation: in-place B-tree updates are an obvious follow-up, separately tracked.
- **Memory growth** β the version index holds full row payloads. Mitigation: column-level deltas (Turso's [#3498](https://github.com/tursodatabase/turso/issues/3498)) and an `mvcc_max_memory_bytes` cap.
- **Lock contention on the per-row chain** β `RwLock<Vec<RowVersion>>` is good enough for v0 but won't scale to extreme concurrency. Mitigation: wait-free chain (Turso's [#3499](https://github.com/tursodatabase/turso/issues/3499)) as a v1 follow-up.
- **DDL under MVCC** β schema changes need an exclusive promotion of the lock. v0 should reject DDL inside a `BEGIN CONCURRENT` block with a clean error.
- **Multi-process MVCC is out of scope.** Turso is intra-process; the in-memory version index has no cross-process visibility, and adding it would require a shared-memory coordination file (the same machinery SQLite's WAL uses for read marks). Multi-process writes continue to be serialized by the existing `flock(LOCK_EX)`. Documented as a non-goal until someone has a workload that justifies the engineering.
### Open questions
- **Should `journal_mode = mvcc` be per-connection or per-database?** Turso treats it per-connection; SQLite's `journal_mode` is per-database. Per-connection is more flexible (a maintenance connection can stay in WAL mode while app connections use MVCC), but per-database is closer to user expectation.
- **Should `MvStore` materialize the whole table on first touch, or one row at a time?** Per-row is cheaper and avoids Turso's eager-load complaint. The cost is one `pager.read` per cold row.
- **Does `BEGIN CONCURRENT` upgrade to exclusive on the first write to a versioned table that has no `MvStore` entry?** The simpler answer is no β first read materializes the row into the index, no lock upgrade needed.
- **AUTOINCREMENT.** Globally-incrementing rowid counters are not naturally versioned. Acceptable v0 answer: reject INSERT into an AUTOINCREMENT table from a `BEGIN CONCURRENT` block (matches Turso). Better answer: a serializable counter sequence service. Defer.
- **Hermitage-style anomaly tests.** Turso's [`hermitage_tests.rs`](https://github.com/tursodatabase/turso/blob/main/core/mvcc/database/hermitage_tests.rs) is a port of [Martin Kleppmann's hermitage suite](https://github.com/ept/hermitage). SQLRite should adopt the same tests as the snapshot-isolation conformance bar.
### Non-goals (v0)
- Concurrent reads of an in-progress checkpoint (Turso's `wal_checkpoint(TRUNCATE)` blocks; ours can too).
- Automatic backoff in the SDK retry helpers β the caller picks the policy.
- Cross-process MVCC.
- Asynchronous I/O / `io_uring`.
---
## 9. Estimated effort
Order-of-magnitude only β each sub-phase is a roadmap entry and gets a real estimate when scheduled.
| Sub-phase | Rough size | Notes |
|---|---|---|
| 10.1 multi-connection | M | mostly mechanical, tests heavy |
| 10.2 clock + tx registry | S | small, isolated |
| 10.3 `MvStore` reads | M | new module, lazy materialization |
| 10.4 `BEGIN CONCURRENT` writes | L | the meat |
| 10.5 checkpoint + recovery | M | reuses existing checkpointer |
| 10.6 GC | SβM | one synchronous + one background |
| 10.7 indexes under MVCC | L | deferred |
| 10.8 SDK propagation | M | one slice per SDK |
| 10.9 docs | S | one pass once 10.4 lands |
Total before 10.7: β 4β6 weeks of focused work for a single engineer, given the existing test discipline. Index maintenance under MVCC (10.7) is its own multi-week effort and may slip to a separate phase.
---
## 10. References
- [Turso concurrent writes β official docs](https://docs.turso.tech/tursodb/concurrent-writes)
- [Turso source β `core/mvcc/`](https://github.com/tursodatabase/turso/tree/main/core/mvcc)
- [Turso blog β Beyond the single-writer limitation](https://turso.tech/blog/beyond-the-single-writer-limitation-with-tursos-concurrent-writes)
- [Turso blog β v0.5.0 release](https://turso.tech/blog/turso-0.5.0)
- [Turso DeepWiki β MVCC architecture overview](https://deepwiki.com/tursodatabase/turso)
- Larson, Blanas, Diaconu, Freedman, Patel, Zwilling β [*High-Performance Concurrency Control Mechanisms for Main-Memory Databases*, VLDB 2011](https://www.microsoft.com/en-us/research/wp-content/uploads/2011/01/main-mem-cc-techreport.pdf) β the Hekaton MVCC paper Turso cites.
- Levandoski, Lomet, Sengupta β [*The Bw-Tree: A B-tree for New Hardware Platforms*, ICDE 2013](https://www.microsoft.com/en-us/research/publication/the-bw-tree-a-b-tree-for-new-hardware-platforms/) β referenced by Turso for the lock-free index path.
- [SQLite `BEGIN CONCURRENT` (experimental, page-level)](https://sqlite.org/cgi/src/doc/begin-concurrent/doc/begin_concurrent.md)
- Martin Kleppmann β [Hermitage anomaly test suite](https://github.com/ept/hermitage)
- SQLRite internals: [pager.md](pager.md), [file-format.md](file-format.md), [design-decisions.md](design-decisions.md), [roadmap.md](roadmap.md)