sqlrite-engine 0.10.0

Light version of SQLite developed with Rust. Published as `sqlrite-engine` on crates.io; import as `use sqlrite::…`.
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
# Concurrent writes plan β€” MVCC + `BEGIN CONCURRENT`

> πŸ“˜ **Looking for the user-facing reference?** This is the original
> design proposal, kept as the historical record of the decisions
> that shaped Phase 11. For the shipped surface β€” SQL, embedding API,
> SDK error mapping, REPL meta-commands, durability story,
> limitations β€” read [**`concurrent-writes.md`**](concurrent-writes.md)
> first; come back here when you want the *why* and the
> sequencing discussion.

**Status:** shipped end-to-end through Phase 11.11a (May 2026); a small set of follow-ups remain explicitly parked β€” see the [roadmap](roadmap.md#phase-11--concurrent-writes-via-mvcc--begin-concurrent-sqlr-22-in-flight--see-concurrent-writes-planmd). Drafted 2026-05-07.
**Inspiration:** [Turso](https://turso.tech) β€” a SQLite-compatible engine, written in Rust, that implements multi-version concurrency control to lift SQLite's single-writer ceiling. See [`turso/core/mvcc/`](https://github.com/tursodatabase/turso/tree/main/core/mvcc) and the [Turso concurrent-writes docs](https://docs.turso.tech/tursodb/concurrent-writes).
**Tracks:** [SQLR-22](https://app.marvinapp.io/) (Marvin).

This document proposes adding **multi-version concurrency control (MVCC)** and a **`BEGIN CONCURRENT`** transaction mode to SQLRite, enabling multiple writers in the same process to make progress in parallel under snapshot isolation, with row-level write-write conflict detection at commit. It is intentionally a *plan* β€” there is no code yet.

The output of this work, when shipped, would put SQLRite in the same conceptual bucket as Turso's `--experimental-mvcc`: the file format stays compatible with the existing v5 layout, the legacy serialized path keeps working, and apps opt in by setting a journal mode and using a new `BEGIN CONCURRENT` statement.

---

## 1. Why bother

SQLite's single-writer rule is the design choice users hit first when they scale. It's not a bug β€” it's the consequence of a single global file lock (`PENDING`/`EXCLUSIVE`) that serializes commit ordering β€” but in workloads where two concurrent writers touch *disjoint* rows it costs throughput unnecessarily.

Turso's measurements report **up to 4Γ— the write throughput of SQLite** when the workload is dominated by non-conflicting writes ([Turso v0.5.0 blog post](https://turso.tech/blog/turso-0.5.0); [Beyond the single-writer limitation](https://turso.tech/blog/beyond-the-single-writer-limitation-with-tursos-concurrent-writes)). For SQLRite β€” which today inherits SQLite's single-writer model end-to-end (advisory `flock(LOCK_EX)` plus a single `Pager` per process; see [pager.md](pager.md) and [Decision 11 in design-decisions.md](design-decisions.md)) β€” the same uplift is theoretically available, *if* the engine learns to track row versions and detect write-write conflicts at commit instead of holding an exclusive write lock for the duration.

This is also a natural fit for SQLRite's stated remit (a hands-on study of database internals). MVCC, snapshot isolation, optimistic concurrency control, and version-chain garbage collection are textbook database-systems topics; implementing them turns the engine into a credible reference for those topics rather than a SQLite clone with extra storage tricks.

---

## 2. What Turso does

Turso's MVCC implementation is the most directly relevant prior art β€” same target SQL, same general layout, same Rust language, and they've already navigated most of the design questions we'd face. A summary of the moving parts (cross-checked against [`core/mvcc/mod.rs`](https://github.com/tursodatabase/turso/blob/main/core/mvcc/mod.rs) and [`core/mvcc/database/mod.rs`](https://github.com/tursodatabase/turso/blob/main/core/mvcc/database/mod.rs)):

### 2.1 Activation

```sql
PRAGMA journal_mode = mvcc;     -- per-connection switch
BEGIN CONCURRENT;
-- writes
COMMIT;                          -- may return SQLITE_BUSY β†’ caller retries
```

Compiled in behind `--experimental-mvcc`. `BEGIN CONCURRENT` doesn't acquire any locks; `BEGIN IMMEDIATE` / `BEGIN DEFERRED` retain SQLite's exclusive-write semantics.

### 2.2 In-memory version index

The MVCC store keeps an in-memory map keyed by `RowID { table_id, row_key }` whose value is a chain of `RowVersion` records. Each version carries:

- `begin: TxTimestampOrID` β€” the commit timestamp (or transaction ID, while uncommitted) at which this version becomes visible
- `end: Option<TxTimestampOrID>` β€” when it stops being visible (a later UPDATE/DELETE pushes a new head onto the chain)
- the row payload itself

Visibility for a reader transaction with begin-timestamp `T` is the textbook snapshot-isolation rule: pick the version whose `begin <= T < end`.

The Hekaton paper (Larson et al., *VLDB 2011*) is the explicit reference. The same paper also covers the validation phase used at commit and the garbage-collection rules sketched in Β§7 below.

### 2.3 Read path

- If the row is present in the MVCC index β†’ use it (latest-wins; all writes funnel through the index, so the in-memory copy is the source of truth).
- Otherwise β†’ fall through to the existing pager β†’ WAL β†’ main-file read path. Result is materialized into the index on first touch ("eager loaded" today, which is the source of Turso's memory-pressure caveats).

### 2.4 Write path

Writes go entirely to the MVCC index, tagged with the transaction ID. They're invisible to other transactions until the writer commits. The transaction also tracks a **read set** and a **write set** for validation.

### 2.5 Commit / conflict detection

Three phases:

1. Take a commit timestamp from a logical clock.
2. Walk the write set; for each row, check whether *any* version newer than the transaction's begin timestamp exists in the index. If yes β†’ write-write conflict. The transaction is aborted and the caller sees `SQLITE_BUSY` (or, in libSQL terms, `Busy` / `BusySnapshot`).
3. Otherwise, stamp the new versions with the commit timestamp, append a WAL log record describing the row deltas, and fsync. The pager's WAL machinery is reused unchanged.

Conflict detection is **row-level**, not page-level. This is the critical contrast with the upstream SQLite `BEGIN CONCURRENT` patch, which only resolves at page granularity and forces aborts when two transactions touch unrelated rows that happen to share a page.

### 2.6 Checkpointing

The MVCC log is "eventually checkpointed into the SQLite database file via the WAL." Specifically: the writer that checkpoints walks the committed-but-unmaterialized portion of the version chain, materializes the latest visible version of each row into the page cache, and lets the regular pager flush it to the WAL/main file. Recovery on reopen relies on the WAL plus the persistent on-disk state β€” the in-memory MVCC log is rebuilt empty.

### 2.7 Garbage collection

Versions whose `end` timestamp is older than the oldest active reader's begin-timestamp are dead and may be reclaimed. Turso's implementation today uses a contiguous `Vec` per row guarded by an `RwLock`, with a scan-based GC pass that runs opportunistically. The known issue ([turso#3499](https://github.com/tursodatabase/turso/issues/3499)) is that the lock is the bottleneck under heavy contention; a wait-free chain is on their roadmap.

### 2.8 Limitations (as of writing)

- `CREATE INDEX` is not yet supported in MVCC mode β€” neither maintaining secondary indexes through the version chain nor reading them under snapshot isolation.
- `AUTOINCREMENT` is not supported (the global counter is not yet versioned).
- `wal_checkpoint(TRUNCATE)` is the only checkpoint variant; it blocks readers and writers.
- "Eager loading" β€” first access to a table loads it fully into the version index. Memory cost scales with the table, not the working set.
- Row deltas are full row copies, not column-level diffs ([turso#3498](https://github.com/tursodatabase/turso/issues/3498)).
- I/O is synchronous; `io_uring` adoption is on the roadmap ([turso#1848](https://github.com/tursodatabase/turso/issues/1848)).

These are honest limits, not gotchas. They suggest where SQLRite's plan should explicitly punt, and where it might do better by virtue of being smaller.

---

## 3. SQLRite's current concurrency model

A clear-eyed inventory of what we have today, since the gap is what the plan has to bridge.

### 3.1 Process-level locking ([`src/sql/pager/pager.rs`](../src/sql/pager/pager.rs))

`Pager::open` takes `flock(LOCK_EX | LOCK_NB)` on both the main file and the WAL sidecar via `fs2`. `Pager::open_read_only` takes `flock(LOCK_SH | LOCK_NB)`. Reads and writes are mutually exclusive *across* processes β€” POSIX flock semantics, no shared-memory coordination file. **Within** a process, the engine runs one writer because the public surface only exposes one `Connection` per `Database`/`Pager`.

### 3.2 Transactions ([`src/sql/db/database.rs:149-195`](../src/sql/db/database.rs))

`BEGIN` deep-clones the entire `tables: HashMap<String, Table>` into a `TxnSnapshot` stashed on `Database::txn`. `ROLLBACK` swaps it back. `COMMIT` clears the snapshot and the next `save_database` call appends one WAL commit frame. Auto-save is suppressed while a transaction is open. Nested begins are rejected. There is no concept of read-set / write-set tracking, no logical clock, no row-version metadata.

### 3.3 Storage ([`src/sql/pager/`](../src/sql/pager/) + [Decision 6 in design-decisions.md](design-decisions.md))

- 4 KiB pages, cell-encoded B-trees per table and per secondary index.
- **`save_database` rebuilds every B-tree bottom-up on every commit** from sorted in-memory rows. This is correct-by-construction but assumes a single committer β€” concurrent committers would each want to rebuild the same tree from a different in-memory state.
- The pager owns three page maps (`on_disk` / `wal_cache` / `staged`), all `HashMap<u32, Box<[u8; PAGE_SIZE]>>`. There is no per-page locking or versioning; reads inside the engine are sequential.

### 3.4 Connection model ([`src/connection.rs`](../src/connection.rs))

`Connection` owns its `Database` by value. It is `Send`, **not `Sync`**, and has no `clone()`. The engine's documented embedding pattern when callers want multi-threaded access is to wrap the connection in `Mutex<Connection>` β€” i.e. serialize access externally.

### 3.5 What this means

To get from "one writer, one process" to "many writers, one process, conflict-detected at commit" we need to change every layer:

1. **Connection layer**: turn `Connection` into a thin handle on a shared `Database`/`Pager` so multiple connections can coexist within a process. Today there's only ever one `Connection` per `Database`.
2. **Transaction layer**: replace deep-clone snapshots with per-transaction read/write sets and a logical clock. Snapshots scale O(database size) on every BEGIN, which is unworkable with many concurrent transactions.
3. **Storage layer**: introduce an MVCC version index sitting in front of the existing pager. Writes land in the index and only become pager-visible at checkpoint time; reads consult the index first, the pager second.
4. **Commit path**: validate write sets against committed-after-begin versions. On conflict, return a typed `Busy` error and let the caller retry. On success, append a WAL transaction record covering the row deltas.
5. **Checkpoint**: drain the version index into the existing bottom-up B-tree rebuild path. The legacy commit path doesn't go away β€” `BEGIN` and `BEGIN IMMEDIATE` keep their current single-writer semantics β€” but `BEGIN CONCURRENT` short-circuits past it until checkpoint time.

The good news: the existing WAL and the existing on-disk format are reusable as-is. Turso's checkpoint design ("MVCC log β†’ page cache β†’ WAL β†’ main file") works because the WAL is the same WAL with or without MVCC. The same is true for SQLRite's WAL.

---

## 4. Proposed design

A high-level sketch of the layered architecture this proposal targets.

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Connection (handle) ── Connection (handle) ── ...       β”‚   <- many per process
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚              Shared Database (Arc<DatabaseInner>)         β”‚
β”‚                                                          β”‚
β”‚    Schema cache    β”‚    MVCC store    β”‚   Logical clock  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                     β”‚                    β”‚
β”‚   Pager (existing) ─ WAL ─ main fileβ”‚   GC worker        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

### 4.1 Logical clock

A monotonic `u64` counter, per-`Database`. Hands out `begin_ts` at `BEGIN CONCURRENT` and `commit_ts` at the start of validation. Wrapped in `AtomicU64`; no contention because each transaction calls it twice. Persisted to the WAL header on each commit so reopens resume past the highest committed timestamp. (Turso's [`MvccClock`](https://github.com/tursodatabase/turso/blob/main/core/mvcc/clock.rs) is the same shape.)

### 4.2 Version index

```rust
pub struct MvStore {
    // (table_id, rowid) -> head of version chain
    versions: DashMap<RowID, RowVersionChain>,
    clock: MvccClock,
    active: ActiveTxRegistry,   // for GC: min active begin_ts
}

pub struct RowVersion {
    begin: TxTsOrId,            // commit_ts once committed; tx_id while in-flight
    end:   Option<TxTsOrId>,    // None == latest visible version
    payload: Row,               // for v0; column deltas later (turso#3498-style)
    next: Option<Arc<RowVersion>>,
}
```

A few intentional simplifications versus Turso for v0:

- One chain per row, behind `RwLock` (or `parking_lot::RwLock`). The wait-free chain is a known follow-up; it's not on the v0 critical path.
- Full-row payloads (defer column-level deltas).
- Lazy materialization from the pager on first touch. Deviates from Turso's "load whole table eagerly", which they themselves treat as debt; we'd be starting from the better default.

### 4.3 Transaction record

```rust
pub struct ConcurrentTx {
    id: TxId,
    begin_ts: u64,
    state: TxState,                          // Active | Committed | Aborted
    read_set: HashSet<RowID>,                // for serializable-isolation upgrade later
    write_set: HashMap<RowID, Arc<RowVersion>>,
    schema_snapshot: Arc<SchemaSnapshot>,    // pinned at BEGIN
}
```

Read-set tracking is optional for snapshot isolation but cheap to keep β€” and unlocks serializable later if we want it.

### 4.4 Read protocol

```
read(row_id, begin_ts):
    if row_id in MvStore.versions:
        walk chain, return version with begin <= begin_ts < end
    else:
        bytes = pager.read_via_existing_path(row_id)
        materialize into MvStore.versions as a single committed-at-zero version
        return it
```

### 4.5 Commit protocol

```
commit(tx):
    commit_ts = clock.tick()
    for row_id in tx.write_set:
        chain = MvStore.versions[row_id]
        head = chain.read()
        if head.begin > tx.begin_ts:        # someone committed under us
            return Err(Busy)
    for row_id, new_version in tx.write_set:
        new_version.begin = commit_ts
        MvStore.versions[row_id].push(new_version)
    pager.append_wal_log_record(tx.write_set, commit_ts)   # one fsync
    return Ok
```

The validation pass walks the write set only β€” typically tiny. The fsync is one barrier per transaction, same cost as today's `commit_transaction`.

### 4.6 Checkpoint

Existing checkpoint logic ([Phase 4d](roadmap.md)) folds WAL frames into the main file when frame count crosses `AUTO_CHECKPOINT_THRESHOLD_FRAMES`. The MVCC variant adds a step before that: walk the version index, pick the latest committed version per row visible at `min(active begin_ts) - 1`, hand it to `save_database` (which rebuilds the B-tree). Older versions get GC'd at the same time. Schema changes still require an exclusive lock.

### 4.7 Garbage collection

A separate worker (or piggy-backed on commit) walks `MvStore.versions` and prunes versions whose `end` timestamp is below the oldest active reader's `begin_ts`. v0 can run this synchronously inside `commit` β€” sweep the rows touched in the write set only β€” and add an explicit background sweep later.

### 4.8 Compatibility with existing transactions

`BEGIN` and `BEGIN IMMEDIATE` keep their current semantics β€” full deep-clone snapshot, exclusive write, no row-version tracking. They remain the right choice for DDL. A `BEGIN IMMEDIATE` running on the same connection blocks new `BEGIN CONCURRENT` commits (they get `Busy`) and waits for in-flight ones to finish. This matches Turso's "exclusive vs concurrent" coexistence rule.

---

## 5. Phased plan

Sequenced so each sub-phase is independently shippable, follows SQLRite's existing roadmap discipline (every phase ends with a working build + full test suite + a commit on `main`), and the engine never regresses while in flight. Tentatively numbered as **Phase 10** (Phase 9 = benchmarks; Phase 8 = FTS, shipped). Slot may shift depending on roadmap rebalancing.

### Phase 10.1 β€” Multi-connection foundation

Goal: more than one `Connection` can target the same `Database` within a process, with no behavior change otherwise.

- Split `Database` into `DatabaseInner` (state) + `Connection` (handle that holds `Arc<DatabaseInner>`).
- Move the `Pager` behind a `Mutex` or `RwLock` inside `DatabaseInner`.
- New `Database::connect() -> Connection`. The existing `Connection::open` path opens the database and returns a freshly-connected handle.
- Make the engine `Send + Sync` end-to-end. No new SQL surface yet.
- Tests: spawn N threads, each running independent `BEGIN; INSERT; COMMIT` against the same DB, prove the existing single-writer locking still serializes them correctly with no panics.

### Phase 10.2 β€” Logical clock + transaction registry

- Add `MvccClock` with `tick() -> u64` + `now() -> u64`.
- Add `ActiveTxRegistry` exposing `min_active_begin_ts()` for the GC.
- Persist the clock high-water-mark in the WAL header (extend `WalHeader` with a 16-byte field; bump `WAL_FORMAT_VERSION` from 1 β†’ 2 β€” backward-compatible because the WAL is regenerated on checkpoint).
- Tests: round-trip the clock through reopen, reject WAL files where the persisted timestamp is non-monotonic.

### Phase 10.3 β€” `MvStore` skeleton + reads

- Implement `MvStore` and `RowVersion`. v0: one chain per row behind `RwLock<Vec<RowVersion>>`.
- Wire reads through the store: lazy-load on first touch, return the visible version for the caller's `begin_ts`.
- New SQL gate: `PRAGMA journal_mode = mvcc` switches the connection into MVCC mode for reads. Writes still go through the legacy path.
- Tests: snapshot isolation reads β€” one connection inserts and commits, a second connection that began before the commit doesn't see the new row; a third connection that began after does.

### Phase 10.4 β€” `BEGIN CONCURRENT` writes

- Extend the parser to map `BEGIN CONCURRENT` (sqlparser already recognizes the modifier) to a new `TxKind::Concurrent`.
- Route writes from `TxKind::Concurrent` transactions into the `MvStore`'s write set rather than the live `tables` map.
- Implement the commit-time validation loop. New `SQLRiteError::Busy` and `SQLRiteError::BusySnapshot` variants ([decision: use both, mirroring Turso, so SDKs can map cleanly](https://github.com/tursodatabase/turso/blob/main/core/error.rs)).
- WAL log record format: a new frame kind carrying `(table_id, rowid, op, payload)` tuples. Distinct from the existing per-page commit frame; the checkpointer flattens log records into page-level updates.
- Tests:
  - two concurrent inserts on disjoint rowids both commit
  - two concurrent updates on the same rowid: one commits, one aborts with `Busy`
  - aborted transaction's writes never become visible
  - retry-after-`Busy` succeeds

### Phase 10.5 β€” Checkpoint + crash recovery

> **Status (roadmap 11.9 β€” May 2026):** The crash-recovery half landed in roadmap Phase 11.9. WAL format is bumped to v3; commits append a typed `MvccCommitBatch` frame before the legacy save's fsync; reopen replays those frames into `MvStore` and seeds `MvccClock` past the highest `commit_ts`. The checkpoint-drain half β€” folding MVCC log records into pager-level updates and re-enabling the `Mvcc β†’ Wal` journal-mode downgrade β€” is the remaining slice and stays parked for a follow-up.

- ~~Extend the checkpointer to drain MVCC log records into pager-level updates before folding the WAL into the main file.~~ *Deferred β€” see status note above.*
- Crash recovery: on open, replay WAL log records into `MvStore`, then replay pager-level commit frames as today. **(Shipped β€” 11.9.)**
- Tests: kill the process mid-MVCC-commit (between log-record append and version-chain push), reopen, verify the committed transaction is visible and the half-written one is not. **(Shipped β€” 11.9 covers the clean-drop case which exercises the same recovery codepath; a real OS-kill test is parked with the checkpoint-drain follow-up.)**

### Phase 10.6 β€” Garbage collection

- Per-commit sweep over the write set's chains.
- Background sweep behind a new `PRAGMA mvcc_gc_interval_ms` (default 1000).
- Memory-pressure trigger: if `MvStore` size > `mvcc_max_memory_bytes`, force a sweep at the next safe point.
- Tests: chain length stays bounded under heavy single-row update pressure.

### Phase 10.7 β€” Indexes under MVCC (deferred-by-design β€” separate later phase)

Index maintenance under MVCC is hard enough that Turso explicitly punted on it. SQLRite should ship the v0 with `CREATE INDEX` rejected when `journal_mode = mvcc`, and tackle indexes as a follow-up phase once the v0 is stable. Sketched solution: each secondary-index entry becomes a `RowVersion` itself, keyed by `(index_id, key, rowid)`. The cost is one version chain per indexed (column, row) pair.

### Phase 10.8 β€” Public API + SDK propagation

- Surface `Connection::begin_concurrent() -> Result<()>` as a typed convenience wrapper.
- Surface `SQLRiteError::Busy` and `SQLRiteError::BusySnapshot` through the FFI shim ([`sqlrite-ffi/`](../sqlrite-ffi/)) and each SDK ([Python](../sdk/python/), [Node](../sdk/nodejs/), [Go](../sdk/go/), [WASM](../sdk/wasm/)). The right level of abstraction: an exception per SDK plus a typed retry helper.
- New REPL meta-command `.spawn` that opens an additional connection to the same DB so users can demo `BEGIN CONCURRENT` interactively.
- New benchmark workload in [`benchmarks/`](../benchmarks/) that pits SQLRite-MVCC against SQLite + SQLRite-default + Turso on a "N writers, mostly disjoint rows" scenario. Slots into the existing SQLR-16 harness as a Group D differentiator workload.

### Phase 10.9 β€” Docs

> **Status (roadmap 11.12 β€” May 2026):** Shipped. The canonical user-facing reference at [`docs/concurrent-writes.md`](concurrent-writes.md) covers the SQL surface, embedding API, SDK error mapping, REPL meta-commands, durability story, and limitations as of Phase 11.11a. This plan-doc is now the historical record. Cross-references in `_index.md`, `supported-sql.md`, `embedding.md`, and `design-decisions.md` point at the canonical doc; a runnable example lives at [`examples/rust/concurrent_writers.rs`](../examples/rust/concurrent_writers.rs).

- Promote this plan to `docs/concurrent-writes.md` (the canonical user-facing reference), keeping `concurrent-writes-plan.md` as the historical design document. **(Shipped β€” 11.12.)**
- Update [roadmap.md](roadmap.md), [`docs/_index.md`](_index.md), [supported-sql.md](supported-sql.md), [embedding.md](embedding.md), [design-decisions.md](design-decisions.md). **(Shipped β€” 11.12.)**
- Add a worked example under `examples/rust/concurrent_writers.rs`. **(Shipped β€” 11.12.)**

---

## 6. SQL & API surface

```sql
-- Opt in (per connection, per session)
PRAGMA journal_mode = mvcc;

-- Snapshot-isolation transaction with row-level conflict detection at commit
BEGIN CONCURRENT;
UPDATE accounts SET balance = balance - 50 WHERE id = 1;
UPDATE accounts SET balance = balance + 50 WHERE id = 2;
COMMIT;   -- may return SQLRITE_BUSY β†’ caller retries

-- Existing transactions still work (single-writer, exclusive)
BEGIN;             -- aka BEGIN DEFERRED, identical to today
BEGIN IMMEDIATE;   -- new alias of BEGIN; documented for Turso parity
```

```rust
// Rust embedding API
let conn = db.connect()?;          // new β€” multi-connection
conn.execute("PRAGMA journal_mode = mvcc")?;

loop {
    conn.execute("BEGIN CONCURRENT")?;
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")?;
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")?;
    match conn.execute("COMMIT") {
        Ok(_) => break,
        Err(SQLRiteError::Busy | SQLRiteError::BusySnapshot) => {
            conn.execute("ROLLBACK").ok();
            continue;
        }
        Err(e) => return Err(e),
    }
}
```

A typed `Connection::with_concurrent_tx<F>(f: F)` helper that hides the retry loop is worth considering once the bare API is shaken out.

---

## 7. File-format implications

- The on-disk `.sqlrite` file format (currently v5) **does not change**. MVCC state lives in memory and in the WAL.
- The WAL header gains a `clock_high_water: u64` field. Bump `WAL_FORMAT_VERSION` from 1 β†’ 2. Pre-v2 WALs open as v1 (clock starts from 0); the next checkpoint rewrites the header at v2.
- A new WAL frame kind for MVCC log records. Existing per-page commit frames keep their format. Both kinds can coexist in the same WAL β€” the checkpointer drains log records first, then per-page commits, then truncates.

This keeps `sqlrite-engine` v0.x readers compatible with v1.x files and vice versa as long as MVCC isn't enabled at write time.

---

## 8. Risks, open questions, and explicit non-goals

### Risks

- **Bottom-up B-tree rebuild ([Decision 6](design-decisions.md))** is the biggest architectural mismatch. Today every commit rebuilds the whole tree. With MVCC the rebuild only runs at checkpoint, so the cost is amortized β€” but tables that grow large between checkpoints can produce huge checkpoint stalls. Mitigation: in-place B-tree updates are an obvious follow-up, separately tracked.
- **Memory growth** β€” the version index holds full row payloads. Mitigation: column-level deltas (Turso's [#3498](https://github.com/tursodatabase/turso/issues/3498)) and an `mvcc_max_memory_bytes` cap.
- **Lock contention on the per-row chain** β€” `RwLock<Vec<RowVersion>>` is good enough for v0 but won't scale to extreme concurrency. Mitigation: wait-free chain (Turso's [#3499](https://github.com/tursodatabase/turso/issues/3499)) as a v1 follow-up.
- **DDL under MVCC** β€” schema changes need an exclusive promotion of the lock. v0 should reject DDL inside a `BEGIN CONCURRENT` block with a clean error.
- **Multi-process MVCC is out of scope.** Turso is intra-process; the in-memory version index has no cross-process visibility, and adding it would require a shared-memory coordination file (the same machinery SQLite's WAL uses for read marks). Multi-process writes continue to be serialized by the existing `flock(LOCK_EX)`. Documented as a non-goal until someone has a workload that justifies the engineering.

### Open questions

- **Should `journal_mode = mvcc` be per-connection or per-database?** Turso treats it per-connection; SQLite's `journal_mode` is per-database. Per-connection is more flexible (a maintenance connection can stay in WAL mode while app connections use MVCC), but per-database is closer to user expectation.
- **Should `MvStore` materialize the whole table on first touch, or one row at a time?** Per-row is cheaper and avoids Turso's eager-load complaint. The cost is one `pager.read` per cold row.
- **Does `BEGIN CONCURRENT` upgrade to exclusive on the first write to a versioned table that has no `MvStore` entry?** The simpler answer is no β€” first read materializes the row into the index, no lock upgrade needed.
- **AUTOINCREMENT.** Globally-incrementing rowid counters are not naturally versioned. Acceptable v0 answer: reject INSERT into an AUTOINCREMENT table from a `BEGIN CONCURRENT` block (matches Turso). Better answer: a serializable counter sequence service. Defer.
- **Hermitage-style anomaly tests.** Turso's [`hermitage_tests.rs`](https://github.com/tursodatabase/turso/blob/main/core/mvcc/database/hermitage_tests.rs) is a port of [Martin Kleppmann's hermitage suite](https://github.com/ept/hermitage). SQLRite should adopt the same tests as the snapshot-isolation conformance bar.

### Non-goals (v0)

- Concurrent reads of an in-progress checkpoint (Turso's `wal_checkpoint(TRUNCATE)` blocks; ours can too).
- Automatic backoff in the SDK retry helpers β€” the caller picks the policy.
- Cross-process MVCC.
- Asynchronous I/O / `io_uring`.

---

## 9. Estimated effort

Order-of-magnitude only β€” each sub-phase is a roadmap entry and gets a real estimate when scheduled.

| Sub-phase | Rough size | Notes |
|---|---|---|
| 10.1 multi-connection | M | mostly mechanical, tests heavy |
| 10.2 clock + tx registry | S | small, isolated |
| 10.3 `MvStore` reads | M | new module, lazy materialization |
| 10.4 `BEGIN CONCURRENT` writes | L | the meat |
| 10.5 checkpoint + recovery | M | reuses existing checkpointer |
| 10.6 GC | S–M | one synchronous + one background |
| 10.7 indexes under MVCC | L | deferred |
| 10.8 SDK propagation | M | one slice per SDK |
| 10.9 docs | S | one pass once 10.4 lands |

Total before 10.7: β‰ˆ 4–6 weeks of focused work for a single engineer, given the existing test discipline. Index maintenance under MVCC (10.7) is its own multi-week effort and may slip to a separate phase.

---

## 10. References

- [Turso concurrent writes β€” official docs](https://docs.turso.tech/tursodb/concurrent-writes)
- [Turso source β€” `core/mvcc/`](https://github.com/tursodatabase/turso/tree/main/core/mvcc)
- [Turso blog β€” Beyond the single-writer limitation](https://turso.tech/blog/beyond-the-single-writer-limitation-with-tursos-concurrent-writes)
- [Turso blog β€” v0.5.0 release](https://turso.tech/blog/turso-0.5.0)
- [Turso DeepWiki β€” MVCC architecture overview](https://deepwiki.com/tursodatabase/turso)
- Larson, Blanas, Diaconu, Freedman, Patel, Zwilling β€” [*High-Performance Concurrency Control Mechanisms for Main-Memory Databases*, VLDB 2011](https://www.microsoft.com/en-us/research/wp-content/uploads/2011/01/main-mem-cc-techreport.pdf) β€” the Hekaton MVCC paper Turso cites.
- Levandoski, Lomet, Sengupta β€” [*The Bw-Tree: A B-tree for New Hardware Platforms*, ICDE 2013](https://www.microsoft.com/en-us/research/publication/the-bw-tree-a-b-tree-for-new-hardware-platforms/) β€” referenced by Turso for the lock-free index path.
- [SQLite `BEGIN CONCURRENT` (experimental, page-level)](https://sqlite.org/cgi/src/doc/begin-concurrent/doc/begin_concurrent.md)
- Martin Kleppmann β€” [Hermitage anomaly test suite](https://github.com/ept/hermitage)
- SQLRite internals: [pager.md](pager.md), [file-format.md](file-format.md), [design-decisions.md](design-decisions.md), [roadmap.md](roadmap.md)