surrealkv 0.21.0

A low-level, versioned, embedded, ACID-compliant, key-value database for Rust
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
# [WIP] SurrealKV Architecture

SurrealKV is a versioned, embedded key-value database built on an LSM (Log-Structured Merge) tree architecture. It is designed specifically for use within SurrealDB, with the goal of reducing dependency on external storage engine (RocksDB). This approach allows the storage layer to evolve in alignment with SurrealDB’s requirements and access patterns.

This document provides a comprehensive overview of the internal design, data flow, and key subsystems.

## Table of Contents

1. [Overall Architecture]#overall-architecture
2. [Write Path]#write-path
3. [Read Path]#read-path
4. [Versioning and MVCC]#versioning-and-mvcc
5. [Compaction]#compaction
6. [Value Log (VLog)]#value-log-vlog
7. [VLog Garbage Collection]#vlog-garbage-collection
8. [Recovery]#recovery
9. [Checkpoint and Restore]#checkpoint-and-restore

---

## Overall Architecture

SurrealKV is organized into four main layers: the Client API, Core Components, Storage Layer, and Background Tasks.

### Data Flow

Data flows through the layers as follows:

1. **Writes**: Client -> Transaction (buffer) -> CommitPipeline -> WAL + MemTable -> Background Flush -> SSTables
2. **Reads**: Client -> Transaction -> Snapshot -> MemTable(s) -> SSTables -> (VLog if pointer)
3. **Compaction**: Background task merges SSTables, respecting snapshot visibility, updating VLog discard stats

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              CLIENT LAYER                                   │
│  ┌─────────────┐      ┌─────────────────────────────────────────────────┐   │
│  │    Tree     │──────│                  Transaction                    │   │
│  │    API      │      │  ┌─────────┐  ┌─────────┐  ┌─────────────────┐  │   │
│  └─────────────┘      │  │write_set│  │ get()   │  │ commit()        │  │   │
│                       │  │ buffer  │  │ set()   │  │ rollback()      │  │   │
│                       │  └─────────┘  └─────────┘  └─────────────────┘  │   │
│                       └─────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────────────┐
│                              CORE COMPONENTS                               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐    │
│  │   Oracle     │  │   Commit     │  │   Active     │  │  Immutable   │    │
│  │  (conflict   │  │   Pipeline   │  │  MemTable    │  │  MemTables   │    │
│  │  detection)  │  │ (lock-free)  │  │  (skiplist)  │  │   (queue)    │    │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘    │
│         │                 │                 │                 │            │
│         └────────────────-┼─────────────────┼─────────────────┘             │
│                          ▼                 ▼                               │
│                    ┌──────────────────────────────────────┐                │
│                    │           Snapshot                   │                │
│                    │   (point-in-time consistent view)    │                │
│                    └──────────────────────────────────────┘                │
└────────────────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────────────┐
│                              STORAGE LAYER                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐    │
│  │     WAL      │  │   SSTables   │  │  Value Log   │  │   Manifest   │    │
│  │  (durability)│  │   (L0..Ln)   │  │  (WiscKey)   │  │   (levels)   │    │
│  │  32KB blocks │  │  sorted runs │  │ large values │  │  table meta  │    │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘    │
└────────────────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────────────┐
│                            BACKGROUND TASKS                                │
│  ┌──────────────────┐  ┌───────────────────┐  ┌──────────────────┐         │
│  │  MemTable Flush  │  │ Level Compaction  │  │     VLog GC      │         │
│  │  (imm -> L0)     │  │ (merge SSTables)  │  │ (reclaim space)  │         │
│  └──────────────────┘  └───────────────────┘  └──────────────────┘         │
└────────────────────────────────────────────────────────────────────────────┘
```

### Layer Details

**Client Layer**: The `Tree` struct is the main entry point, providing methods to begin transactions, create checkpoints, and manage the database lifecycle. Each `Transaction` maintains its own write buffer (`write_set`) for staging changes before commit, enabling atomic multi-key updates and rollback support.

**Core Components**: These in-memory structures coordinate between the client API and persistent storage:
- The **Oracle** assigns monotonically increasing sequence numbers and detects write-write conflicts between concurrent transactions
- The **CommitPipeline** queues pending commits and ensures WAL writes are serialized while allowing concurrent memtable updates
- The **MemTable** is a lock-free skip list that holds recent writes before they are flushed to disk
- **Snapshots** capture the visible sequence number at transaction start, providing isolation from concurrent commits

**Storage Layer**: All durable state lives here:
- The **WAL** ensures committed data survives crashes by persisting entries before they become visible
- **SSTables** are immutable sorted files organized into levels, with Level 0 containing recently flushed data
- The **VLog** stores large values separately, reducing SSTable size and compaction cost
- The **Manifest** tracks which SSTables exist at each level and their key ranges

**Background Tasks**: These run asynchronously to maintain system health:
- **Flush** converts immutable memtables into Level 0 SSTables
- **Compaction** merges overlapping SSTables to reduce read amplification and reclaim space
- **VLog GC** removes stale entries from value log files when their discard ratio exceeds a threshold

### Component Overview

| Component | Purpose | Key Files |
|-----------|---------|-----------|
| **Tree** | Public API for database operations | `src/lsm.rs` |
| **Transaction** | ACID transaction with MVCC | `src/transaction.rs` |
| **Oracle** | Transaction timestamp and conflict detection | `src/oracle.rs` |
| **CommitPipeline** | Lock-free commit queue (inspired by Pebble) | `src/commit.rs` |
| **MemTable** | In-memory skip list for recent writes | `src/memtable/` |
| **Snapshot** | Point-in-time consistent view | `src/snapshot.rs` |
| **WAL** | Write-ahead log for durability | `src/wal/` |
| **SSTable** | Sorted string table on disk | `src/sstable/` |
| **VLog** | Value log for large value separation | `src/vlog.rs` |
| **Manifest** | Level metadata and SSTable tracking | `src/levels/` |

### Directory Structure

```
database_path/
├── wal/                    # Write-ahead log segments
│   ├── 00000000000000000001.wal
│   └── 00000000000000000002.wal
├── sstables/               # SSTable files
│   ├── 00000000000000000001.sst
│   └── 00000000000000000002.sst
├── manifest/               # Level manifest files
│   └── 00000000000000000001.manifest
├── vlog/                   # Value log files (if enabled)
│   ├── 00000000000000000001.vlog
│   └── 00000000000000000002.vlog
├── versioned_index/        # B+tree index (if enabled)
└── LOCK                    # Lock file
```

---

## Write Path

This section covers how data flows from a transaction's `set()` call through to durable storage.

### Overview

A write operation proceeds through three main phases:

1. **Transaction Phase**: The application buffers writes in the transaction's local `write_set`. No locks are held, no disk I/O occurs, and other transactions cannot see these changes.

2. **Commit Phase**: When the application calls `commit()`, the transaction coordinates with the Oracle for conflict detection, assigns sequence numbers, and enters the commit pipeline.

3. **Background Phase**: Asynchronously, full memtables are rotated and flushed to Level 0 SSTables.

The design minimizes contention during concurrent writes. The WAL write is kept as short as possible, while memtable insertion and visibility publishing can proceed in parallel.

```
┌────────────────────────────────────────────────────────────────────────────┐
│                              TRANSACTION PHASE                             │
│                                                                            │
│  txn.set(key, value)  ──►  Buffer in write_set  ──►  txn.commit()          │
│                            (BTreeMap)                                      │
└────────────────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────────────┐
│                                 COMMIT PHASE                               │
│                                                                            │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐         │
│  │     Oracle      │───►│  Assign SeqNum  │───►│  Create Batch   │         │
│  │ (conflict check)│    │  (atomic inc)   │    │  (entries)      │         │
│  └─────────────────┘    └─────────────────┘    └─────────────────┘         │
└────────────────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────────────┐
│                         COMMIT PIPELINE (Serialized)                       │
│                                                                            │
│  ┌─────────────────┐    ┌─────────────────┐                                │
│  │ Acquire Write   │───►│   Write to WAL  │                                │
│  │     Mutex       │    │  (durability)   │                                │
│  └─────────────────┘    └─────────────────┘                                │
│                                  │                                         │
│                                  ▼                                         │
│                    ┌─────────────────────────────┐                         │
│                    │  Value Size > Threshold?    │                         │
│                    └─────────────────────────────┘                         │
│                         │                  │                               │
│                    Yes  ▼                  ▼  No                           │
│            ┌─────────────────┐    ┌─────────────────┐                      │
│            │ Append to VLog  │    │  Store Inline   │                      │
│            │ Create Pointer  │    │  (in SSTable)   │                      │
│            └─────────────────┘    └─────────────────┘                      │
│                         │                  │                               │
│                         └────────┬─────────┘                               │
│                                  ▼                                         │
│                    ┌─────────────────────────────┐                         │
│                    │   Apply to Active MemTable  │                         │
│                    │       (skip list insert)    │                         │
│                    └─────────────────────────────┘                         │
│                                  │                                         │
│                                  ▼                                         │
│                    ┌─────────────────────────────┐                         │
│                    │  Publish Visibility         │                         │
│                    │  (update visible_seq_num)   │                         │
│                    └─────────────────────────────┘                         │
└────────────────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────────────┐
│                               BACKGROUND TASKS                             │
│                                                                            │
│  ┌─────────────────────────┐    ┌─────────────────────────────────────┐    │
│  │  MemTable Full?         │───►│  Rotate: Active → Immutable Queue   │    │
│  │  (size > threshold)     │Yes │  Create new empty Active MemTable   │    │
│  └─────────────────────────┘    └─────────────────────────────────────┘    │
│                                              │                             │
│                                              ▼                             │
│                                 ┌─────────────────────────────────────┐    │
│                                 │  Flush Immutable to L0 SSTable      │    │
│                                 │  (background task, async)           │    │
│                                 └─────────────────────────────────────┘    │
└────────────────────────────────────────────────────────────────────────────┘
```

### Key Implementation Details

**Transaction Buffering**

All writes within a transaction are buffered in a `BTreeMap<Key, Vec<Entry>>` called `write_set`. The BTreeMap maintains keys in sorted order, which enables efficient range scans during read-your-own-writes checks. Multiple writes to the same key within a transaction are stored as a vector, preserving the write order.

This buffering approach provides:
- **Read-your-own-writes (RYOW)**: Uncommitted changes are visible to subsequent reads within the same transaction
- **Savepoint support**: The write set can be partially rolled back to a previous savepoint
- **Atomic commit**: All buffered writes are applied together or not at all

**Commit Pipeline**

The commit pipeline is inspired by [Pebble's commit design](https://github.com/cockroachdb/pebble/blob/master/docs/rocksdb.md#commit-pipeline). The key insight is that while WAL writes must be serialized (to maintain a consistent log order), other commit steps can proceed concurrently.

The pipeline operates in three stages:
1. **Write**: Acquire the write mutex, append entries to WAL, release mutex
2. **Apply**: Insert entries into the active memtable (lock-free skip list)
3. **Publish**: Update `visible_seq_num` to make entries visible to readers

A semaphore limits concurrent commits (default: 8) to prevent memory exhaustion from too many in-flight transactions.

**Value Separation (WiscKey)**

When the Value Log is enabled, the commit path checks each value's size against `vlog_value_threshold` (default: 1KB):

- **Large values** (> threshold): Written to VLog file, a 25-byte `ValuePointer` is stored in the memtable/SSTable
- **Small values** (<= threshold): Stored inline in the memtable/SSTable

This trade-off reduces write amplification during compaction (fewer bytes to merge) at the cost of an extra read for large values. The threshold can be tuned based on workload characteristics.

**Durability Modes**

SurrealKV offers two durability levels:
- **Eventual** (default): Data is written to the OS buffer cache but not fsynced. This is fast but data may be lost if the system crashes before the OS flushes to disk.
- **Immediate**: Each commit calls fsync before returning. This guarantees durability but reduces throughput.

**Sequence Numbers**

Every entry receives a monotonically increasing 56-bit sequence number (the upper 56 bits of the 64-bit trailer). Sequence numbers serve multiple purposes:
- **MVCC visibility**: Readers with snapshot seq=N see only entries with seq <= N
- **Ordering**: During compaction, newer entries supersede older ones with the same key
- **Conflict detection**: The Oracle tracks which keys were written at which sequence numbers

**MemTable Rotation**

When the active memtable exceeds `max_memtable_size`, it is rotated:
1. The current memtable is marked immutable and added to a flush queue
2. A new empty memtable becomes the active one
3. A background task asynchronously flushes immutable memtables to Level 0 SSTables

This rotation is transparent to writers and ensures the active memtable never grows unbounded.

---

## Read Path

The read path describes how data is retrieved from a transaction's `get()` call, traversing all storage layers. Understanding the read path helps explain SurrealKV's performance characteristics and the trade-offs involved in LSM-tree designs.

### Overview

In an LSM tree, reads must check multiple locations because the same key may exist in several places simultaneously:

- The transaction's uncommitted write buffer
- The active memtable (most recent committed writes)
- One or more immutable memtables (pending flush)
- Level 0 SSTables (recently flushed, may overlap)
- Level 1+ SSTables (compacted, non-overlapping)

The search proceeds from newest to oldest, stopping at the first match. This ensures that the most recent version of a key is always returned.

### Read Amplification

**Read amplification** measures how many locations must be checked to find a key. In the worst case (key doesn't exist), a read may check all levels. LSM trees trade write performance for read performance: writes are fast (sequential appends) but reads may require multiple lookups.

SurrealKV minimizes read amplification through:
- **Bloom filters**: Quickly eliminate SSTables that definitely don't contain a key
- **Block cache**: Keep frequently accessed data blocks in memory
- **Non-overlapping levels**: L1+ SSTables don't overlap, enabling binary search

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                           TRANSACTION READ                                  │
│                                                                             │
│  txn.get(key)  ──►  Check write_set (RYOW)  ──►  Found? Return immediately  │
│                            │                                                │
│                            ▼ Not in write_set                               │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│                           SNAPSHOT READ                                     │
│                                                                             │
│  Create Snapshot (seq_num = visible_seq_num at txn start)                   │
│                            │                                                │
│                            ▼                                                │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │ Search Order (stop on first match with seq_num <= snapshot.seq_num): │   │
│  │                                                                      │   │
│  │   1. Active MemTable (newest writes, in-memory skiplist)             │   │
│  │              │                                                       │   │
│  │              ▼                                                       │   │
│  │   2. Immutable MemTables (newest first, pending flush)               │   │
│  │              │                                                       │   │
│  │              ▼                                                       │   │
│  │   3. Level 0 SSTables (ALL tables, may overlap)                      │   │
│  │              │                                                       │   │
│  │              ▼                                                       │   │
│  │   4. Level 1...N SSTables (binary search, non-overlapping)           │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────-───┐
│                         SSTABLE LOOKUP DETAIL                               |
│                                                                             │
│  For each SSTable candidate:                                                │
│                                                                             │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐          │
│  │  Bloom Filter   │───►│  Index Block    │───►│  Data Block     │          │
│  │  (may_contain?) │ Yes│  (binary search)│    │  (with cache)   │          │
│  └─────────────────┘    └─────────────────┘    └─────────────────┘          │
│         │ No                                                                │
│         ▼                                                                   │
│    Skip to next SSTable                                                     │
└─────────────────────────────────────────────────────────────────────────-───┘
┌─────────────────────────────────────────────────────────────────────────────┐
│                         VALUE RESOLUTION                                    │
│                                                                             │
│                    ┌─────────────────────────────-┐                         │
│                    │  Is ValueLocation a Pointer? │                         │
│                    └─────────────────────────────-┘                         │
│                         │                  │                                │
│                    Yes  ▼                  ▼  No                            │
│            ┌────────────────-─┐    ┌─────────────────┐                      │
│            │  Read from VLog  │    │ Return Inline   │                      │
│            │  (file_id,offset)│    │    Value        │                      │
│            └─────────────────-┘    └─────────────────┘                      │
│                         │                  │                                │
│                         └────────┬─────────┘                                │
│                                  ▼                                          │
│                         Return Value to Caller                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Key Implementation Details

**RYOW (Read Your Own Writes)**

Before checking persistent storage, a read first consults the transaction's local `write_set` buffer. This ensures uncommitted changes are visible to the transaction that made them, which is needed for multi-statement transactions that read data they just wrote.

**Snapshot Isolation**

When a transaction begins, it captures the current `visible_seq_num` as its snapshot. All reads within that transaction see a consistent view of the database as of that moment:
- Entries with `seq_num <= snapshot.seq_num` are visible
- Entries with `seq_num > snapshot.seq_num` are invisible (committed after the transaction started)

This provides **repeatable reads**: the same query will return the same results throughout the transaction, regardless of concurrent commits.

**Level 0 vs Level 1+**

Level 0 is special because it contains recently flushed memtables whose key ranges may overlap. When searching L0, **all** SSTables must be checked because any of them might contain the target key.

Levels 1 and above maintain a **sorted, non-overlapping** invariant: each SSTable covers a distinct key range. This enables binary search to find the single SSTable that might contain a key, reducing read amplification for deeper levels.

**Bloom Filters**

Each SSTable includes a Bloom filter, a probabilistic data structure that can definitively say "key is NOT in this table" or "key MIGHT be in this table." With a 1% false positive rate (tunable), Bloom filters eliminate ~99% of unnecessary SSTable reads.

The filter is loaded into memory when the SSTable is opened, so checking it is extremely fast compared to reading from disk.

**Block Cache**

SSTable data is organized into blocks (default: 4KB). When a block is read from disk, it is cached in an LRU (Least Recently Used) cache. Subsequent reads for keys in the same block are served from memory.

The cache key is `(file_id, block_offset)`, ensuring that blocks from different SSTables don't collide.

**VLog Resolution**

When value separation is enabled, large values are not stored directly in SSTables. Instead, a `ValuePointer` (25 bytes) indicates where to find the actual value in the VLog.

The read path checks the `BIT_VALUE_POINTER` flag in the value's metadata byte:
- If set: Read the value from VLog at the specified `(file_id, offset)`
- If not set: Return the inline value directly

This extra indirection adds latency for large value reads but reduces SSTable size and compaction cost.

---

## Versioning and MVCC

SurrealKV implements Multi-Version Concurrency Control (MVCC) with snapshot isolation for concurrent reads and writes. MVCC stores multiple versions of each key, allowing readers and writers to operate concurrently without blocking each other.

### Why MVCC?

MVCC provides the following:

1. **Non-blocking reads**: Readers never block writers, and writers never block readers. Each transaction sees a consistent snapshot.

2. **Time-travel queries**: Applications can query historical data at specific timestamps, enabling auditing, debugging, and undo functionality.

3. **Optimistic concurrency**: Transactions proceed without locks and check for conflicts at commit time, which works well for read-heavy workloads.

4. **Isolation guarantees**: Each transaction sees a consistent view of the database, unaffected by concurrent modifications.

### InternalKey Structure

Every key stored in SurrealKV is an `InternalKey` that combines the user-provided key with version metadata. This encoding is central to how MVCC works:

```
┌──────────────────┬───────────────────┬──────────────────┐
│    user_key      │    trailer (8B)   │  timestamp (8B)  │
│   (variable)     │  seq_num | kind   │   nanoseconds    │
└──────────────────┴───────────────────┴──────────────────┘

trailer = (seq_num << 8) | kind

kind values:
  0 = Delete (hard delete, tombstone)
  1 = SoftDelete (versioned delete)
  2 = Set (normal write)
  6 = Replace (replaces all previous versions)
```

The **trailer** packs the sequence number and operation kind into 8 bytes. The sequence number occupies the upper 56 bits, and the kind occupies the lower 8 bits. This encoding allows efficient comparison: keys are sorted first by user key, then by sequence number (descending), so the newest version appears first.

The **timestamp** (optional, enabled with versioning) stores the application-provided timestamp in nanoseconds. This enables point-in-time queries using actual wall-clock times rather than internal sequence numbers.

**Operation Kinds Explained:**

- **Delete (0)**: A hard delete that removes the key permanently. During compaction at the bottom level, this tombstone can be dropped.
- **SoftDelete (1)**: A versioned delete that preserves history. The key appears deleted to current readers but historical queries can still see previous versions.
- **Set (2)**: A normal write that creates a new version of the key.
- **Replace (6)**: A special write that supersedes all previous versions, allowing aggressive garbage collection.

### Snapshot Isolation

Snapshot isolation ensures that each transaction sees a consistent view of the database. When a transaction starts, it captures the current `visible_seq_num`. All operations within that transaction see exactly the state as of that moment.

```
Timeline:
─────────────────────────────────────────────────────────────►
     │           │           │           │
  seq=1       seq=2       seq=3       seq=4
  Set(A,1)    Set(A,2)    Set(A,3)    Delete(A)
     │           │           │           │
     │           │           │           │
     ▼           ▼           ▼           ▼
┌─────────────────────────────────────────────────────────┐
│ Snapshot at seq=2 sees: A=2                             │
│ Snapshot at seq=3 sees: A=3                             │
│ Snapshot at seq=4 sees: A deleted (tombstone)           │
└─────────────────────────────────────────────────────────┘
```

**How visibility works:**

When reading key A with snapshot seq=2:
1. Find all versions of A: (seq=4, Delete), (seq=3, Set), (seq=2, Set), (seq=1, Set)
2. Filter to versions with seq <= 2: (seq=2, Set), (seq=1, Set)
3. Return the newest visible version: seq=2, value=2

### The Oracle

The Oracle is responsible for:
- **Sequence number allocation**: Atomically incrementing the global sequence counter for each commit
- **Conflict detection**: Tracking which keys were written at which sequence numbers to detect write-write conflicts

When two transactions both modify the same key, the Oracle ensures only one can commit. The first to call commit wins; the second receives a conflict error and must retry.

### Version Retention

By default, compaction drops old versions as soon as they are no longer visible to any active snapshot. When versioning is enabled (`with_versioning(true, retention_ns)`), the behavior changes:

- **All versions are preserved** during compaction, not just the latest
- **Retention period**: Versions older than `retention_ns` nanoseconds may be garbage collected
- **Unlimited retention**: Setting `retention_ns = 0` keeps all versions forever

This enables time-travel queries where applications can read the database state at any historical timestamp within the retention window.

### Versioned Storage Architecture

When versioning is enabled, SurrealKV automatically configures the storage layer to support efficient historical queries. The key insight is that all values are stored in the VLog, and indexes (both LSM and optional B+tree) store only value pointers. This architecture enables multiple index structures on top of a single value store.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                     VERSIONED STORAGE ARCHITECTURE                          │
│                                                                             │
│  When versioning is enabled, all values are stored in VLog.                 │
│  Indexes store only value pointers, enabling multiple index structures.     │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                        INDEX LAYER                                  │    │
│  │                                                                     │    │
│  │   ┌───────────────────┐         ┌───────────────────┐               │    │
│  │   │  LSM Tree Index   │         │  B+Tree Index     │               │    │
│  │   │  (always present) │         │  (optional)       │               │    │
│  │   │                   │         │                   │               │    │
│  │   │  InternalKey ──►  │         │  InternalKey ──►  │               │    │
│  │   │  ValuePointer     │         │  ValuePointer     │               │    │
│  │   └─────────┬─────────┘         └─────────┬─────────┘               │    │
│  │             │                             │                         │    │
│  │             └──────────────┬──────────────┘                         │    │
│  │                            │                                        │    │
│  └────────────────────────────┼────────────────────────────────────────┘    │
│                               │                                             │
│                               ▼                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                        VALUE LOG (VLog)                             │    │
│  │                                                                     │    │
│  │   All versioned entries stored here (vlog_value_threshold = 0)      │    │
│  │                                                                     │    │
│  │   ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐                   │    │
│  │   │ Entry 1 │ │ Entry 2 │ │ Entry 3 │ │ Entry 4 │ ...               │    │
│  │   │ v1 of A │ │ v2 of A │ │ v1 of B │ │ v3 of A │                   │    │
│  │   └─────────┘ └─────────┘ └─────────┘ └─────────┘                   │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘
```

**Configuration:**

When `with_versioning(true, retention_ns)` is called:
- VLog is automatically enabled
- `vlog_value_threshold` is set to 0, so all values go to VLog
- The LSM tree stores `(InternalKey → ValuePointer)` instead of inline values

**Two Query Modes:**

| Mode | Configuration | Query Method |
|------|---------------|--------------|
| LSM-only | `with_versioning(true, retention_ns)` | K-way merge across all LSM components |
| B+tree indexed | + `with_versioned_index(true)` | Direct B+tree lookup |

**LSM-based Versioned Queries:**

When using LSM-only versioning, the `history()` API uses a `KMergeIterator` that gathers iterators from:
- Active memtable
- All immutable memtables (pending flush)
- All SSTable levels (L0, L1, L2, ...)

These are merged using a k-way merge that returns entries sorted by (user_key, seq_num descending). Finding all versions of a key requires scanning all levels.

**B+tree Versioned Queries:**

When `with_versioned_index(true)` is also enabled, versioned queries go directly to the B+tree which stores all `(InternalKey → ValuePointer)` entries. The B+tree keeps entries sorted, so all versions of a key are contiguous and can be retrieved with a single seek.

The B+tree implementation uses a disk-based page management approach, with the handling of large entries drawing inspiration from SQLite's overflow page design.

**Why Use B+tree Index:**

The motivation for the optional B+tree index:
- LSM stores keys across multiple SSTables; finding all versions requires checking every level
- B+tree consolidates all versions in a single sorted structure
- Single index lookup vs. merging N sources (where N = memtables + SSTables)

**B+tree Trade-offs:**

| Pros | Cons |
|------|------|
| Single index lookup instead of k-way merge | Slow insertion (in-place modified tree on disk) |
| Fast point and range queries for history | Insert performance slows during LSM writes |
| All versions of a key are contiguous | Every LSM write also updates B+tree |

**Timestamp Ordering Limitation:**

When using LSM-only versioning (without B+tree index), timestamps inserted "back in time" will not be read correctly. This occurs because:
- LSM orders entries by `(user_key ASC, seq_num DESC)`
- Point-in-time queries (`get_at`) find the first entry with `seq_num <= snapshot_seq` where `timestamp <= query_timestamp`
- A later-inserted entry with an earlier timestamp will have a higher sequence number, causing it to be returned instead of the correct historical value

To support out-of-order timestamp inserts, enable the B+tree index with `with_versioned_index(true)`. The B+tree stores entries sorted by `(user_key, timestamp)` and supports in-place updates, correctly handling historical data insertion.

**Read-after-Write Consistency:**

Currently, the B+tree index is updated synchronously during LSM writes, providing read-after-write consistency for versioned queries. A recently written version is immediately visible via the `history()` API.

**Future Work:**

Async indexing could improve write performance by decoupling B+tree updates from the LSM write path. However, this would require relaxing the read-after-write consistency guarantee (recently written versions may not be immediately visible via B+tree queries). This is a potential optimization for future versions.

---

## Compaction

Compaction is a background process that merges SSTables to reduce read amplification, reclaim space from obsolete versions, and maintain the leveled structure of the LSM tree.

### Why Compaction is Necessary

Without compaction, an LSM tree would degrade over time:
- **Read amplification grows**: More SSTables means more files to check for each read
- **Space amplification grows**: Obsolete versions and tombstones accumulate
- **Level 0 bloats**: Overlapping SSTables in L0 hurt read performance

Compaction addresses these issues by:
1. Merging overlapping SSTables into non-overlapping ones
2. Dropping obsolete versions that are no longer visible
3. Removing tombstones at the bottom level
4. Maintaining size ratios between levels

### Leveled Compaction Strategy

SurrealKV uses **leveled compaction**, similar to RocksDB's default strategy. Each level has a target size, with each level being approximately 10x larger than the previous:

- **Level 0**: ~64MB (4 files × 16MB each, overlapping allowed)
- **Level 1**: ~256MB (non-overlapping)
- **Level 2**: ~2.5GB (non-overlapping)
- **Level N**: 10x Level N-1

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         LSM TREE STRUCTURE                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  LEVEL 0 (Overlapping - recently flushed memtables)                         │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐                                        │
│  │ SST A   │ │ SST B   │ │ SST C   │  ← May have overlapping key ranges     │
│  │ a───z   │ │ d───m   │ │ b───k   │                                        │
│  └─────────┘ └─────────┘ └─────────┘                                        │
│       │           │           │                                             │
│       └───────────┼───────────┘                                             │
│                   ▼                                                         │
│  LEVEL 1 (Non-overlapping, sorted)                                          │
│  ┌─────────┬─────────┬─────────┐                                            │
│  │ SST 1   │ SST 2   │ SST 3   │  ← Non-overlapping, can binary search      │
│  │ a───f   │ g───m   │ n───z   │                                            │
│  └─────────┴─────────┴─────────┘                                            │
│                   │                                                         │
│                   ▼                                                         │
│  LEVEL 2 (Non-overlapping, 10x larger)                                      │
│  ┌──────┬──────┬──────┬──────┐                                              │
│  │SST 1 │SST 2 │SST 3 │SST 4 │                                              │
│  │ a─c  │ d─h  │ i─o  │ p─z  │                                              │
│  └──────┴──────┴──────┴──────┘                                              │
│                   │                                                         │
│                   ▼                                                         │
│  LEVEL N (Bottom level)                                                     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         COMPACTION TRIGGER                                  │
│                                                                             │
│  Calculate scores for each level:                                           │
│    L0 score = max(file_count / 4, total_bytes / max_bytes)                  │
│    Ln score = level_bytes / target_bytes_for_level                          │
│                                                                             │
│  Pick level with highest score >= 1.0                                       │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│                          FILE SELECTION                                     │
│                                                                             │
│  1. Select file by priority:                                                │
│     - ByCompensatedSize: file_size * (1 + delete_ratio * 0.5)               │
│     - OldestSmallestSeqFirst: oldest sequence number                        │
│                                                                             │
│  2. Expand to include all files sharing boundary keys                       │
│                                                                             │
│  3. Select overlapping files from target level                              │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│                      COMPACTION ITERATOR                                    │
│                                                                             │
│  For each key in merged iterator:                                           │
│                                                                             │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                   SNAPSHOT-AWARE DECISION                             │  │
│  │                                                                       │  │
│  │  1. Find earliest snapshot that can see this version                  │  │
│  │                                                                       │  │
│  │  2. If same visibility boundary as newer version → DROP (superseded)  │  │
│  │                                                                       │  │
│  │  3. If visible to any active snapshot → KEEP                          │  │
│  │                                                                       │  │
│  │  4. If versioning enabled and within retention → KEEP                 │  │
│  │                                                                       │  │
│  │  5. Tombstone at bottom level with no snapshots → DROP                │  │
│  │                                                                       │  │
│  │  6. Otherwise → DROP                                                  │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  Track VLog discard stats for garbage collection                            │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│                           FINISH                                            │
│                                                                             │
│  1. Write new SSTable to disk                                               │
│  2. Update manifest atomically (add new, remove old)                        │
│  3. Delete old SSTable files                                                │
│  4. Flush VLog discard stats                                                │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Compaction Scoring

Each level receives a compaction score that determines priority:

**Level 0 Score:**
```
L0_score = max(file_count / L0_compaction_trigger, total_bytes / max_bytes)
```
L0 is special because overlapping files hurt read performance. The score considers both file count and total size.

**Level N Score (N > 0):**
```
Ln_score = level_bytes / target_bytes_for_level
```
Higher levels are compacted when they exceed their target size.

The level with the highest score >= 1.0 is selected for compaction.

### File Selection Algorithms

When a level is selected for compaction, individual files are chosen based on priority:

- **ByCompensatedSize**: Prioritizes files with high delete ratios: `score = file_size * (1 + delete_ratio * 0.5)`. This helps reclaim space from files with many tombstones.
- **OldestSmallestSeqFirst**: Prioritizes files with the oldest sequence numbers, which helps propagate old data to deeper levels.

After selecting the initial file, the compaction expands to include:
1. All files in the source level that share boundary keys (to avoid key range gaps)
2. All overlapping files in the target level

### Write Amplification

**Write amplification** measures how many times data is written to disk over its lifetime. In leveled compaction:
- Each entry is written once per level it passes through
- With a 10x size ratio and N levels, write amplification is approximately N
- Typical values: 10-30x for write-heavy workloads

SurrealKV's value separation (VLog) reduces write amplification by keeping large values out of the compaction process entirely.

---

## Value Log (VLog)

SurrealKV implements WiscKey-style value separation to reduce write amplification for large values. This design is based on the [WiscKey paper](https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf) from FAST '16.

### The Write Amplification Problem

In a traditional LSM tree, values are stored alongside keys in SSTables. During compaction, entire key-value pairs are read, merged, and rewritten. For large values, this leads to significant write amplification:

- A 1MB value written once may be rewritten 10-30 times as it moves through levels
- SSD lifespan is reduced by excessive writes
- Compaction becomes I/O bound on value size rather than key count

### WiscKey Solution

WiscKey separates keys from values:
- **Keys** (typically small) remain in the LSM tree and participate in compaction
- **Values** (potentially large) are stored separately in an append-only log

This separation provides:
- **Reduced write amplification**: Only keys are compacted, not values
- **Faster compaction**: Less data to read and write
- **Better SSD utilization**: Sequential writes to VLog, random reads amortized by SSD parallelism

**Trade-offs:**
- **Read overhead**: Large value reads require an extra I/O to the VLog
- **Garbage collection**: Stale VLog entries must be reclaimed separately
- **Complexity**: Additional component to manage

### When to Enable VLog

VLog is beneficial when:
- Values are large (> 1KB on average)
- Write volume is high
- SSD longevity is a concern

VLog may hurt performance when:
- Values are small (overhead of pointer > benefit)
- Workload is read-heavy (extra VLog I/O per read)
- Random access patterns (VLog reads are sequential within a file)

### Value Separation Decision

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                      VALUE SEPARATION DECISION                              │
│                                                                             │
│  During commit, for each value:                                             │
│                                                                             │
│                    ┌─────────────────────────────┐                          │
│                    │  value.len() > threshold?   │                          │
│                    │  (default: 1KB)             │                          │
│                    └─────────────────────────────┘                          │
│                         │                  │                                │
│                    Yes  ▼                  ▼  No                            │
│            ┌─────────────────┐    ┌─────────────────┐                       │
│            │  Store in VLog  │    │  Store Inline   │                       │
│            │                 │    │                 │                       │
│            │ ┌─────────────┐ │    │ ┌─────────────┐ │                       │
│            │ │ValuePointer │ │    │ │ Raw Value   │ │                       │
│            │ │ (25 bytes)  │ │    │ │ (variable)  │ │                       │
│            │ │ in SSTable  │ │    │ │ in SSTable  │ │                       │
│            │ └─────────────┘ │    │ └─────────────┘ │                       │
│            └─────────────────┘    └─────────────────┘                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

### ValuePointer Structure

```
┌─────────────────────────────────────────────────────────────────┐
│                    ValuePointer (25 bytes)                       │
├─────────┬──────────┬──────────┬────────────┬────────────┬───────┤
│ version │ file_id  │  offset  │  key_size  │ value_size │  crc  │
│  (1B)   │  (4B)    │  (8B)    │   (4B)     │   (4B)     │ (4B)  │
└─────────┴──────────┴──────────┴────────────┴────────────┴───────┘
```

### ValueLocation Encoding

```
┌─────────────────────────────────────────────────────────────────┐
│                      ValueLocation                              │
├─────────┬─────────┬─────────────────────────────────────────────┤
│  meta   │ version │               value                         │
│  (1B)   │  (1B)   │   (inline bytes OR ValuePointer)            │
└─────────┴─────────┴─────────────────────────────────────────────┘

meta bit 0 = BIT_VALUE_POINTER
  If set: value contains ValuePointer (25 bytes)
  If not set: value contains inline bytes
```

### VLog File Format

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                          VLOG FILE FORMAT                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │                      FILE HEADER (31 bytes)                        │     │
│  ├─────────┬─────────┬─────────┬────────────┬────────────┬────────────┤     │
│  │  magic  │ version │ file_id │ created_at │  max_size  │ compressed │     │
│  │  (4B)   │  (2B)   │  (4B)   │   (8B)     │    (8B)    │    (1B)    │     │
│  │ "VLOG"  │         │         │            │            │            │     │
│  └─────────┴─────────┴─────────┴────────────┴────────────┴────────────┘     │
│                                                                             │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │                         ENTRY 1                                    │     │
│  ├───────────┬────────────┬──────────────┬──────────────┬─────────────┤     │
│  │  key_len  │ value_len  │     key      │    value     │    crc32    │     │
│  │   (4B)    │   (4B)     │  (variable)  │  (variable)  │    (4B)     │     │
│  └───────────┴────────────┴──────────────┴──────────────┴─────────────┘     │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                         ENTRY 2                                     │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                ...                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                         ENTRY N                                     │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Entry total size = 8 + key_len + value_len + 4 bytes
File rotation when size >= vlog_max_file_size (default: 256MB)
```

### VLog Configuration

Key configuration options:

| Option | Default | Description |
|--------|---------|-------------|
| `enable_vlog` | false | Enable value separation |
| `vlog_value_threshold` | 1KB | Values larger than this go to VLog |
| `vlog_max_file_size` | 256MB | Maximum size before rotating to new file |
| `vlog_gc_discard_ratio` | 0.5 | Trigger GC when 50% of file is garbage |

### File Management

VLog files are managed as follows:

1. **Active file**: New values are appended to the current active file
2. **Rotation**: When the active file reaches `vlog_max_file_size`, it is closed and a new file is created
3. **Reference counting**: Each VLog file maintains a reference count of active iterators to prevent deletion during reads
4. **Garbage collection**: Files with high discard ratios are compacted to reclaim space (see next section)

---

## VLog Garbage Collection

VLog files accumulate stale entries as values are overwritten or deleted.

### The Garbage Problem

When a key is updated or deleted:
1. The new value is written to a new VLog file (values are append-only)
2. The LSM tree's pointer is updated to reference the new location
3. The old VLog entry becomes unreferenced

Over time, old VLog files may contain only stale entries that are no longer referenced by any SSTable. These files can be safely deleted to reclaim disk space.

### How Garbage is Tracked

The key insight is that a VLog file is safe to delete when **no live SSTable references it**. SurrealKV tracks this using the `oldest_vlog_file_id` metadata in each SSTable:

**During SSTable Creation:**

When writing an SSTable (during flush or compaction), the `TableWriter` tracks the minimum VLog file ID referenced by any value pointer in that SSTable. This is stored in the SSTable's properties as `oldest_vlog_file_id`.

**Computing the Global Minimum:**

The `LevelManifest.min_oldest_vlog_file_id()` method computes the minimum `oldest_vlog_file_id` across all live SSTables:

```rust
// Pseudocode
min_oldest_vlog = manifest
    .iter()
    .filter(|sst| sst.oldest_vlog_file_id > 0)
    .map(|sst| sst.oldest_vlog_file_id)
    .min()
    .unwrap_or(0)
```

Any VLog file with `file_id < min_oldest_vlog` is safe to delete because no SSTable references it.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                          SST METADATA TRACKING                              │
│                                                                             │
│  During SSTable creation (flush or compaction):                             │
│                                                                             │
│  For each key-value pair:                                                   │
│    if value is a VLog pointer:                                              │
│      min_vlog_file_id = min(min_vlog_file_id, pointer.file_id)              │
│                                                                             │
│  Store min_vlog_file_id as oldest_vlog_file_id in SST properties            │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│                    AFTER FLUSH OR COMPACTION                                │
│                                                                             │
│  1. Compute global minimum:                                                 │
│     min_oldest_vlog = manifest.min_oldest_vlog_file_id()                    │
│                                                                             │
│  2. If min_oldest_vlog == 0: No SSTs reference VLog, skip cleanup           │
│                                                                             │
│  3. Otherwise: Delete VLog files with file_id < min_oldest_vlog             │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│                              ITERATOR SAFETY                                │
│                                                                             │
│  Before deleting a VLog file, check iterator_count:                         │
│    - If iterator_count > 0: Defer deletion (readers may access file)        │
│    - If iterator_count == 0: Safe to delete immediately                     │
│                                                                             │
│  This prevents deleting files that in-flight reads may access.              │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│                         VERSIONED INDEX CLEANUP                             │
│                                                                             │
│  If versioned_index (B+tree) is enabled:                                    │
│    - Scan B+tree entries for stale VLog pointers                            │
│    - Remove entries where pointer.file_id < min_oldest_vlog                 │
│    - Uses read-scan + batched-write-delete for concurrency safety           │
└─────────────────────────────────────────────────────────────────────────────┘
```

### GC Components

| Component | Purpose | Location |
|-----------|---------|----------|
| **oldest_vlog_file_id** | SST metadata tracking oldest VLog file referenced | SST properties |
| **min_oldest_vlog_file_id()** | Computes global minimum across all SSTs | `LevelManifest` method |
| **iterator_count** | Safety counter for active readers | `VLog` struct |
| **cleanup_stale_versioned_index()** | Removes stale B+tree entries | `CoreInner` method |

### GC Trigger Points

VLog garbage collection runs automatically at two points:

1. **After Flush**: When a memtable is flushed to an L0 SSTable
2. **After Compaction**: When SSTables are merged during compaction

Both paths call `vlog.cleanup_obsolete_files(min_oldest_vlog)` after successfully updating the manifest.

### Algorithm Details

**Why Global Minimum Works:**

The global minimum approach is correct because:
- VLog file IDs are monotonically increasing
- Each SSTable records the oldest VLog file it references
- If `min_oldest_vlog = N`, then all live SSTables reference only VLog files >= N
- Therefore, VLog files < N have no live references and can be deleted

**Iterator Safety:**

VLog files cannot be deleted while readers might access them. The `iterator_count` provides a safety net:
- Incremented when a snapshot/iterator is created
- Decremented when the snapshot/iterator is dropped
- If `iterator_count > 0`, deletion is deferred to startup cleanup

This prevents the race condition where GC deletes a file that an in-flight read is about to access.

**Versioned Index Cleanup:**

When the optional B+tree versioned index is enabled, it stores `(InternalKey → ValuePointer)` entries. After VLog files are deleted, stale entries pointing to those files must be removed:

1. **Read Phase**: Scan the B+tree under a read lock, collecting keys with `pointer.file_id < min_oldest_vlog`
2. **Delete Phase**: Remove collected keys in batches under brief write locks

This two-phase approach avoids holding locks during the full scan, allowing concurrent writes.

---

## Recovery

When SurrealKV starts, it recovers the database state from disk by loading the manifest and replaying the WAL. This process ensures that all committed data is restored, even after crashes or power failures.

### Crash Recovery Guarantees

SurrealKV provides the following guarantees:

- **Committed data is durable**: Any transaction that received a successful commit response will be recovered
- **Uncommitted data may be lost**: Transactions that crashed before commit are not recovered
- **Atomic recovery**: Either all of a transaction's writes are recovered, or none are

The strength of the durability guarantee depends on the `Durability` mode used at commit time:
- **Immediate**: Data was fsynced before commit returned; it will survive any crash
- **Eventual**: Data was written to OS buffers; it may be lost if the OS crashes before flushing

### Recovery Process Overview

Recovery proceeds in three phases:

1. **Initialization**: Load metadata and acquire exclusive access
2. **WAL Replay**: Reconstruct in-memory state from the write-ahead log
3. **Finalization**: Clean up orphaned files and start background tasks

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                                INITIALIZATION                               │
│                                                                             │
│  1. Acquire LOCK file (prevents concurrent access)                          │
│                                                                             │
│  2. Load Level Manifest                                                     │
│     - SSTable locations per level                                           │
│     - min_wal_number (WALs older than this are flushed)                     │
│     - last_sequence number                                                  │
│                                                                             │
│  3. Open/create required directories                                        │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│                                  WAL REPLAY                                 │
│                                                                             │
│  For each WAL segment with id >= min_wal_number:                            │
│                                                                             │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  Read records sequentially:                                           │  │
│  │                                                                       │  │
│  │  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐              │  │
│  │  │ Read Header │ ──► │ Verify CRC  │ ──► │ Decompress  │              │  │
│  │  │ (7 bytes)   │     │             │     │ (if needed) │              │  │
│  │  └─────────────┘     └─────────────┘     └─────────────┘              │  │
│  │         │                   │                   │                     │  │
│  │         ▼                   ▼                   ▼                     │  │
│  │  On corruption:       CRC mismatch:       Apply to MemTable           │  │
│  │  - TolerateWithRepair: truncate & continue                            │  │
│  │  - AbsoluteConsistency: return error                                  │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  If MemTable fills during replay → Flush to L0 SSTable                      │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│                                 FINALIZATION                                │
│                                                                             │
│  1. Set sequence numbers:                                                   │
│     visible_seq_num = max(manifest_last_seq, wal_max_seq)                   │
│     log_seq_num = visible_seq_num + 1                                       │
│                                                                             │
│  2. Cleanup orphaned files:                                                 │
│     - SST files not in manifest → delete                                    │
│     - WAL files < min_wal_number → delete                                   │
│                                                                             │
│  3. Start background tasks:                                                 │
│     - MemTable flush task                                                   │
│     - Level compaction task                                                 │
│     - VLog GC task (if VLog enabled)                                        │
└─────────────────────────────────────────────────────────────────────────────┘
```

### WAL Record Format

The WAL uses a block-based format inspired by LevelDB/RocksDB. Each block is 32KB, and records that don't fit in a single block are fragmented across multiple blocks.

```
┌───────────────────────────────────────────────────────────────────────────┐
│                         WAL BLOCK (32KB)                                  │
├───────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌───────┬───────┬───────┬─────────────────────────────────────────────┐  │
│  │  CRC  │ Size  │ Type  │                 Payload                     │  │
│  │ (4B)  │ (2B)  │ (1B)  │              (variable)                     │  │
│  └───────┴───────┴───────┴─────────────────────────────────────────────┘  │
│                                                                           │
│  Record Types:                                                            │
│    0 = Empty (padding)                                                    │
│    1 = Full (complete record in one block)                                │
│    2 = First (start of fragmented record)                                 │
│    3 = Middle (continuation of fragmented record)                         │
│    4 = Last (end of fragmented record)                                    │
│                                                                           │
│  CRC covers: Type byte + Payload                                          │
└───────────────────────────────────────────────────────────────────────────┘
```

**Record Fragmentation:**

Large records that don't fit in the remaining space of a block are split:
- **First**: Contains the beginning of the record
- **Middle**: Contains continuation data (may be multiple)
- **Last**: Contains the final portion

During replay, fragments are reassembled before processing.

### Recovery Modes

SurrealKV supports two recovery modes that control behavior when WAL corruption is detected:

| Mode | Behavior |
|------|----------|
| **AbsoluteConsistency** | Return error on any corruption; do not start |
| **TolerateWithRepair** | Truncate corrupted tail and continue recovery |

`TolerateWithRepair` is useful for recovering from crashes that left partial writes. The corrupted data represents uncommitted transactions (they crashed before completing), so truncating them maintains consistency.

### Manifest's Role in Recovery

The manifest tracks the durable state of the LSM tree:
- Which SSTables exist at each level
- Their key ranges and file sizes
- The `min_wal_number`: WALs older than this have been fully flushed to SSTables

During recovery, SurrealKV only replays WALs with `id >= min_wal_number`. Older WALs contain data that is already in SSTables and can be safely deleted.

### Orphan File Cleanup

After recovery, the database may contain orphaned files:
- **SST files not in manifest**: Created but crashed before manifest update; safe to delete
- **WAL files < min_wal_number**: Already flushed to SSTables; safe to delete
- **Temporary files**: Incomplete compaction outputs; safe to delete

This cleanup ensures disk space is reclaimed after crashes.

---

## Checkpoint and Restore

Checkpoints create consistent point-in-time snapshots of the database for backup and recovery. Checkpoints capture the entire database state at a specific moment, enabling point-in-time recovery.

### Use Cases

**Backup and Disaster Recovery:**
- Create periodic checkpoints to external storage (S3, NFS, etc.)
- Restore from checkpoint after catastrophic failure
- Geographic distribution of backup copies

**Testing and Development:**
- Checkpoint production data for test environments
- Restore to known state between test runs
- Debug issues with production data snapshots

**Migration:**
- Checkpoint before major upgrades
- Transfer database between machines
- Clone database for read replicas

### Why Flush Before Checkpoint?

A checkpoint must capture a consistent state. The active memtable contains uncommitted and in-flight data that may be partially applied. By flushing to SSTables first:

1. All committed data is durably stored in files
2. The checkpoint captures exactly what was committed at that moment
3. Recovery from checkpoint is deterministic

### Checkpoint Process

```
┌─────────────────────────────────────────────────────────────────────────┐
│                         CHECKPOINT CREATION                             │
│                                                                         │
│  1. PREPARATION                                                         │
│     - Flush active memtable to SSTable                                  │
│     - Wait for all immutable memtables to flush                         │
│     - Capture current sequence number                                   │
│                                                                         │
│  2. COPY FILES TO CHECKPOINT DIRECTORY                                  │
│     checkpoint_path/                                                    │
│     ├── sstables/         ← Copy all SSTable files                      │
│     ├── wal/              ← Copy WAL segments                           │
│     ├── manifest/         ← Copy manifest file                          │
│     ├── vlog/             ← Copy VLog files (if enabled)                │
│     ├── discard_stats/    ← Copy discard stats (if VLog)                │
│     ├── delete_list/      ← Copy delete list (if VLog)                  │
│     └── metadata.json     ← Checkpoint metadata                         │
│                                                                         │
│  3. METADATA                                                            │
│     {                                                                   │
│       "timestamp": 1234567890,                                          │
│       "sequence_number": 12345,                                         │
│       "sstable_count": 42,                                              │
│       "total_size": 1073741824                                          │
│     }                                                                   │
└─────────────────────────────────────────────────────────────────────────┘
```

### Restore Flow

```
┌─────────────────────────────────────────────────────────────────────────┐
│                            RESTORE PROCESS                              │
│                                                                         │
│  1. VALIDATION                                                          │
│     - Verify checkpoint metadata exists                                 │
│     - Verify all required files present                                 │
│                                                                         │
│  2. REPLACE DATABASE FILES                                              │
│     - Clear current sstables/, wal/, manifest/, vlog/ directories       │
│     - Copy files from checkpoint                                        │
│                                                                         │
│  3. RELOAD STATE                                                        │
│     - Load manifest from checkpoint                                     │
│     - Clear in-memory memtables (discards uncommitted data)             │
│     - Replay WAL segments from checkpoint                               │
│     - Set sequence numbers from checkpoint                              │
│                                                                         │
│  4. RESUME OPERATIONS                                                   │
│     - Database returns to checkpoint state                              │
│     - Any data written after checkpoint is discarded                    │
└─────────────────────────────────────────────────────────────────────────┘
```