oxibase 0.2.2

Autonomous relational database management system with MVCC, time-travel queries, and full ACID compliance
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
---
layout: default
title: Storage Engine
parent: Architecture
nav_order: 2
---

# Storage Engine

This document provides a detailed overview of Oxibase's storage engine, including its design principles, components, and how data is stored and retrieved.

## Storage Engine Design

Oxibase's storage engine is designed with the following principles:

- **Memory-optimized** - Prioritizes in-memory performance with optional persistence
- **MVCC-based** - Uses multi-version concurrency control for transaction isolation
- **Version-organized** - Tracks different versions of rows for transaction isolation
- **Type-specialized** - Uses different strategies for different data types
- **Index-accelerated** - Multiple index types to optimize different query patterns

## Architecture Overview

```mermaid
graph TB
    subgraph "Engine Layer - Orchestration"
        MVCCEngine["MVCCEngine<br/>src/storage/mvcc/engine.rs<br/>• Table lifecycle<br/>• Transaction registry<br/>• Persistence manager<br/>• Schema catalog"]

        TransactionRegistry["TransactionRegistry<br/>• Transaction ID allocation<br/>• Commit sequence numbers<br/>• Visibility checking"]

        PersistenceManager["PersistenceManager<br/>• WAL management<br/>• Snapshot creation<br/>• Recovery coordination"]
    end

    subgraph "Table Layer - Transaction-aware Access"
        MVCCTable["MVCCTable<br/>src/storage/mvcc/table.rs<br/>• Transaction isolation<br/>• Row operations<br/>• Index selection<br/>• Commit/rollback"]

        TransactionVersionStore["TransactionVersionStore<br/>• Uncommitted changes<br/>• Write-set tracking<br/>• Conflict detection"]
    end

    subgraph "Storage Layer - Version Management"
        VersionStore["VersionStore<br/>src/storage/mvcc/version_store.rs<br/>• Version chains<br/>• Visibility filtering<br/>• Arena-based scanning<br/>• Zone maps"]

        RowArena["RowArena<br/>• Contiguous memory storage<br/>• 50x+ scan speedup<br/>• Zero-copy reads"]

        VersionChainEntry["VersionChainEntry<br/>• Current version<br/>• Arc prev pointer<br/>• Arena index"]
    end

    subgraph "Index Layer"
        BTreeIndex["BTreeIndex<br/>src/storage/mvcc/btree_index.rs<br/>• Range queries O(log n + k)<br/>• Cached min/max"]

        HashIndex["HashIndex<br/>src/storage/mvcc/hash_index.rs<br/>• Equality O(1)<br/>• ahash for TEXT/JSON"]

        BitmapIndex["BitmapIndex<br/>• Boolean columns<br/>• Bitwise AND/OR/NOT"]

        MultiColumnIndex["MultiColumnIndex<br/>• Composite key lookups<br/>• Prefix matching"]
    end

    MVCCEngine -->|creates| MVCCTable
    MVCCEngine -->|manages| TransactionRegistry
    MVCCEngine -->|uses| PersistenceManager

    MVCCTable -->|reads/writes| TransactionVersionStore
    MVCCTable -->|delegates to| VersionStore

    TransactionVersionStore -->|commits to| VersionStore

    VersionStore -->|stores versions in| RowArena
    VersionStore -->|maintains| VersionChainEntry
    VersionStore -->|updates| BTreeIndex
    VersionStore -->|updates| HashIndex
    VersionStore -->|updates| BitmapIndex
    VersionStore -->|updates| MultiColumnIndex

    TransactionRegistry -.checks visibility.-> VersionStore
```

## Core Components and Responsibilities

The storage engine consists of four main classes that handle different aspects of data storage and retrieval:

| Component | File | Primary Responsibilities |
|-----------|------|--------------------------|
| `MVCCEngine` | [engine.rs:254-279] | Database lifecycle, table creation/deletion, transaction coordination, schema management, persistence orchestration |
| `MVCCTable` | [table.rs:36-45] | Per-transaction table view, row insert/update/delete, index lookup optimization, commit/rollback execution |
| `VersionStore` | [version_store.rs:157-193] | Version chain management, visibility checking, arena-based scanning, row counting, index population |
| `TransactionVersionStore` | [version_store.rs:1354-1414] | Uncommitted changes buffer, write-set conflict tracking, local version cache |

## Storage Engine Data Flow

This diagram shows how a typical transaction operation flows through the storage layers, from the public API down to the physical storage structures.

```mermaid
sequenceDiagram
    participant Client
    participant Transaction as "Transaction<br/>(MvccTransaction)"
    participant Engine as "MVCCEngine"
    participant Table as "MVCCTable"
    participant TxnStore as "TransactionVersionStore"
    participant VerStore as "VersionStore"
    participant Index as "BTreeIndex/HashIndex"
    participant WAL as "WALManager"

    Client->>Engine: begin_transaction()
    Engine->>TransactionRegistry: allocate txn_id
    Engine->>Transaction: MvccTransaction(txn_id)

    Client->>Transaction: execute("INSERT INTO users VALUES (...)")
    Transaction->>Engine: get_table_for_transaction("users")
    Engine->>Table: MVCCTable::new_with_shared_store()

    Transaction->>Table: insert(row)
    Table->>TxnStore: add_local_version(row_version)
    Note over TxnStore: Buffered in memory<br/>Not yet visible

    Client->>Transaction: commit()
    Transaction->>Table: commit()

    Table->>Index: update indexes (old_value, new_value)
    Table->>TxnStore: commit()
    TxnStore->>VerStore: add_versions_batch()

    VerStore->>RowArena: insert_row()
    Note over VerStore: Versions now visible<br/>to new transactions

    VerStore->>WAL: record_operation()
    WAL-->>VerStore: LSN

    Table-->>Transaction: Ok
    Transaction->>TransactionRegistry: mark_committed(commit_seq)
    Transaction-->>Client: Ok
```

## Storage Components

### Table Structure

Tables in Oxibase are composed of:

- **Metadata** - Schema information, column definitions, and indexes
- **Row Data** - The primary data storage, organized by row
- **Version Store** - Tracks row versions for MVCC
- **Indexes** - B-tree, Hash, Bitmap, and multi-column indexes
- **Transaction Manager** - Manages transaction state and visibility

### Data Types

Oxibase supports a variety of data types, each with optimized storage:

- **INTEGER** - 64-bit signed integers
- **FLOAT** - 64-bit floating-point numbers
- **TEXT** - Variable-length UTF-8 strings
- **BOOLEAN** - Boolean values (true/false)
- **TIMESTAMP** - Date and time values
- **JSON** - JSON documents
- **NULL** - Null values supported for all types

### Version Management

Oxibase tracks different versions of data for transaction isolation:

- Each change creates a new version rather than overwriting
- Versions are associated with transaction IDs
- Visibility rules determine which versions each transaction can see
- Old versions are garbage collected when no longer needed

## Data Storage Format

### In-Memory Format

In memory, data is stored with these characteristics:

- **Row-based primary storage** - Records are stored as coherent rows
- **Version chains** - Linked versions for MVCC
- **Type-specific indexes** - B-tree, Hash, Bitmap based on column type
- **Efficient structures** - Optimized for different data types

### On-Disk Format

When persistence is enabled, data is stored on disk with:

- **Binary serialization** - Compact binary format for storage
- **WAL files** - Write-ahead log for durability
- **Snapshot files** - Point-in-time table snapshots
- **Metadata files** - Schema and index information

## MVCC Implementation

The storage engine uses MVCC to provide transaction isolation:

- **Full Version Chains** - Version history per row linked via pointers
- **Transaction IDs** - Each version is associated with a transaction ID
- **Visibility Rules** - Traverse version chains to find visible versions
- **Lock-Free Reads** - Readers never block writers
- **Automatic Cleanup** - Old versions garbage collected when no longer needed

For more details, see the [MVCC Implementation](mvcc-implementation) and [Transaction Isolation](transaction-isolation) documentation.

### Version Chain Management

Oxibase implements MVCC by maintaining linked version chains for each row. Each `VersionChainEntry` contains a `RowVersion` and an `Arc` pointer to the previous version, enabling O(1) chain cloning for snapshot isolation.

```mermaid
graph LR
    subgraph "versions: ConcurrentInt64Map<VersionChainEntry>"
        row1["row_id: 1"]
        row2["row_id: 2"]
        row3["row_id: 3"]
    end

    subgraph "Version Chain for row_id=1"
        v3["VersionChainEntry<br/>txn_id: 30<br/>deleted_at: 30<br/>arena_idx: None"]
        v2["VersionChainEntry<br/>txn_id: 20<br/>deleted_at: 0<br/>arena_idx: Some(15)"]
        v1["VersionChainEntry<br/>txn_id: 10<br/>deleted_at: 0<br/>arena_idx: Some(5)"]

        v3 -->|"prev: Arc"| v2
        v2 -->|"prev: Arc"| v1
        v1 -->|"prev: None"| end1["Initial version"]
    end

    subgraph "RowArena - Contiguous Memory"
        arena_idx_5["idx 5: row_id=1, txn=10, data=[...]"]
        arena_idx_15["idx 15: row_id=1, txn=20, data=[...]"]
        arena_other["Other rows..."]
    end

    row1 --> v3
    v1 -.arena_idx: Some(5).-> arena_idx_5
    v2 -.arena_idx: Some(15).-> arena_idx_15
```

Key data structures:

- **`versions: ConcurrentInt64Map<VersionChainEntry>`** - Maps `row_id` to the head of its version chain
- **`VersionChainEntry`** - Linked list node with `version: RowVersion`, `prev: Option<Arc<...>>`, and `arena_idx: Option<usize>`
- **`RowArena`** - Contiguous memory storage providing 50x+ faster full table scans via zero-copy reads

The use of `Arc` for the `prev` pointer enables cheap snapshot cloning—creating a snapshot of a table is O(1) because the chain is reference-counted rather than deep-copied.

### Transaction Version Store

The `TransactionVersionStore` buffers uncommitted changes within a transaction, maintaining write-set information for conflict detection during commit.

| Field | Type | Purpose |
|-------|------|---------|
| `local_versions` | `Int64Map<RowVersion>` | Uncommitted inserts/updates/deletes |
| `write_set` | `Int64Map<WriteSetEntry>` | Conflict detection: tracks read version and sequence number |
| `version_store` | `Arc<VersionStore>` | Reference to global committed storage |
| `txn_id` | `i64` | Transaction identifier |

**Commit process:**
1. Check uniqueness constraints on all indexes
2. Update indexes with (old_value, new_value) pairs
3. Flush `local_versions` to `VersionStore` in batch
4. Mark zone maps as stale (for optimizer statistics)

**Rollback process:**
1. Discard all entries in `local_versions`
2. Clear `write_set` tracking data

### Index Architecture

Oxibase provides three index types, each optimized for different query patterns. The `MVCCTable` automatically selects the optimal index type based on column data types.

```mermaid
graph TB
    subgraph "Index Selection Logic"
        auto_select["MVCCTable::auto_select_index_type()<br/>table.rs:105-130"]
    end

    subgraph "Index Type Decision Tree"
        check_bool["Is column BOOLEAN?"]
        check_text["Is column TEXT/JSON?"]
        check_numeric["Is column INTEGER/FLOAT/TIMESTAMP?"]

        bitmap_idx["IndexType::Bitmap<br/>Fast AND/OR/NOT<br/>Low cardinality"]
        hash_idx["IndexType::Hash<br/>O(1) equality<br/>Avoids O(strlen)"]
        btree_idx["IndexType::BTree<br/>O(log n) range queries<br/>Cached min/max"]
    end

    subgraph "Index Implementations"
        BTreeIndexImpl["BTreeIndex<br/>btree_index.rs:74-113<br/>• BTreeMap: sorted values<br/>• AHashMap: O(1) equality<br/>• FxHashMap: row->value"]
        HashIndexImpl["HashIndex<br/>hash_index.rs:105-127<br/>• ahash: fast hashing<br/>• SmallVec: inline storage<br/>• Collision handling"]
        BitmapIndexImpl["BitmapIndex<br/>• Roaring bitmaps<br/>• Bitwise operations<br/>• Compressed storage"]
    end

    auto_select --> check_bool
    check_bool -->|Yes, single column| bitmap_idx
    check_bool -->|No| check_text
    check_text -->|Yes| hash_idx
    check_text -->|No| check_numeric
    check_numeric -->|Yes| btree_idx

    bitmap_idx --> BitmapIndexImpl
    hash_idx --> HashIndexImpl
    btree_idx --> BTreeIndexImpl
```

**Index selection rules:**
- **BOOLEAN**`BitmapIndex` for fast bitwise AND/OR/NOT operations
- **TEXT/JSON**`HashIndex` to avoid O(strlen) comparisons in B-tree nodes
- **INTEGER/FLOAT/TIMESTAMP**`BTreeIndex` for range query support
- **Mixed types**`BTreeIndex` as safe default

### BTreeIndex Internal Structure

The `BTreeIndex` maintains dual data structures for optimal performance across different query types:

| Data Structure | Type | Purpose | Complexity |
|----------------|------|---------|------------|
| `sorted_values` | `RwLock<BTreeMap<Value, RowIdSet>>` | Range queries, MIN/MAX | O(log n + k) |
| `value_to_rows` | `RwLock<AHashMap<Value, RowIdSet>>` | Equality lookups | O(1) |
| `row_to_value` | `RwLock<FxHashMap<i64, Value>>` | Removal by row_id | O(1) |
| `cached_min` | `RwLock<Option<Value>>` | MIN aggregate | O(1) |
| `cached_max` | `RwLock<Option<Value>>` | MAX aggregate | O(1) |

The dual-index strategy (BTreeMap + HashMap) trades memory (~2x for unique values) for optimal query performance: O(1) equality via hash lookup and O(log n + k) range queries via sorted iteration.

### HashIndex Characteristics

The `HashIndex` is designed for high-cardinality TEXT columns where equality queries dominate:

**Advantages:**
- O(1) exact match via `ahash` (faster than SipHash)
- Avoids O(strlen) string comparisons in B-tree traversal
- `SmallVec<[i64; 4]>` reduces allocations for unique indexes

**Limitations:**
- Does **NOT** support range queries
- Does **NOT** support ORDER BY optimization
- Requires exact match on all indexed columns (no partial key lookups)

**Triple-lock write pattern:**
```
add() acquires:
  1. hash_to_rows: Write
  2. row_to_hash: Write
  3. hash_to_values: Write
```

This ensures atomicity but serializes concurrent writes. Read operations (`find`) only acquire a single read lock for minimal contention.

### Row Arena: Zero-Copy Scanning

The `RowArena` stores row data in contiguous memory to eliminate per-row allocation overhead during full table scans:

**Performance characteristics:**
- **50x+ speedup** for full table scans vs. per-row cloning
- Pre-acquire locks once per scan (O(1) instead of O(N))
- Direct slice access via `unsafe` bounds-checked reads
- Cache locality from contiguous layout

**Arena structure:**
```
arena_rows: Vec<RowMetadata>  // [start, end, row_id, txn_id, create_time, deleted_at]
arena_data: Vec<Value>         // Flattened row data
row_arena_index: Int64Map<usize>  // row_id -> arena_idx
```

**Usage pattern:**
1. Acquire `arena_rows` and `arena_data` read guards once
2. Iterate version chains checking visibility
3. For visible versions, read directly from arena via `arena_idx`
4. Release locks after iteration completes
5. Sort results by `row_id` (if needed)

## Data Access Paths

### Point Lookups

For point queries (e.g., `WHERE id = 5`):

1. Use primary key or index to locate the row
2. Apply visibility rules based on transaction
3. Return the visible version

### Range Scans

For range queries (e.g., `WHERE price > 100`):

1. Use B-tree index if available for the column
2. Scan matching index entries
3. Apply visibility rules to each row
4. Return visible results

### Full Table Scans

For queries without applicable indexes:

1. Scan all rows in the table
2. Apply WHERE clause filters
3. Apply visibility rules
4. For large tables, parallelize the scan

## Data Modification

### Insert Operations

When data is inserted:

1. Values are validated against column types
2. A new row version is created with the current transaction ID
3. The row is added to the primary row storage
4. Indexes are updated
5. The operation is recorded in the WAL (if enabled)

### Update Operations

When data is updated:

1. The existing row is located via indexes or scan
2. A new version is created with updated values
3. The new version links to the previous version
4. Indexes are updated to reflect the changes
5. The operation is recorded in the WAL (if enabled)

### Delete Operations

When data is deleted:

1. The existing row is located
2. A deletion marker version is created
3. Indexes are updated to reflect the deletion
4. The operation is recorded in the WAL (if enabled)

## Persistence and Recovery

When persistence is enabled:

### Write-Ahead Logging (WAL)

The storage engine provides crash consistency through Write-Ahead Logging (WAL) and periodic snapshots:

**WAL Manager responsibilities:**
- Sequential writes with CRC32 checksums
- Two-phase recovery: Phase 1 identifies committed transactions, Phase 2 applies their changes
- Compression (optional) for reduced I/O

**Snapshot system:**
- Binary format with magic bytes (`0x50414E53` = "SNAP")
- Atomic 3-phase writes: temp file → sync → rename
- Tracks source LSN for incremental recovery

**Recovery flow:**
1. Read snapshot metadata to get checkpoint LSN
2. Load table snapshots (fastest recovery path)
3. Replay WAL entries from checkpoint LSN forward
4. Apply only entries from committed transactions (two-phase)
5. Populate indexes in single pass after replay completes

### Write-Ahead Logging (WAL)

1. All modifications are recorded in the WAL before being applied
2. WAL entries include transaction ID, operation type, and data
3. WAL is flushed to disk for durability
4. This ensures recovery in case of crashes

### Snapshots

1. Periodically, consistent snapshots of tables are created
2. Snapshots contain the latest version of each row
3. Snapshots accelerate recovery compared to replaying the entire WAL

### Recovery Process

After a crash, recovery proceeds as follows:

1. The latest valid snapshot is loaded for each table
2. WAL entries after the snapshot are replayed
3. Index definitions are restored and indexes rebuilt
4. Incomplete transactions are rolled back

## Schema Management

The `MVCCEngine` maintains the schema catalog and handles DDL operations:

**Schema storage:**
- `schemas: Arc<RwLock<FxHashMap<String, Schema>>>`
- Lowercase table names for case-insensitive lookups
- Schema changes recorded to WAL

**ALTER TABLE operations:**
- `AddColumn`: Normalizes existing rows, adds default values
- `DropColumn`: Truncates row data
- `RenameColumn`: Updates schema, preserves data
- `ModifyColumn`: Type coercion with validation
- `RenameTable`: Updates all references atomically

**Schema normalization:**
When reading rows, `MVCCTable` normalizes row data to match the current schema, handling columns added/dropped via ALTER TABLE by filling defaults or truncating.

## Implementation Details

Core storage engine components in the Rust codebase:

```
src/storage/
├── mod.rs              # Storage module entry point
├── traits/             # Storage interfaces
│   ├── engine.rs       # Engine trait
│   ├── table.rs        # Table trait
│   └── transaction.rs  # Transaction trait
└── mvcc/               # MVCC implementation
    ├── engine.rs       # MVCC storage engine
    ├── table.rs        # Table with row storage
    ├── transaction.rs  # Transaction management
    ├── version_store.rs # Version tracking
    ├── btree_index.rs  # B-tree index
    ├── hash_index.rs   # Hash index
    ├── bitmap_index.rs # Bitmap index
    ├── multi_column_index.rs # Multi-column index
    └── persistence.rs  # WAL and snapshots
```

## Performance Characteristics

### Read Performance

- **Point Lookups** - O(1) with hash index, O(log n) with B-tree
- **Range Scans** - O(log n + k) with B-tree index
- **Full Scans** - Parallelized for large tables

### Write Performance

- **Inserts** - O(log n) per index
- **Updates** - O(log n) per index plus version creation
- **Deletes** - O(log n) per index for marker creation

### Concurrency

- **High Read Concurrency** - MVCC enables many concurrent readers
- **Write Concurrency** - Multiple writers with conflict detection
- **No Read Locks** - Readers never block on writes