oxibase 0.2.2

Autonomous relational database management system with MVCC, time-travel queries, and full ACID compliance
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
---
layout: default
title: Persistence
parent: Architecture
nav_order: 7
---

# Persistence

Oxibase provides durable storage through a combination of Write-Ahead Logging (WAL) and periodic snapshots. This architecture ensures data durability while maintaining high performance.

## Overview

Oxibase's persistence layer consists of two main components:

1. **Write-Ahead Log (WAL)**: Records all changes before they're applied to memory
2. **Snapshots**: Periodic full copies of the database state

This dual approach provides:
- **Durability**: Changes are persisted before acknowledgment
- **Fast Recovery**: Snapshots reduce recovery time
- **Crash Safety**: WAL ensures no committed transactions are lost

## Enabling Persistence

To enable persistence, use a `file://` connection string:

```rust
use oxibase::Database;

// In-memory only (no persistence)
let db = Database::open("memory://")?;

// With disk persistence
let db = Database::open("file:///path/to/database")?;
```

Command line:
```bash
# In-memory
oxibase

# With persistence
oxibase --db "file:///path/to/database"
```

## Write-Ahead Log (WAL)

### How WAL Works

1. When a transaction commits, changes are first written to the WAL file
2. The WAL is synced to disk (based on sync_mode)
3. Changes are then applied to the in-memory structures
4. Transaction is acknowledged to the client

This sequence ensures that committed transactions survive crashes.

### WAL Manager Responsibilities

The WAL manager handles sequential writes with durability guarantees:

- **Sequential writes** with CRC32 checksums for corruption detection
- **Two-phase recovery**: Phase 1 identifies committed transactions, Phase 2 applies their changes
- **Optional compression** for reduced I/O when enabled
- **Atomic writes** ensuring either complete entries or none

**WAL Entry Structure:**
```
[WAL Header]
Magic: 0x57414C00
Version: 1
Checksum: CRC32

[Entry 1]
Timestamp: i64
TxnId: i64
OpType: Insert/Update/Delete
TableId: i64
RowData: [serialized row]
Checksum: CRC32

[Entry 2]
...
```

### WAL Operations

The WAL records these operations:
- **INSERT**: New row insertions with full row data
- **UPDATE**: Row modifications with old→new value pairs
- **DELETE**: Row deletions with deletion markers
- **CREATE TABLE**: Table creation with schema metadata (DDL)
- **DROP TABLE**: Table deletion with cleanup operations (DDL)
- **CREATE INDEX**: Index creation with index specifications (DDL)
- **DROP INDEX**: Index deletion operations (DDL)
- **ALTER TABLE**: Schema changes with transformation rules

### WAL Segment Management

WAL files are managed in segments for efficient rotation:

- **Active segment**: Currently being written to
- **Archived segments**: Completed segments waiting for snapshot
- **Cleanup**: Old segments deleted after successful snapshots
- **Recovery**: Segments replayed in order during startup

### WAL Configuration

Configure WAL behavior using PRAGMA:

```sql
-- Sync mode: 0=None, 1=Normal (default), 2=Full
PRAGMA sync_mode = 1;

-- Number of operations before automatic WAL flush
PRAGMA wal_flush_trigger = 32768;
```

| Sync Mode | Value | Behavior |
|-----------|-------|----------|
| None | 0 | No sync (fastest, but data may be lost on crash) |
| Normal | 1 | Sync on commit (balanced performance and durability) |
| Full | 2 | Sync every operation (slowest, maximum durability) |

### WAL Files

WAL files are stored in the database directory:
```
/path/to/database/
  wal/
    wal_000001.log
    wal_000002.log
    ...
```

Old WAL files are automatically cleaned up after successful snapshots.

## Snapshots

### How Snapshots Work

Snapshots capture the complete database state at a point in time:
1. All table data and schema
2. All index definitions
3. Current transaction state

After a snapshot is created, older WAL entries can be safely deleted.

### Snapshot System Architecture

The snapshot system uses binary format with atomic writes:

- **Binary format** with magic bytes (`0x50414E53` = "SNAP")
- **Atomic 3-phase writes**: temp file → sync → rename for crash safety
- **Tracks source LSN** for incremental recovery
- **Compressed storage** to reduce disk usage

**Snapshot File Structure:**
```
[Snapshot Header]
Magic: 0x50414E53
Version: 1
Timestamp: i64
LSN: i64 (WAL position)
TableCount: u32
Checksum: CRC32

[Table 1 Schema]
TableId: i64
Name: String
Columns: [Column definitions]
Indexes: [Index definitions]

[Table 1 Data]
RowCount: u64
Rows: [serialized rows...]

[Table 2 Schema]
...

[Footer]
TotalSize: u64
Checksum: CRC32
```

### 3-Phase Atomic Write Process

Snapshots use atomic writes to prevent corruption:

1. **Phase 1 - Write**: Create temporary file with complete snapshot data
2. **Phase 2 - Sync**: Force all writes to disk with fsync()
3. **Phase 3 - Rename**: Atomically move temp file to final location

This ensures either a complete, valid snapshot exists, or none at all.

### Snapshot Configuration

```sql
-- Interval between automatic snapshots (in seconds, default: 300)
PRAGMA snapshot_interval = 300;

-- Number of snapshots to retain (default: 5)
PRAGMA keep_snapshots = 5;

-- Manually create a snapshot
PRAGMA create_snapshot;
```

### Snapshot Files

Snapshots are stored as binary files:
```
/path/to/database/
  snapshots/
    snapshot_1704067200.bin
    snapshot_1704067500.bin
    ...
```

The filename includes the Unix timestamp of creation for easy sorting.

### Snapshot Retention Policy

- **Automatic cleanup**: Old snapshots deleted when `keep_snapshots` exceeded
- **Time-based retention**: Can be configured for compliance requirements
- **Manual retention**: Important snapshots can be preserved indefinitely

## Recovery Process

When opening a database, Oxibase performs recovery automatically:

### 2-Phase Recovery Algorithm

**Phase 1 - Analysis:**
1. Read snapshot metadata to get checkpoint LSN (Log Sequence Number)
2. Scan WAL entries to identify committed vs. uncommitted transactions
3. Build transaction dependency graph for proper ordering

**Phase 2 - Redo:**
1. Load table snapshots for fastest recovery path
2. Replay WAL entries from checkpoint LSN forward
3. Apply only entries from committed transactions
4. Rebuild indexes in single pass after replay completes

### Detailed Recovery Flow

```mermaid
graph TD
    A[Database Open] --> B{Find Latest Snapshot?}
    B -->|Yes| C[Load Snapshot Metadata]
    B -->|No| D[Start from Empty State]

    C --> E[Get Checkpoint LSN]
    D --> E

    E --> F[Phase 1: Analyze WAL]
    F --> G[Identify Committed Transactions]
    G --> H[Build Transaction Graph]

    H --> I[Phase 2: Replay WAL]
    I --> J[Apply Committed Entries]
    J --> K[Rebuild Indexes]

    K --> L[Verify Data Consistency]
    L --> M[Database Ready]
```

### WAL Replay Process

During WAL replay, Oxibase:

1. **Validates checksums** on each WAL entry for corruption detection
2. **Applies operations** in strict commit order using LSN sequencing
3. **Handles DDL operations** by reconstructing schema changes
4. **Rebuilds indexes** from scratch after all data is loaded (faster than incremental updates)
5. **Validates constraints** to ensure data integrity

### Index Rebuilding

After WAL replay, indexes are rebuilt efficiently:

- **Single-pass rebuild**: All indexes populated in one table scan
- **Batch operations**: Reduces lock contention during rebuild
- **Parallel building**: Multiple indexes built concurrently where possible
- **Memory optimization**: Uses arena-based storage to minimize allocations

### Recovery Performance

Recovery time depends on:
- **Snapshot recency**: More recent snapshots = less WAL to replay
- **WAL size**: Amount of uncheckpointed changes
- **Index complexity**: Time to rebuild indexes after replay
- **Hardware**: SSDs provide much faster recovery than HDDs

### Error Recovery

If corruption is detected during recovery:

1. **Partial recovery**: Recover up to last valid WAL entry
2. **Snapshot fallback**: Use older snapshot if current is corrupted
3. **Logging**: Detailed logs of recovery progress and any issues
4. **Graceful degradation**: Database opens in consistent state even with some data loss

Recovery is transparent to the application.

### Recovery Example

```rust
// Opening automatically triggers recovery if needed
let db = Database::open("file:///path/to/database")?;

// Database is ready with all committed data restored
let results = db.query("SELECT * FROM users")?;
```

### Recovery Monitoring

Monitor recovery progress programmatically:

```sql
-- Check if database is in recovery mode
PRAGMA recovery_status;

-- Get recovery statistics after startup
PRAGMA recovery_stats;
```

## Configuration Options

### Connection String Parameters

```
file:///path/to/database?sync_mode=2&snapshot_interval=60&keep_snapshots=3
```

| Parameter | Description | Default |
|-----------|-------------|---------|
| sync_mode | WAL sync mode (0, 1, 2) | 1 |
| snapshot_interval | Seconds between snapshots | 300 |
| keep_snapshots | Number of snapshots to keep | 5 |

### PRAGMA Commands

```sql
-- Read current settings
PRAGMA sync_mode;
PRAGMA snapshot_interval;
PRAGMA keep_snapshots;
PRAGMA wal_flush_trigger;

-- Modify settings
PRAGMA sync_mode = 2;
PRAGMA snapshot_interval = 60;
PRAGMA keep_snapshots = 3;
PRAGMA wal_flush_trigger = 10000;

-- Manually trigger snapshot
PRAGMA create_snapshot;
```

## Performance Characteristics

### WAL Performance

| Sync Mode | Durability | Performance Impact | Use Case |
|-----------|------------|-------------------|----------|
| **0 (None)** | None | ~5-10x faster | Development only |
| **1 (Normal)** | Commit durability | ~2x slower | General production |
| **2 (Full)** | Every operation | ~10x slower | Critical data |

### Snapshot Performance

- **Creation time**: Proportional to database size
- **Memory usage**: Minimal (shared data structures)
- **I/O pattern**: Sequential writes, atomic rename
- **Compression**: Optional, reduces size by 20-50%

### Recovery Performance

| Scenario | Recovery Time | Notes |
|----------|---------------|-------|
| **Clean shutdown** | < 1 second | No WAL replay needed |
| **Recent snapshot** | 1-30 seconds | Minimal WAL replay |
| **No recent snapshot** | 1-5 minutes | Full WAL replay |
| **Large database** | 5-30 minutes | Index rebuilding |

### WAL Compression

When enabled, WAL compression provides:

- **Space savings**: 30-70% reduction in WAL size
- **CPU overhead**: 5-15% increase in write latency
- **Recovery speed**: Faster due to smaller files
- **Memory usage**: Minimal additional overhead

## Best Practices

### Durability vs Performance

Choose sync_mode based on your requirements:

| Use Case | Recommended sync_mode | Trade-offs |
|----------|----------------------|------------|
| Development/Testing | 0 (None) | Fastest, no durability |
| General Use | 1 (Normal) | Balanced performance/durability |
| Financial/Critical Data | 2 (Full) | Maximum durability, slower |

### Advanced Configuration

#### WAL Tuning

```sql
-- For high-write workloads
PRAGMA wal_flush_trigger = 10000;  -- Flush every 10k operations
PRAGMA sync_mode = 1;               -- Normal sync

-- For low-write, high-durability
PRAGMA wal_flush_trigger = 1000;   -- Flush more frequently
PRAGMA sync_mode = 2;               -- Full sync
```

#### Snapshot Strategy

```sql
-- For high-availability systems
PRAGMA snapshot_interval = 60;     -- Every minute
PRAGMA keep_snapshots = 10;        -- Keep more snapshots

-- For cost-optimized systems
PRAGMA snapshot_interval = 3600;   -- Every hour
PRAGMA keep_snapshots = 3;         -- Fewer snapshots
```

### Snapshot Frequency

- **Frequent snapshots** (low interval): Faster recovery, more disk I/O
- **Infrequent snapshots** (high interval): Slower recovery, less disk I/O

For databases with high write rates, consider shorter intervals.

### Disk Space Management

Monitor disk usage:
- WAL files grow until the next snapshot
- Old snapshots are retained based on `keep_snapshots`
- Plan for peak WAL size between snapshots

**Disk Space Formula:**
```
Peak WAL Size = Write Rate × Snapshot Interval
Total Snapshots = keep_snapshots × Average Snapshot Size
```

### Backup Strategy

For backups:
1. Create a manual snapshot: `PRAGMA create_snapshot;`
2. Copy the entire database directory while the database is idle
3. For hot backups, use filesystem snapshots (ZFS, LVM)

### Monitoring and Maintenance

#### Health Checks

```sql
-- WAL statistics
PRAGMA wal_stats;

-- Snapshot information
PRAGMA snapshot_info;

-- Recovery status
PRAGMA recovery_status;
```

#### Maintenance Tasks

```sql
-- Force WAL checkpoint (creates snapshot)
PRAGMA create_snapshot;

-- Clean up old snapshots manually
PRAGMA cleanup_snapshots;

-- Validate database integrity
PRAGMA integrity_check;
```

## Directory Structure

A persistent database creates this directory structure:

```
/path/to/database/
  db.lock              # Lock file for single-writer
  wal/
    wal_NNNNNN.log     # WAL segment files
  snapshots/
    snapshot_TIMESTAMP.bin  # Snapshot files
```

## Error Handling

### Corrupt WAL

If WAL corruption is detected during recovery:
- Oxibase attempts to recover up to the last valid entry
- Corrupted entries at the end are discarded
- A warning is logged

### Disk Full

If disk becomes full:
- WAL writes will fail
- Transactions will be rolled back
- Free disk space before continuing

### Lock Contention

Only one process can open a database directory:
- A lock file (`db.lock`) prevents concurrent access
- If a previous process crashed, the lock is automatically released on open

## Example: Complete Configuration

```rust
use oxibase::Database;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open with custom persistence settings
    let db = Database::open(
        "file:///var/lib/myapp/data?sync_mode=2&snapshot_interval=120"
    )?;

    // Fine-tune at runtime
    db.execute("PRAGMA wal_flush_trigger = 5000")?;
    db.execute("PRAGMA keep_snapshots = 7")?;

    // Your application logic...
    db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY AUTO_INCREMENT, data JSON)")?;

    // Force a snapshot before maintenance
    db.execute("PRAGMA create_snapshot")?;

    Ok(())
}
```