featherdb-storage 1.0.0

B-tree storage engine with buffer pool, WAL, and compression for FeatherDB
# FeatherDB Storage Engine

The storage engine is the foundation of FeatherDB, providing durable, crash-safe storage with efficient page-based access, multiple eviction policies, a group-commit WAL, and adaptive page compression.

## Architecture Overview

```
                    ┌─────────────────────────────────────────┐
                    │            StorageEngine                 │
                    │  (Unified interface for storage ops)     │
                    └───────────────────┬─────────────────────┘
          ┌─────────────────────────────┼─────────────────────────────┐
          │                             │                             │
          ▼                             ▼                             ▼
┌─────────────────────┐     ┌─────────────────────┐     ┌─────────────────────┐
│     Buffer Pool     │     │       B-Tree        │     │        WAL          │
│   (Page Caching)    │     │     (Indexes)       │     │    (Durability)     │
│                     │     │                     │     │                     │
│ Eviction Policies:  │     │ - Insert/Get/Delete │     │ Sync Modes:         │
│ - Clock (default)   │     │ - Range iteration   │     │ - Immediate         │
│ - LRU-2             │     │ - Page splitting    │     │ - GroupCommit       │
│ - LIRS              │     │ - Leaf linking      │     │ - NoSync            │
└──────────┬──────────┘     └──────────┬──────────┘     └──────────┬──────────┘
           │                           │                           │
           └───────────────────────────┼───────────────────────────┘
                    ┌─────────────────────────────────────────┐
                    │             FileManager                  │
                    │                                         │
                    │  ┌─────────────┐    ┌───────────────┐   │
                    │  │ Compression │    │  Encryption   │   │
                    │  │ (LZ4/ZSTD)  │    │ (AES-256-GCM) │   │
                    │  └─────────────┘    └───────────────┘   │
                    │                                         │
                    │              Disk I/O                   │
                    └─────────────────────────────────────────┘
                    ┌─────────────────────────────────────────┐
                    │              FreeList                    │
                    │       (Bitmap page allocation)          │
                    └─────────────────────────────────────────┘
```

## Key Components

### 1. Buffer Pool (`buffer_pool.rs`)

Page cache with pluggable eviction algorithms. The buffer pool manages an in-memory cache of database pages, reducing disk I/O for frequently accessed data.

**Key Features:**
- Configurable capacity (number of pages)
- Multiple eviction policies (see below)
- RAII-based page guards for safe access
- Pin counting to prevent eviction of in-use pages
- Statistics tracking (hits, misses, evictions, flushes)

**Structures:**
- `BufferPool` - Main cache manager
- `BufferFrame` - Individual cached page with metadata (pin count, dirty flag, reference bit)
- `PageGuard` - Read-only RAII handle for pinned pages
- `PageGuardMut` - Mutable RAII handle for pinned pages
- `BufferPoolStats` / `BufferPoolStatsSnapshot` - Performance metrics

**Usage:**
```rust
use featherdb_storage::{BufferPool, PageType};
use featherdb_core::EvictionPolicyType;

// Create with specific eviction policy
let pool = BufferPool::with_policy(1000, file_manager, EvictionPolicyType::Lru2);

// Allocate a new page
let (page_id, guard) = pool.allocate_page(PageType::Leaf)?;
{
    let mut page = guard.write();
    page.insert_cell(b"my data");
}

// Get existing page for reading
let guard = pool.get_page(page_id)?;
let page = guard.read();

// Get existing page for writing
let guard = pool.get_page_for_write(page_id)?;
let mut page = guard.write();

// Check statistics
let stats = pool.stats_snapshot();
println!("Hit ratio: {:.2}%", stats.hit_ratio() * 100.0);
```

### 2. Eviction Policies (`eviction.rs`)

Three eviction algorithms are implemented, selectable at buffer pool creation time:

#### Clock (Default)
Simple second-chance algorithm. Each page has a reference bit that is set on access. The clock hand sweeps through pages, evicting the first page with reference bit = 0, clearing bits as it goes.
- **Pros:** Low overhead, O(1) average case
- **Cons:** Less accurate than LRU for some workloads
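The sweep described above can be sketched as a standalone structure. This is an illustrative `ClockSketch` (a hypothetical name, not the crate's `ClockEviction` API), showing only the reference bit and hand mechanics:

```rust
// Minimal second-chance (clock) sketch: each frame carries a reference
// bit set on access; the hand sweeps, clearing set bits and evicting
// the first frame whose bit is already clear.
struct ClockSketch {
    ref_bits: Vec<bool>,
    hand: usize,
}

impl ClockSketch {
    fn new(capacity: usize) -> Self {
        ClockSketch { ref_bits: vec![false; capacity], hand: 0 }
    }

    /// Record an access to frame `i` (sets its reference bit).
    fn touch(&mut self, i: usize) {
        self.ref_bits[i] = true;
    }

    /// Sweep until a frame with a clear reference bit is found.
    fn pick_victim(&mut self) -> usize {
        loop {
            if self.ref_bits[self.hand] {
                // Second chance: clear the bit and advance the hand.
                self.ref_bits[self.hand] = false;
                self.hand = (self.hand + 1) % self.ref_bits.len();
            } else {
                let victim = self.hand;
                self.hand = (self.hand + 1) % self.ref_bits.len();
                return victim;
            }
        }
    }
}
```

A real buffer pool additionally skips pinned frames during the sweep; that detail is omitted here.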

#### LRU-2
Tracks the last two access times for each page; the victim is the page whose second-most-recent access is oldest.
- **Pros:** Excellent scan resistance - pages accessed only once (like during a sequential scan) are evicted before frequently accessed pages
- **Cons:** Higher memory overhead per page
- **Correlated Reference Period:** Accesses within 10ms are considered correlated and don't count as separate accesses

#### LIRS (Low Inter-reference Recency Set)
Separates pages into two sets based on inter-reference recency:
- **LIR set (~99%):** Pages with low inter-reference recency (hot, frequently accessed)
- **HIR set (~1%):** Pages with high inter-reference recency (cold)

Only resident HIR pages are eligible for eviction. When a HIR page is re-accessed within the LIR set's recency window, it is promoted to LIR (demoting the coldest LIR page).
- **Pros:** Best overall performance, adapts to workload changes
- **Cons:** Most complex implementation, highest memory overhead

**Configuration:**
```rust
use featherdb_core::{Config, EvictionPolicyType};

let config = Config::new("mydb.db")
    .with_eviction_policy(EvictionPolicyType::Lirs);
```

### 3. B-Tree (`btree.rs`)

B+ tree implementation for ordered key-value storage. Leaf pages are linked for efficient range scans.

**Operations:**
- `insert(key, value)` - Insert or update a key-value pair
- `get(key)` - Point lookup, returns `Option<Vec<u8>>`
- `delete(key)` - Remove entry, returns old value if present
- `iter()` - Iterate over all entries
- `range(start, end)` - Iterate over a key range

**Cell Encoding:**
- Leaf cells: `[key_len:u16][value_len:u32][key][value]`
- Branch cells: `[child_id:u64][key_len:u16][key]`
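
A minimal codec for the documented leaf layout may make the byte positions concrete. The function names and the little-endian byte order are assumptions for illustration; the crate's actual encoding may differ:

```rust
// Illustrative codec for the leaf cell layout
// [key_len:u16][value_len:u32][key][value], assuming little-endian fields.
fn encode_leaf_cell(key: &[u8], value: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(6 + key.len() + value.len());
    buf.extend_from_slice(&(key.len() as u16).to_le_bytes());
    buf.extend_from_slice(&(value.len() as u32).to_le_bytes());
    buf.extend_from_slice(key);
    buf.extend_from_slice(value);
    buf
}

fn decode_leaf_cell(cell: &[u8]) -> (Vec<u8>, Vec<u8>) {
    let key_len = u16::from_le_bytes([cell[0], cell[1]]) as usize;
    let value_len =
        u32::from_le_bytes([cell[2], cell[3], cell[4], cell[5]]) as usize;
    let key = cell[6..6 + key_len].to_vec();
    let value = cell[6 + key_len..6 + key_len + value_len].to_vec();
    (key, value)
}
```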

**Key Design Decisions:**
- Variable-length keys and values
- Leaf pages linked via `right_sibling` for efficient range scans
- Pages split when insert fails due to space
- Cascading splits handled recursively up to root
- New root created when old root splits

**Usage:**
```rust
use featherdb_storage::BTree;

// Create new B-tree
let mut btree = BTree::new(buffer_pool.clone())?;

// Or open existing
let btree = BTree::open(root_page_id, buffer_pool.clone());

// Insert data
btree.insert(b"user:1", b"Alice")?;
btree.insert(b"user:2", b"Bob")?;

// Lookup
if let Some(value) = btree.get(b"user:1")? {
    println!("Found: {}", String::from_utf8_lossy(&value));
}

// Range scan
for result in btree.range(b"user:1", b"user:9")? {
    let (key, value) = result?;
    println!("{}: {}",
        String::from_utf8_lossy(&key),
        String::from_utf8_lossy(&value));
}

// Delete
if let Some(old) = btree.delete(b"user:1")? {
    println!("Deleted: {}", String::from_utf8_lossy(&old));
}
```

### 4. Write-Ahead Log (`wal.rs`)

Ensures durability through logging before data changes. Supports three sync modes including group commit for high throughput.

**Sync Modes:**

| Mode | Description | Durability | Performance |
|------|-------------|------------|-------------|
| `Immediate` | Sync on every commit | Highest | Slowest |
| `GroupCommit` | Batch commits, single fsync | High | 2-5x faster |
| `NoSync` | No sync (OS decides) | Risk of loss | Fastest |

**Record Types:**
- `Begin` - Transaction start
- `Commit` - Transaction committed
- `Abort` - Transaction aborted
- `Write` - Page modification (stores old and new data)
- `Checkpoint` - Checkpoint marker

**WAL Record Format:**
```
┌────────────────────────────────────────────────────────┐
│ LSN (8 bytes)                                          │
│ Transaction ID (8 bytes)                               │
│ Record Type (1 byte)                                   │
│ Previous LSN (8 bytes) - for undo chain               │
│ Page ID (8 bytes)                                      │
│ Old Data Length (4 bytes)                              │
│ New Data Length (4 bytes)                              │
│ Old Data (variable)                                    │
│ New Data (variable)                                    │
│ CRC32 Checksum (4 bytes)                               │
└────────────────────────────────────────────────────────┘
```
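
The fixed fields above sum to 45 bytes, so a record's serialized size follows directly from its payload lengths. A small sketch (hypothetical `wal_record_size` helper, derived only from the layout shown):

```rust
// Serialized size of a WAL record under the documented layout:
// fixed header/trailer fields plus the variable old/new payloads.
fn wal_record_size(old_len: usize, new_len: usize) -> usize {
    8   // LSN
  + 8   // Transaction ID
  + 1   // Record type
  + 8   // Previous LSN
  + 8   // Page ID
  + 4   // Old data length
  + 4   // New data length
  + old_len
  + new_len
  + 4   // CRC32 checksum
}
```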

**Group Commit Configuration:**
```rust
use featherdb_core::{Config, WalSyncMode, WalGroupCommitConfig};

let wal_config = WalGroupCommitConfig {
    sync_mode: WalSyncMode::GroupCommit,
    group_commit_interval_ms: 10,   // Flush every 10ms
    group_commit_max_batch: 100,    // Or when 100 records buffered
};

let config = Config::new("mydb.db")
    .with_wal_config(wal_config);
```

**Statistics:**
```rust
let stats = wal.stats();
println!("Total records: {}", stats.total_records());
println!("Group commits: {}", stats.group_commits());
println!("Average batch size: {:.1}", stats.average_batch_size());
println!("Fsync count: {}", stats.fsync_count());
```

### 5. Page Format (`page.rs`)

Fixed-size pages with slotted page structure for variable-length records.

**Page Types:**
- `Leaf` - B-tree leaf pages with key-value pairs
- `Branch` - B-tree internal nodes with separator keys and child pointers
- `Overflow` - For large values that don't fit in a single page
- `Free` - Available for allocation

**Page Layout (4KB default):**
```
┌──────────────────────────────────────────────────────────┐
│ Header (64 bytes)                                         │
│  ├─ page_id: u64 (8 bytes)                               │
│  ├─ page_type: u8 (1 byte)                               │
│  ├─ flags: u8 (dirty, encrypted, compressed)             │
│  ├─ version: u64 (MVCC version)                          │
│  ├─ checksum: u128 (XXH3-128)                            │
│  ├─ item_count: u16                                      │
│  ├─ free_start: u16 (slot array end)                     │
│  ├─ free_end: u16 (cell data start)                      │
│  ├─ parent: u64 (parent page ID)                         │
│  ├─ right_sibling: u64                                   │
│  └─ reserved: 6 bytes                                    │
├──────────────────────────────────────────────────────────┤
│ Slot Array (grows downward from header)                   │
│  └─ [offset:u16, len:u16, flags:u8, reserved:u8] (6B)   │
├──────────────────────────────────────────────────────────┤
│                    Free Space                             │
├──────────────────────────────────────────────────────────┤
│ Cell Data (grows upward from bottom)                      │
└──────────────────────────────────────────────────────────┘
```
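
The layout implies a simple fit test: a new cell needs its own bytes plus one 6-byte slot entry between `free_start` and `free_end`. A standalone sketch (hypothetical helper, not the crate's internal code):

```rust
// Free space in a slotted page: the slot array ends at `free_start`
// and cell data begins at `free_end`; a cell fits only if its bytes
// plus one new slot entry fit in between.
const SLOT_SIZE: u16 = 6;

fn cell_fits(free_start: u16, free_end: u16, cell_len: u16) -> bool {
    free_end.saturating_sub(free_start) >= cell_len.saturating_add(SLOT_SIZE)
}
```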

**Checksum:** XXH3-128 for fast, high-quality integrity checking.

### 6. Compression (`compression.rs`)

Adaptive page compression with two algorithms:

**Compression Types:**

| Type | Description | Use Case |
|------|-------------|----------|
| `None` | No compression | Encrypted data, random data |
| `Lz4` | Fast compression | Default, balanced |
| `Zstd { level }` | High ratio | Cold data, archival |

**ZSTD Levels:**
- Level 1-3: Fast compression
- Level 3 (default): Good balance
- Level 9: High compression
- Level 19-22: Maximum compression

**Compressed Page Header (12 bytes):**
```
┌──────────────────────────────────────────┐
│ Magic byte (0xC0)                        │
│ Compression type (0=none, 1=lz4, 2=zstd) │
│ Original size: u32                       │
│ Compressed size: u32                     │
│ Compression level: u8                    │
│ Reserved: u8                             │
└──────────────────────────────────────────┘
```

**Adaptive Behavior:**
- Pages smaller than the threshold (default 512 bytes) are stored uncompressed
- If compression does not achieve at least a 10% size reduction, the original data is stored instead
- Decompression auto-detects the format via the magic byte
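
The first two rules reduce to a small predicate. This is an illustrative sketch of the decision, not the crate's internal code; the 512-byte default and 10% cutoff come from the list above:

```rust
// Decide whether to store a page compressed: small pages skip
// compression entirely, and larger pages keep the original bytes
// unless compression saved at least 10%.
fn store_compressed(original_len: usize, compressed_len: usize, threshold: usize) -> bool {
    original_len >= threshold && compressed_len * 10 <= original_len * 9
}
```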

**Configuration:**
```rust
use featherdb_core::Config;

// LZ4 compression (fast)
let config = Config::new("mydb.db").with_lz4_compression();

// ZSTD compression (better ratio)
let config = Config::new("mydb.db").with_zstd_compression();

// ZSTD with custom level
let config = Config::new("mydb.db").with_zstd_compression_level(9);

// Custom threshold
let config = Config::new("mydb.db")
    .with_compression_threshold(1024);  // Only compress pages > 1KB
```

**Statistics:**
```rust
let stats = file_manager.compression_stats();
println!("Compression ratio: {:.1}%", stats.compression_ratio() * 100.0);
println!("Space savings: {:.1}%", stats.space_savings_percent());
println!("Pages compressed: {}", stats.pages_compressed.load(Ordering::Relaxed));
println!("Pages skipped: {}", stats.pages_skipped.load(Ordering::Relaxed));
```

### 7. File Manager (`file.rs`)

Handles all disk I/O with optional encryption and compression.

**Write Pipeline:** Data -> Compress -> Encrypt -> Disk
**Read Pipeline:** Disk -> Decrypt -> Decompress -> Data

Compression runs before encryption because ciphertext is effectively random and does not compress.

**Superblock (512 bytes):**
```
┌──────────────────────────────────────┐
│ Magic: "FeatherDB\0" (10 bytes)      │
│ Format version: u32                  │
│ Page size: u32                       │
│ Total pages: u64                     │
│ Free list head: u64                  │
│ Schema root: u64                     │
│ Last checkpoint LSN: u64             │
│ God byte: u8 (commit slot selector)  │
│ Encryption flags: u8                 │
│ Encryption salt: [u8; 32]            │
│ Reserved padding...                  │
└──────────────────────────────────────┘
```

### 8. Free List (`freelist.rs`)

Bitmap-based page allocation tracking.

**Features:**
- Bitmap stored in dedicated page
- Hint pointer for fast allocation (avoids scanning from start)
- Automatic bitmap extension when file grows
- Double-free detection
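
A standalone sketch of bitmap allocation with a hint pointer and double-free detection (hypothetical `FreeBitmap` type, not the crate's `FreeList` API):

```rust
// Bitmap free list: one bit per page (true = free), plus a hint index
// so allocation resumes where the last search left off instead of
// rescanning from the start.
struct FreeBitmap {
    bits: Vec<bool>,
    hint: usize,
}

impl FreeBitmap {
    fn new(pages: usize) -> Self {
        FreeBitmap { bits: vec![true; pages], hint: 0 }
    }

    /// Allocate the next free page, scanning from the hint and wrapping.
    fn allocate(&mut self) -> Option<usize> {
        let n = self.bits.len();
        for step in 0..n {
            let i = (self.hint + step) % n;
            if self.bits[i] {
                self.bits[i] = false;
                self.hint = (i + 1) % n;
                return Some(i);
            }
        }
        None // file is full; the real engine would extend the bitmap
    }

    /// Free a page; panics on double-free, mirroring the detection above.
    fn free(&mut self, i: usize) {
        assert!(!self.bits[i], "double free of page {}", i);
        self.bits[i] = true;
        self.hint = i; // a freshly freed page is an obvious next candidate
    }
}
```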

## Storage Limits

The storage engine supports configurable size limits to prevent runaway disk usage:

```rust
use featherdb_core::{Config, StorageLimitsConfig};

let config = Config::new("mydb.db")
    .with_max_database_size_mb(100)    // 100MB database limit
    .with_storage_limits(StorageLimitsConfig::new()
        .with_max_database_size(100 * 1024 * 1024)
        .with_max_wal_size(10 * 1024 * 1024)
        .with_safety_margin_percent(5)  // Reject at 95% capacity
    );

let storage = StorageEngine::open(config)?;

// Monitor storage quota
let quota = storage.storage_quota();
println!("Database: {} / {} bytes", quota.used_bytes, quota.limit_bytes.unwrap_or(0));
println!("WAL: {} / {} bytes", quota.wal_used_bytes, quota.wal_limit_bytes.unwrap_or(0));

if let Some(pct) = quota.usage_percent() {
    println!("Usage: {:.1}%", pct);
}

// Check if limit exceeded
if quota.is_exceeded() {
    eprintln!("Database size limit exceeded!");
}
```

**StorageQuota API:**

```rust
pub struct StorageQuota {
    pub used_bytes: u64,
    pub limit_bytes: Option<u64>,
    pub wal_used_bytes: u64,
    pub wal_limit_bytes: Option<u64>,
    pub disk_available_bytes: u64,
}

impl StorageQuota {
    pub fn remaining(&self) -> Option<u64>;
    pub fn usage_percent(&self) -> Option<f64>;
    pub fn is_exceeded(&self) -> bool;
    pub fn is_wal_exceeded(&self) -> bool;
}
```

## Configuration Options

```rust
use featherdb_core::{Config, EvictionPolicyType, WalSyncMode, WalGroupCommitConfig};

let config = Config::new("mydb.db")
    // Page settings
    .with_page_size(4096)              // Page size in bytes

    // Buffer pool settings
    .with_buffer_pool_pages(16384)     // 64MB with 4KB pages
    .with_eviction_policy(EvictionPolicyType::Lirs)

    // WAL settings
    .with_wal_config(WalGroupCommitConfig {
        sync_mode: WalSyncMode::GroupCommit,
        group_commit_interval_ms: 10,
        group_commit_max_batch: 100,
    })

    // Storage limits
    .with_max_database_size_mb(100)    // 100MB limit

    // Compression
    .with_lz4_compression()
    // or: .with_zstd_compression_level(3)

    // Encryption
    .with_password("secret_password")
    // or: .with_encryption_key(key_bytes)

    // File options
    .with_create_if_missing(true);
```

## Complete Usage Example

```rust
use featherdb_storage::{StorageEngine, BTree, BufferPool};
use featherdb_core::{Config, EvictionPolicyType, WalSyncMode, WalGroupCommitConfig};

// Configure storage
let config = Config::new("mydb.db")
    .with_buffer_pool_pages(1000)
    .with_eviction_policy(EvictionPolicyType::Lru2)
    .with_wal_config(WalGroupCommitConfig {
        sync_mode: WalSyncMode::GroupCommit,
        group_commit_interval_ms: 10,
        group_commit_max_batch: 50,
    })
    .with_lz4_compression();

// Open storage engine
let storage = StorageEngine::open(config)?;

// Create a B-tree index
let mut btree = BTree::new(storage.buffer_pool().clone())?;

// Insert data
for i in 0..1000 {
    let key = format!("key:{:05}", i);
    let value = format!("value:{}", i);
    btree.insert(key.as_bytes(), value.as_bytes())?;
}

// Point lookup
if let Some(value) = btree.get(b"key:00042")? {
    println!("Found: {}", String::from_utf8_lossy(&value));
}

// Range scan
println!("Keys 100-110:");
for result in btree.range(b"key:00100", b"key:00110")? {
    let (key, value) = result?;
    println!("  {} = {}",
        String::from_utf8_lossy(&key),
        String::from_utf8_lossy(&value));
}

// Check buffer pool stats
let stats = storage.buffer_pool().stats_snapshot();
println!("Buffer pool hit ratio: {:.2}%", stats.hit_ratio() * 100.0);

// Checkpoint for durability
let checkpoint_lsn = storage.checkpoint()?;
println!("Checkpoint at LSN: {:?}", checkpoint_lsn);
```

## Performance Characteristics

### Buffer Pool Sizing
- **Rule of thumb:** 10-20% of expected working set
- **Minimum:** At least enough for B-tree depth + active pages
- **Default:** 64MB (16384 pages at 4KB)
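
Converting a memory budget into a page count is simple division; the stated default works out as follows (illustrative helper, assuming the default 4 KiB page size):

```rust
// Pages that fit a given memory budget at a fixed page size.
fn pool_pages(budget_bytes: u64, page_size_bytes: u64) -> u64 {
    budget_bytes / page_size_bytes
}
```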

### Eviction Policy Selection
| Workload | Recommended Policy |
|----------|-------------------|
| General OLTP | Clock (low overhead) |
| Sequential scans mixed with point queries | LRU-2 |
| Complex workloads, adaptive | LIRS |

### WAL Performance
| Mode | Commits/sec (typical) | Data Safety |
|------|----------------------|-------------|
| Immediate | 100-500 | Fully durable |
| GroupCommit | 5,000-20,000 | Durable (batched) |
| NoSync | 50,000+ | Risk of loss on crash |

### Compression Trade-offs
| Algorithm | Speed | Ratio | Best For |
|-----------|-------|-------|----------|
| None | N/A | 1:1 | Encrypted/random data |
| LZ4 | Very Fast | 2-3x | General use |
| ZSTD(3) | Fast | 3-4x | Balanced |
| ZSTD(9+) | Slow | 4-6x | Cold data, archival |

## File Format

```
Offset 0-511:      Superblock
Offset 512:        Padding to page boundary
Offset 4096:       Page 0 (reserved)
Offset 8192:       Page 1 (Schema catalog root)
Offset 12288:      Page 2 (Free list bitmap)
Offset 16384+:     Data pages (B-tree nodes, etc.)
```
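
Under this layout, a page's byte offset is its ID shifted by one page-sized slot, since the superblock and its padding occupy the first region (illustrative helper; assumes the default 4 KiB page size shown above):

```rust
// Byte offset of page `page_id`: the superblock region occupies the
// first page-sized slot, so page N starts at (N + 1) * page_size.
fn page_offset(page_id: u64, page_size: u64) -> u64 {
    (page_id + 1) * page_size
}
```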

Note: When encryption is enabled, each page on disk includes additional overhead for the authentication tag.

## Testing

```bash
# Run all storage tests (using Make)
make test-crate CRATE=featherdb-storage

# Or with cargo directly
cargo test -p featherdb-storage

# Run with logging
RUST_LOG=debug cargo test -p featherdb-storage

# Run specific test
cargo test -p featherdb-storage test_btree_many_inserts

# Run benchmarks
cargo bench -p featherdb-storage

# Run coverage (from project root)
make coverage  # or: cargo llvm-cov --workspace
```

## Module Exports

The crate exports the following public items:

```rust
// Core types
pub use page::{Page, PageType, SlotEntry};
pub use btree::BTree;
pub use buffer_pool::{BufferPool, BufferPoolStats, BufferPoolStatsSnapshot, PageGuard};
pub use eviction::{
    EvictionPolicy, EvictionPolicyType, EvictionPolicyFactory,
    ClockEviction, Lru2Eviction, LirsEviction
};
pub use wal::{Wal, WalRecord, WalRecordType, WalStats};
pub use file::FileManager;
pub use freelist::FreeList;
pub use compression::{
    CompressionType, CompressionStats, Compressor, PageCompressor,
    CompressedPageHeader, create_compressor,
};
```