FeatherDB Storage Engine
The storage engine is the foundation of FeatherDB, providing durable, crash-safe storage with efficient page-based access patterns, multiple eviction policies, group commit WAL, and adaptive page compression.
Architecture Overview
┌─────────────────────────────────────────┐
│ StorageEngine │
│ (Unified interface for storage ops) │
└───────────────────┬─────────────────────┘
│
┌─────────────────────────────┼─────────────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│ Buffer Pool │ │ B-Tree │ │ WAL │
│ (Page Caching) │ │ (Indexes) │ │ (Durability) │
│ │ │ │ │ │
│ Eviction Policies: │ │ - Insert/Get/Delete │ │ Sync Modes: │
│ - Clock (default) │ │ - Range iteration │ │ - Immediate │
│ - LRU-2 │ │ - Page splitting │ │ - GroupCommit │
│ - LIRS │ │ - Leaf linking │ │ - NoSync │
└──────────┬──────────┘ └──────────┬──────────┘ └──────────┬──────────┘
│ │ │
└───────────────────────────┼───────────────────────────┘
▼
┌─────────────────────────────────────────┐
│ FileManager │
│ │
│ ┌─────────────┐ ┌───────────────┐ │
│ │ Compression │ │ Encryption │ │
│ │ (LZ4/ZSTD) │ │ (AES-256-GCM) │ │
│ └─────────────┘ └───────────────┘ │
│ │
│ Disk I/O │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ FreeList │
│ (Bitmap page allocation) │
└─────────────────────────────────────────┘
Key Components
1. Buffer Pool (buffer_pool.rs)
Page cache with pluggable eviction algorithms. The buffer pool manages an in-memory cache of database pages, reducing disk I/O for frequently accessed data.
Key Features:
- Configurable capacity (number of pages)
- Multiple eviction policies (see below)
- RAII-based page guards for safe access
- Pin counting to prevent eviction of in-use pages
- Statistics tracking (hits, misses, evictions, flushes)
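The pin/unpin lifecycle is easiest to see as RAII. The sketch below is illustrative (the `Frame` and `Guard` types are stand-ins, not the actual `BufferFrame`/`PageGuard`), but the mechanism is the same: acquiring a guard increments a counter that the evictor checks, and dropping the guard decrements it.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Hypothetical stand-ins for BufferFrame/PageGuard to show the RAII idea.
struct Frame {
    pin_count: AtomicUsize,
}

struct Guard {
    frame: Arc<Frame>,
}

impl Guard {
    fn pin(frame: Arc<Frame>) -> Guard {
        // Pinning increments the count; the evictor skips frames with pins > 0.
        frame.pin_count.fetch_add(1, Ordering::SeqCst);
        Guard { frame }
    }
}

impl Drop for Guard {
    fn drop(&mut self) {
        // Dropping the guard unpins the page automatically.
        self.frame.pin_count.fetch_sub(1, Ordering::SeqCst);
    }
}

fn pins(frame: &Frame) -> usize {
    frame.pin_count.load(Ordering::SeqCst)
}

fn main() {
    let frame = Arc::new(Frame { pin_count: AtomicUsize::new(0) });
    {
        let _g = Guard::pin(frame.clone());
        assert_eq!(pins(&frame), 1); // pinned while the guard lives
    }
    assert_eq!(pins(&frame), 0); // unpinned once the guard drops
}
```

Because unpinning happens in `Drop`, a page can never be evicted while any code path still holds a guard to it, even on early returns or panics.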
Structures:
- `BufferPool` - Main cache manager
- `BufferFrame` - Individual cached page with metadata (pin count, dirty flag, reference bit)
- `PageGuard` - Read-only RAII handle for pinned pages
- `PageGuardMut` - Mutable RAII handle for pinned pages
- `BufferPoolStats` / `BufferPoolStatsSnapshot` - Performance metrics
Usage:
```rust
use featherdb_storage::{BufferPool, EvictionPolicyType}; // import path illustrative

// Create with a specific eviction policy (arguments illustrative; capacity in pages)
let pool = BufferPool::with_policy(1024, EvictionPolicyType::Lru2);

// Allocate a new page
let (page_id, _guard) = pool.allocate_page()?;

// Get an existing page for reading
let guard = pool.get_page(page_id)?;
let page = guard.read();

// Get an existing page for writing
let guard = pool.get_page_for_write(page_id)?;
let mut page = guard.write();

// Check statistics (field names illustrative)
let stats = pool.stats_snapshot();
println!("hits: {}, misses: {}", stats.hits, stats.misses);
```
2. Eviction Policies (eviction.rs)
Three eviction algorithms are implemented, selectable at buffer pool creation time:
Clock (Default)
Simple second-chance algorithm. Each page has a reference bit that is set on access. The clock hand sweeps through pages, evicting the first page with reference bit = 0, clearing bits as it goes.
- Pros: Low overhead, O(1) average case
- Cons: Less accurate than LRU for some workloads
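The sweep can be sketched in a few lines. This is illustrative only; the real implementation in eviction.rs also has to skip pinned frames.

```rust
// Minimal second-chance (clock) sweep over a fixed frame array.
struct ClockState {
    ref_bits: Vec<bool>,
    hand: usize,
}

impl ClockState {
    fn new(n: usize) -> Self {
        ClockState { ref_bits: vec![false; n], hand: 0 }
    }

    // Called on every page access: grant a "second chance".
    fn touch(&mut self, frame: usize) {
        self.ref_bits[frame] = true;
    }

    // Sweep until a frame with a cleared reference bit is found,
    // clearing set bits along the way.
    fn pick_victim(&mut self) -> usize {
        loop {
            if self.ref_bits[self.hand] {
                self.ref_bits[self.hand] = false;
                self.hand = (self.hand + 1) % self.ref_bits.len();
            } else {
                let victim = self.hand;
                self.hand = (self.hand + 1) % self.ref_bits.len();
                return victim;
            }
        }
    }
}

fn main() {
    let mut clock = ClockState::new(4);
    clock.touch(0);
    clock.touch(1);
    // Frames 0 and 1 were referenced, so frame 2 is the first victim.
    assert_eq!(clock.pick_victim(), 2);
}
```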
LRU-2
Tracks the last 2 access times for each page. Pages are evicted based on the oldest second-to-last access time.
- Pros: Excellent scan resistance - pages accessed only once (like during a sequential scan) are evicted before frequently accessed pages
- Cons: Higher memory overhead per page
- Correlated Reference Period: Accesses within 10ms are considered correlated and don't count as separate accesses
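A simplified model of the bookkeeping, using plain integer millisecond timestamps. The `Lru2` type and its tuple ordering are assumptions for illustration, not the eviction.rs structures.

```rust
use std::collections::HashMap;

const CORRELATED_PERIOD_MS: u64 = 10;

#[derive(Default)]
struct Lru2 {
    // page -> (last access, second-to-last access)
    history: HashMap<u64, (u64, Option<u64>)>,
}

impl Lru2 {
    fn access(&mut self, page: u64, now_ms: u64) {
        let entry = self.history.entry(page).or_insert((now_ms, None));
        if now_ms.saturating_sub(entry.0) < CORRELATED_PERIOD_MS {
            // Within the correlated period: treated as the same access.
            entry.0 = now_ms;
        } else {
            entry.1 = Some(entry.0);
            entry.0 = now_ms;
        }
    }

    // Evict the page with the oldest second-to-last access. Pages seen
    // only once (None) sort before any page seen twice, which is what
    // makes the policy scan-resistant.
    fn victim(&self) -> Option<u64> {
        self.history
            .iter()
            .min_by_key(|(_, (last, second))| (second.is_some(), (*second).unwrap_or(0), *last))
            .map(|(page, _)| *page)
    }
}

fn main() {
    let mut lru = Lru2::default();
    lru.access(1, 0);
    lru.access(1, 100); // page 1 genuinely accessed twice
    lru.access(2, 150); // page 2 touched once by a scan
    assert_eq!(lru.victim(), Some(2)); // the scanned page goes first
}
```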
LIRS (Low Inter-reference Recency Set)
Separates pages into two sets based on inter-reference recency:
- LIR set (~99%): Pages with low inter-reference recency (hot, frequently accessed)
- HIR set (~1%): Pages with high inter-reference recency (cold)
Only HIR resident pages can be evicted. When a HIR page is accessed twice within the LIR set's recency window, it becomes LIR.
- Pros: Best overall performance, adapts to workload changes
- Cons: Most complex implementation, highest memory overhead
Configuration:
```rust
use featherdb_storage::{Config, EvictionPolicyType}; // import path illustrative

let config = Config::new()
    .with_eviction_policy(EvictionPolicyType::Lirs);
```
3. B-Tree (btree.rs)
B+ tree implementation for ordered key-value storage. Leaf pages are linked for efficient range scans.
Operations:
- `insert(key, value)` - Insert or update a key-value pair
- `get(key)` - Point lookup, returns `Option<Vec<u8>>`
- `delete(key)` - Remove entry, returns old value if present
- `iter()` - Iterate over all entries
- `range(start, end)` - Iterate over a key range
Cell Encoding:
- Leaf cells: `[key_len:u16][value_len:u32][key][value]`
- Branch cells: `[child_id:u64][key_len:u16][key]`
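The leaf-cell layout above can be encoded and decoded directly. This sketch assumes little-endian length fields, which should be checked against page.rs:

```rust
// [key_len:u16][value_len:u32][key][value], little-endian lengths (assumed).
fn encode_leaf_cell(key: &[u8], value: &[u8]) -> Vec<u8> {
    let mut cell = Vec::with_capacity(6 + key.len() + value.len());
    cell.extend_from_slice(&(key.len() as u16).to_le_bytes());
    cell.extend_from_slice(&(value.len() as u32).to_le_bytes());
    cell.extend_from_slice(key);
    cell.extend_from_slice(value);
    cell
}

fn decode_leaf_cell(cell: &[u8]) -> (&[u8], &[u8]) {
    let key_len = u16::from_le_bytes([cell[0], cell[1]]) as usize;
    let value_len =
        u32::from_le_bytes([cell[2], cell[3], cell[4], cell[5]]) as usize;
    let key = &cell[6..6 + key_len];
    let value = &cell[6 + key_len..6 + key_len + value_len];
    (key, value)
}

fn main() {
    let cell = encode_leaf_cell(b"user:42", b"alice");
    assert_eq!(cell.len(), 6 + 7 + 5); // 6 header bytes + key + value
    let (k, v) = decode_leaf_cell(&cell);
    assert_eq!(k, &b"user:42"[..]);
    assert_eq!(v, &b"alice"[..]);
}
```

The u16/u32 split mirrors the trade-off in the format: keys are short, values may be large.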
Key Design Decisions:
- Variable-length keys and values
- Leaf pages linked via `right_sibling` for efficient range scans
- Pages split when an insert fails for lack of space
- Cascading splits handled recursively up to the root
- New root created when the old root splits
Usage:
```rust
use featherdb_storage::BTree; // import path illustrative

// Create a new B-tree (constructor arguments illustrative)
let mut btree = BTree::new(&storage)?;

// Or open an existing one by its root page
let btree = BTree::open(&storage, root_page_id);

// Insert data
btree.insert(b"alice", b"value-1")?;
btree.insert(b"bob", b"value-2")?;

// Point lookup
if let Some(value) = btree.get(b"alice")? {
    println!("found {} bytes", value.len());
}

// Range scan
for result in btree.range(b"a", b"c")? {
    let (key, value) = result?;
    println!("{:?} -> {} bytes", key, value.len());
}

// Delete, returning the old value if present
if let Some(old) = btree.delete(b"alice")? {
    println!("removed {} bytes", old.len());
}
```
4. Write-Ahead Log (wal.rs)
Ensures durability through logging before data changes. Supports three sync modes including group commit for high throughput.
Sync Modes:
| Mode | Description | Durability | Performance |
|---|---|---|---|
| `Immediate` | Sync on every commit | Highest | Slowest |
| `GroupCommit` | Batch commits, single fsync | High | 2-5x faster |
| `NoSync` | No sync (OS decides) | Risk of loss | Fastest |
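To see why group commit wins, note that throughput is bounded by fsync count rather than commit count. A toy model (numbers purely illustrative):

```rust
// Each full or partial batch of commits costs exactly one fsync.
// Assumes max_batch > 0.
fn fsyncs_needed(commits: u64, max_batch: u64) -> u64 {
    (commits + max_batch - 1) / max_batch // ceiling division
}

fn main() {
    assert_eq!(fsyncs_needed(1000, 1), 1000); // Immediate: fsync per commit
    assert_eq!(fsyncs_needed(1000, 32), 32);  // GroupCommit: ~32x fewer syncs
}
```

Since a single fsync often costs milliseconds, cutting their number is what produces the 2-5x commit-rate improvement the table claims.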
Record Types:
- `Begin` - Transaction start
- `Commit` - Transaction committed
- `Abort` - Transaction aborted
- `Write` - Page modification (stores old and new data)
- `Checkpoint` - Checkpoint marker
WAL Record Format:
┌────────────────────────────────────────────────────────┐
│ LSN (8 bytes) │
│ Transaction ID (8 bytes) │
│ Record Type (1 byte) │
│ Previous LSN (8 bytes) - for undo chain │
│ Page ID (8 bytes) │
│ Old Data Length (4 bytes) │
│ New Data Length (4 bytes) │
│ Old Data (variable) │
│ New Data (variable) │
│ CRC32 Checksum (4 bytes) │
└────────────────────────────────────────────────────────┘
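The framing above can be sketched end to end. Little-endian field order and the bitwise IEEE CRC32 are assumptions here, to be checked against wal.rs:

```rust
// Bitwise CRC32 (IEEE polynomial, reflected), no lookup table.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

fn encode_record(
    lsn: u64, txn_id: u64, record_type: u8, prev_lsn: u64, page_id: u64,
    old_data: &[u8], new_data: &[u8],
) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(&lsn.to_le_bytes());
    buf.extend_from_slice(&txn_id.to_le_bytes());
    buf.push(record_type);
    buf.extend_from_slice(&prev_lsn.to_le_bytes());
    buf.extend_from_slice(&page_id.to_le_bytes());
    buf.extend_from_slice(&(old_data.len() as u32).to_le_bytes());
    buf.extend_from_slice(&(new_data.len() as u32).to_le_bytes());
    buf.extend_from_slice(old_data);
    buf.extend_from_slice(new_data);
    // The checksum covers everything before it, so a torn write at the
    // tail of the log is detected during recovery.
    let crc = crc32(&buf);
    buf.extend_from_slice(&crc.to_le_bytes());
    buf
}

fn main() {
    // Standard CRC32 check value for "123456789".
    assert_eq!(crc32(b"123456789"), 0xCBF4_3926);
    let rec = encode_record(1, 7, 3, 0, 42, b"old", b"new!");
    // 41-byte fixed prefix + 3 + 4 payload bytes + 4-byte CRC.
    assert_eq!(rec.len(), 41 + 3 + 4 + 4);
}
```

Storing both old and new data in `Write` records is what allows undo (via the previous-LSN chain) as well as redo during recovery.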
Group Commit Configuration:
```rust
use featherdb_storage::{Config, WalGroupCommitConfig}; // import path illustrative

// Field names are illustrative; see wal.rs for the actual options
let wal_config = WalGroupCommitConfig {
    max_batch: 64,
    max_wait_us: 1000,
    ..Default::default()
};

let config = Config::new()
    .with_wal_config(wal_config);
```
Statistics:
```rust
// Field names illustrative
let stats = wal.stats();
println!("records written: {}", stats.records_written);
println!("fsyncs: {}", stats.syncs);
println!("group commit batches: {}", stats.batches);
println!("avg batch size: {:.1}", stats.avg_batch_size);
```
5. Page Format (page.rs)
Fixed-size pages with slotted page structure for variable-length records.
Page Types:
- `Leaf` - B-tree leaf pages with key-value pairs
- `Branch` - B-tree internal nodes with separator keys and child pointers
- `Overflow` - For large values that don't fit in a single page
- `Free` - Available for allocation
Page Layout (4KB default):
┌──────────────────────────────────────────────────────────┐
│ Header (64 bytes) │
│ ├─ page_id: u64 (8 bytes) │
│ ├─ page_type: u8 (1 byte) │
│ ├─ flags: u8 (dirty, encrypted, compressed) │
│ ├─ version: u64 (MVCC version) │
│ ├─ checksum: u128 (XXH3-128) │
│ ├─ item_count: u16 │
│ ├─ free_start: u16 (slot array end) │
│ ├─ free_end: u16 (cell data start) │
│ ├─ parent: u64 (parent page ID) │
│ ├─ right_sibling: u64 │
│ └─ reserved: 8 bytes │
├──────────────────────────────────────────────────────────┤
│ Slot Array (grows downward from header) │
│ └─ [offset:u16, len:u16, flags:u8, reserved:u8] (6B) │
├──────────────────────────────────────────────────────────┤
│ Free Space │
├──────────────────────────────────────────────────────────┤
│ Cell Data (grows upward from bottom) │
└──────────────────────────────────────────────────────────┘
Checksum: XXH3-128 for fast, high-quality integrity checking.
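Free-space accounting in this layout is a subtraction: the usable gap sits between the end of the slot array (`free_start`) and the start of the cell area (`free_end`). A sketch using the default 4KB page and 64-byte header:

```rust
const PAGE_SIZE: u16 = 4096;
const HEADER_SIZE: u16 = 64;
const SLOT_SIZE: u16 = 6; // [offset:u16][len:u16][flags:u8][reserved:u8]

fn free_space(item_count: u16, cell_bytes: u16) -> u16 {
    // Slot array grows downward from the header...
    let free_start = HEADER_SIZE + item_count * SLOT_SIZE;
    // ...cell data grows upward from the bottom of the page.
    let free_end = PAGE_SIZE - cell_bytes;
    free_end - free_start
}

fn fits(item_count: u16, cell_bytes: u16, new_cell: u16) -> bool {
    // A new insert needs room for the cell plus one 6-byte slot.
    free_space(item_count, cell_bytes) >= new_cell + SLOT_SIZE
}

fn main() {
    assert_eq!(free_space(0, 0), 4032);    // empty page: 4096 - 64
    assert_eq!(free_space(10, 500), 3472); // 4096 - 500 - 64 - 10*6
    assert!(fits(10, 500, 100));
}
```

When `fits` returns false, the B-tree layer responds by splitting the page, as described in the B-tree section.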
6. Compression (compression.rs)
Adaptive page compression with two algorithms:
Compression Types:
| Type | Description | Use Case |
|---|---|---|
| `None` | No compression | Encrypted data, random data |
| `Lz4` | Fast compression | Default, balanced |
| `Zstd { level }` | High ratio | Cold data, archival |
ZSTD Levels:
- Level 1-3: Fast compression
- Level 3 (default): Good balance
- Level 9: High compression
- Level 19-22: Maximum compression
Compressed Page Header (12 bytes):
┌──────────────────────────────────────┐
│ Magic byte (0xC0) │
│ Compression type (0=none, 1=lz4, 2=zstd) │
│ Original size: u32 │
│ Compressed size: u32 │
│ Compression level: u8 │
│ Reserved: u8 │
└──────────────────────────────────────┘
Adaptive Behavior:
- Pages smaller than threshold (default 512 bytes) are stored uncompressed
- If compression doesn't achieve at least 10% reduction, original data is stored
- Decompression auto-detects format via magic byte
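The resulting decision rule can be expressed as a small predicate. This is a sketch of the documented rules using the stated defaults, not the compression.rs code itself:

```rust
// Skip pages below the size threshold, and keep the original bytes
// unless compression saves at least 10%.
const MIN_SIZE: usize = 512; // documented default threshold

fn should_store_compressed(original_len: usize, compressed_len: usize) -> bool {
    if original_len < MIN_SIZE {
        return false; // too small to be worth the header + CPU cost
    }
    // Integer form of: compressed_len <= 0.9 * original_len
    compressed_len * 10 <= original_len * 9
}

fn main() {
    assert!(!should_store_compressed(400, 200));   // below threshold
    assert!(should_store_compressed(4096, 2048));  // 50% saving
    assert!(!should_store_compressed(4096, 4000)); // < 10% saving
}
```

The 10% floor matters for already-compressed or encrypted pages, which are close to incompressible; storing them raw avoids paying decompression cost for no space win.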
Configuration:
```rust
use featherdb_storage::Config; // import path illustrative

// LZ4 compression (fast)
let config = Config::new().with_lz4_compression();

// ZSTD compression (better ratio)
let config = Config::new().with_zstd_compression();

// ZSTD with a custom level
let config = Config::new().with_zstd_compression_level(9);

// Custom threshold
let config = Config::new()
    .with_compression_threshold(1024); // Only compress pages > 1KB
```
Statistics:
```rust
// Field names illustrative
let stats = file_manager.compression_stats();
println!("pages compressed: {}", stats.pages_compressed);
println!("pages stored raw: {}", stats.pages_uncompressed);
println!("bytes in: {}", stats.bytes_in);
println!("bytes out: {}", stats.bytes_out);
```
7. File Manager (file.rs)
Handles all disk I/O with optional encryption and compression.
Write pipeline: Data -> Compress -> Encrypt -> Disk

Read pipeline: Disk -> Decrypt -> Decompress -> Data
Superblock (512 bytes):
┌──────────────────────────────────────┐
│ Magic: "FeatherDB\0" (10 bytes) │
│ Format version: u32 │
│ Page size: u32 │
│ Total pages: u64 │
│ Free list head: u64 │
│ Schema root: u64 │
│ Last checkpoint LSN: u64 │
│ God byte: u8 (commit slot selector) │
│ Encryption flags: u8 │
│ Encryption salt: [u8; 32] │
│ Reserved padding... │
└──────────────────────────────────────┘
8. Free List (freelist.rs)
Bitmap-based page allocation tracking.
Features:
- Bitmap stored in dedicated page
- Hint pointer for fast allocation (avoids scanning from start)
- Automatic bitmap extension when file grows
- Double-free detection
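Those features combine into a compact allocator. The following is a hypothetical in-memory sketch; the real freelist.rs persists the bitmap in a dedicated page:

```rust
struct FreeBitmap {
    bits: Vec<u8>,
    hint: usize, // first page index worth scanning from
}

impl FreeBitmap {
    fn new(pages: usize) -> Self {
        FreeBitmap { bits: vec![0; (pages + 7) / 8], hint: 0 }
    }

    fn is_set(&self, i: usize) -> bool {
        self.bits[i / 8] & (1 << (i % 8)) != 0
    }

    fn allocate(&mut self) -> Option<usize> {
        let total = self.bits.len() * 8;
        // The hint avoids rescanning the densely allocated prefix.
        for i in self.hint..total {
            if !self.is_set(i) {
                self.bits[i / 8] |= 1 << (i % 8);
                self.hint = i + 1;
                return Some(i);
            }
        }
        None
    }

    fn free(&mut self, i: usize) {
        // Double-free detection: the bit must currently be set.
        assert!(self.is_set(i), "double free of page {}", i);
        self.bits[i / 8] &= !(1 << (i % 8));
        self.hint = self.hint.min(i); // freed pages may precede the hint
    }
}

fn main() {
    let mut fl = FreeBitmap::new(16);
    assert_eq!(fl.allocate(), Some(0));
    assert_eq!(fl.allocate(), Some(1));
    fl.free(0);
    assert_eq!(fl.allocate(), Some(0)); // hint rewound to the freed page
}
```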
Storage Limits
The storage engine supports configurable size limits to prevent runaway disk usage:
```rust
use featherdb_storage::{Config, StorageEngine}; // import path illustrative

let config = Config::new()
    .with_max_database_size_mb(100) // 100MB database limit
    .with_storage_limits(true);

let storage = StorageEngine::open("data.fdb", config)?;

// Monitor storage quota (field names illustrative)
let quota = storage.storage_quota();
println!("used: {} bytes", quota.used_bytes);
println!("limit: {:?} bytes", quota.limit_bytes);

if let Some(pct) = quota.usage_percent() {
    println!("usage: {:.1}%", pct);
}

// Check whether the limit has been exceeded
if quota.is_exceeded() {
    // reject or defer further writes
}
```
StorageQuota API: `usage_percent()` returns `None` when no limit is configured, and `is_exceeded()` reports whether the configured limit has been reached.
Configuration Options
```rust
use featherdb_storage::{Config, EvictionPolicyType, WalGroupCommitConfig}; // paths illustrative

let config = Config::new()
    // Page settings
    .with_page_size(4096)                    // Page size in bytes
    // Buffer pool settings
    .with_buffer_pool_pages(16384)           // 64MB with 4KB pages
    .with_eviction_policy(EvictionPolicyType::Clock)
    // WAL settings
    .with_wal_config(WalGroupCommitConfig::default())
    // Storage limits
    .with_max_database_size_mb(100)          // 100MB limit
    // Compression
    .with_lz4_compression()
    // or: .with_zstd_compression_level(3)
    // Encryption
    .with_password("secret")
    // or: .with_encryption_key(key_bytes)
    // File options
    .with_create_if_missing(true);
```
Complete Usage Example
```rust
use featherdb_storage::{BTree, Config, EvictionPolicyType, StorageEngine, WalGroupCommitConfig}; // paths illustrative

// Configure storage
let config = Config::new()
    .with_buffer_pool_pages(16384)
    .with_eviction_policy(EvictionPolicyType::Lirs)
    .with_wal_config(WalGroupCommitConfig::default())
    .with_lz4_compression();

// Open storage engine
let storage = StorageEngine::open("data.fdb", config)?;

// Create a B-tree index (constructor arguments illustrative)
let mut btree = BTree::new(&storage)?;

// Insert data
for i in 0..1000 {
    let key = format!("key{:04}", i);
    let value = format!("value{}", i);
    btree.insert(key.as_bytes(), value.as_bytes())?;
}

// Point lookup
if let Some(value) = btree.get(b"key0042")? {
    println!("found {} bytes", value.len());
}

// Range scan
println!("keys in [key0100, key0200):");
for result in btree.range(b"key0100", b"key0200")? {
    let (key, _value) = result?;
    println!("{}", String::from_utf8_lossy(&key));
}

// Check buffer pool stats
let stats = storage.buffer_pool.stats_snapshot();
println!("hits: {}, misses: {}", stats.hits, stats.misses);

// Checkpoint for durability
let checkpoint_lsn = storage.checkpoint()?;
println!("checkpointed at LSN {}", checkpoint_lsn);
```
Performance Characteristics
Buffer Pool Sizing
- Rule of thumb: 10-20% of expected working set
- Minimum: At least enough for B-tree depth + active pages
- Default: 64MB (16384 pages at 4KB)
Eviction Policy Selection
| Workload | Recommended Policy |
|---|---|
| General OLTP | Clock (low overhead) |
| Sequential scans mixed with point queries | LRU-2 |
| Complex workloads, adaptive | LIRS |
WAL Performance
| Mode | Commits/sec (typical) | Data Safety |
|---|---|---|
| Immediate | 100-500 | Fully durable |
| GroupCommit | 5,000-20,000 | Durable (batched) |
| NoSync | 50,000+ | Risk of loss on crash |
Compression Trade-offs
| Algorithm | Speed | Ratio | Best For |
|---|---|---|---|
| None | N/A | 1:1 | Encrypted/random data |
| LZ4 | Very Fast | 2-3x | General use |
| ZSTD(3) | Fast | 3-4x | Balanced |
| ZSTD(9+) | Slow | 4-6x | Cold data, archival |
File Format
Offset 0-511: Superblock
Offset 512: Padding to page boundary
Offset 4096: Page 0 (reserved)
Offset 8192: Page 1 (Schema catalog root)
Offset 12288: Page 2 (Free list bitmap)
Offset 16384+: Data pages (B-tree nodes, etc.)
Note: When encryption is enabled, each page on disk includes additional overhead for the authentication tag.
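The table implies a simple page-to-offset mapping, since the superblock region occupies the first page-sized slot. The formula below is inferred from the offsets above, not quoted from the code:

```rust
// Page N begins at (N + 1) * page_size because offsets 0..page_size
// hold the superblock plus its padding.
fn page_offset(page_id: u64, page_size: u64) -> u64 {
    (page_id + 1) * page_size
}

fn main() {
    assert_eq!(page_offset(0, 4096), 4096);  // Page 0 (reserved)
    assert_eq!(page_offset(1, 4096), 8192);  // Schema catalog root
    assert_eq!(page_offset(2, 4096), 12288); // Free list bitmap
}
```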
Testing
The exact targets and package names depend on the workspace layout; the commands below are typical:

```sh
# Run all storage tests (using Make)
make test-storage

# Or with cargo directly
cargo test -p featherdb-storage

# Run with logging
RUST_LOG=debug cargo test -p featherdb-storage -- --nocapture

# Run a specific test
cargo test -p featherdb-storage btree

# Run benchmarks
cargo bench -p featherdb-storage

# Run coverage (from project root)
cargo llvm-cov -p featherdb-storage
```
Module Exports
The crate exports the following public items:
```rust
// Module paths and grouping are illustrative; the type names are taken
// from the component descriptions above.

// Core types
pub use buffer_pool::{BufferPool, BufferPoolStats, BufferPoolStatsSnapshot, PageGuard, PageGuardMut};
pub use btree::BTree;
pub use eviction::EvictionPolicyType;
pub use page::{Page, PageType};
pub use wal::{Wal, WalGroupCommitConfig};
pub use file::FileManager;
pub use freelist::FreeList;
pub use config::Config;
```