# FeatherDB Storage Engine
The storage engine is the foundation of FeatherDB, providing durable, crash-safe storage with efficient page-based access patterns, multiple eviction policies, group commit WAL, and adaptive page compression.
## Architecture Overview
```
               ┌─────────────────────────────────────────┐
               │              StorageEngine              │
               │   (Unified interface for storage ops)   │
               └────────────────────┬────────────────────┘
                                    │
           ┌────────────────────────┼────────────────────────┐
           │                        │                        │
           ▼                        ▼                        ▼
┌─────────────────────┐  ┌─────────────────────┐  ┌─────────────────────┐
│     Buffer Pool     │  │       B-Tree        │  │         WAL         │
│    (Page Caching)   │  │      (Indexes)      │  │    (Durability)     │
│                     │  │                     │  │                     │
│ Eviction Policies:  │  │ - Insert/Get/Delete │  │ Sync Modes:         │
│ - Clock (default)   │  │ - Range iteration   │  │ - Immediate         │
│ - LRU-2             │  │ - Page splitting    │  │ - GroupCommit       │
│ - LIRS              │  │ - Leaf linking      │  │ - NoSync            │
└──────────┬──────────┘  └──────────┬──────────┘  └──────────┬──────────┘
           │                        │                        │
           └────────────────────────┼────────────────────────┘
                                    ▼
               ┌─────────────────────────────────────────┐
               │               FileManager               │
               │                                         │
               │  ┌─────────────┐  ┌───────────────┐     │
               │  │ Compression │  │  Encryption   │     │
               │  │ (LZ4/ZSTD)  │  │ (AES-256-GCM) │     │
               │  └─────────────┘  └───────────────┘     │
               │                                         │
               │                Disk I/O                 │
               └────────────────────┬────────────────────┘
                                    │
                                    ▼
               ┌─────────────────────────────────────────┐
               │                FreeList                 │
               │        (Bitmap page allocation)         │
               └─────────────────────────────────────────┘
```
## Key Components
### 1. Buffer Pool (`buffer_pool.rs`)
Page cache with pluggable eviction algorithms. The buffer pool manages an in-memory cache of database pages, reducing disk I/O for frequently accessed data.
**Key Features:**
- Configurable capacity (number of pages)
- Multiple eviction policies (see below)
- RAII-based page guards for safe access
- Pin counting to prevent eviction of in-use pages
- Statistics tracking (hits, misses, evictions, flushes)
**Structures:**
- `BufferPool` - Main cache manager
- `BufferFrame` - Individual cached page with metadata (pin count, dirty flag, reference bit)
- `PageGuard` - Read-only RAII handle for pinned pages
- `PageGuardMut` - Mutable RAII handle for pinned pages
- `BufferPoolStats` / `BufferPoolStatsSnapshot` - Performance metrics
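The pin-counting behind the guards can be sketched as follows. This is an illustrative model, not the crate's actual internals: `Frame`, `PinGuard`, and `pins` are hypothetical names, and the real `PageGuard` additionally mediates access to the page data itself.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;

// Hypothetical frame: a nonzero pin count tells the eviction policy
// that the page is in use and must not be reclaimed.
struct Frame {
    pin_count: AtomicU32,
}

struct PinGuard {
    frame: Arc<Frame>,
}

impl PinGuard {
    fn new(frame: Arc<Frame>) -> Self {
        frame.pin_count.fetch_add(1, Ordering::AcqRel); // pin on acquire
        PinGuard { frame }
    }
}

impl Drop for PinGuard {
    fn drop(&mut self) {
        // Unpin automatically when the guard goes out of scope.
        self.frame.pin_count.fetch_sub(1, Ordering::AcqRel);
    }
}

fn pins(frame: &Frame) -> u32 {
    frame.pin_count.load(Ordering::Acquire)
}
```

Because unpinning lives in `Drop`, a page can never stay pinned by accident after the caller is done with it.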
**Usage:**
```rust
use featherdb_storage::{BufferPool, PageType};
use featherdb_core::EvictionPolicyType;
// Create with specific eviction policy
let pool = BufferPool::with_policy(1000, file_manager, EvictionPolicyType::Lru2);
// Allocate a new page
let (page_id, guard) = pool.allocate_page(PageType::Leaf)?;
{
let mut page = guard.write();
page.insert_cell(b"my data");
}
// Get existing page for reading
let guard = pool.get_page(page_id)?;
let page = guard.read();
// Get existing page for writing
let guard = pool.get_page_for_write(page_id)?;
let mut page = guard.write();
// Check statistics
let stats = pool.stats_snapshot();
println!("Hit ratio: {:.2}%", stats.hit_ratio() * 100.0);
```
### 2. Eviction Policies (`eviction.rs`)
Three eviction algorithms are implemented, selectable at buffer pool creation time:
#### Clock (Default)
Simple second-chance algorithm. Each page has a reference bit that is set on access. The clock hand sweeps through pages, evicting the first page with reference bit = 0, clearing bits as it goes.
- **Pros:** Low overhead, O(1) average case
- **Cons:** Less accurate than LRU for some workloads
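The sweep can be sketched in a few lines; `ClockFrame` and `clock_victim` are illustrative names, not the crate's API:

```rust
// Minimal second-chance (clock) sweep over buffer frames.
struct ClockFrame {
    referenced: bool,
    pinned: bool,
}

/// Advance the clock hand until an unpinned frame with a clear
/// reference bit is found, clearing bits along the way. Returns the
/// victim index, or None if every frame is pinned.
fn clock_victim(frames: &mut [ClockFrame], hand: &mut usize) -> Option<usize> {
    let n = frames.len();
    // Two full sweeps suffice: the first pass clears all reference bits.
    for _ in 0..2 * n {
        let i = *hand;
        *hand = (*hand + 1) % n;
        if frames[i].pinned {
            continue; // pinned frames are never evicted
        }
        if frames[i].referenced {
            frames[i].referenced = false; // second chance
        } else {
            return Some(i);
        }
    }
    None
}
```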
#### LRU-2
Tracks the last 2 access times for each page. Pages are evicted based on the oldest second-to-last access time.
- **Pros:** Excellent scan resistance - pages accessed only once (like during a sequential scan) are evicted before frequently accessed pages
- **Cons:** Higher memory overhead per page
- **Correlated Reference Period:** Accesses within 10ms are considered correlated and don't count as separate accesses
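A minimal model of this bookkeeping, using logical ticks in place of wall-clock time (the type and method names are illustrative):

```rust
use std::collections::HashMap;

// Stand-in for the 10ms correlated reference period, in logical ticks.
const CORRELATED_PERIOD: u64 = 10;

#[derive(Default)]
struct Lru2 {
    // page_id -> (second-to-last access time, last access time)
    history: HashMap<u64, (u64, u64)>,
}

impl Lru2 {
    fn record_access(&mut self, page: u64, now: u64) {
        match self.history.get_mut(&page) {
            Some(entry) => {
                if now.saturating_sub(entry.1) <= CORRELATED_PERIOD {
                    entry.1 = now; // correlated: counts as the same access
                } else {
                    *entry = (entry.1, now); // shift the access history
                }
            }
            None => {
                // First access: no penultimate access yet.
                self.history.insert(page, (0, now));
            }
        }
    }

    /// Evict the page whose second-to-last access is oldest.
    fn victim(&self) -> Option<u64> {
        self.history
            .iter()
            .min_by_key(|(_, &(penultimate, _))| penultimate)
            .map(|(&page, _)| page)
    }
}
```

A page touched once by a scan keeps a penultimate timestamp of zero, so it loses to any page with two genuine accesses: that is the scan resistance.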
#### LIRS (Low Inter-reference Recency Set)
Separates pages into two sets based on inter-reference recency:
- **LIR set (~99%):** Pages with low inter-reference recency (hot, frequently accessed)
- **HIR set (~1%):** Pages with high inter-reference recency (cold)
Only HIR resident pages can be evicted. When a HIR page is accessed twice within the LIR set's recency window, it becomes LIR.
- **Pros:** Best overall performance, adapts to workload changes
- **Cons:** Most complex implementation, highest memory overhead
**Configuration:**
```rust
use featherdb_core::{Config, EvictionPolicyType};
let config = Config::new("mydb.db")
.with_eviction_policy(EvictionPolicyType::Lirs);
```
### 3. B-Tree (`btree.rs`)
B+ tree implementation for ordered key-value storage. Leaf pages are linked for efficient range scans.
**Operations:**
- `insert(key, value)` - Insert or update a key-value pair
- `get(key)` - Point lookup, returns `Option<Vec<u8>>`
- `delete(key)` - Remove entry, returns old value if present
- `iter()` - Iterate over all entries
- `range(start, end)` - Iterate over a key range
**Cell Encoding:**
- Leaf cells: `[key_len:u16][value_len:u32][key][value]`
- Branch cells: `[child_id:u64][key_len:u16][key]`
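Assuming little-endian length fields (the byte order is not specified above), leaf-cell encoding can be sketched as:

```rust
// Encode a leaf cell per the layout above:
// [key_len:u16][value_len:u32][key][value]
fn encode_leaf_cell(key: &[u8], value: &[u8]) -> Vec<u8> {
    let mut cell = Vec::with_capacity(6 + key.len() + value.len());
    cell.extend_from_slice(&(key.len() as u16).to_le_bytes());
    cell.extend_from_slice(&(value.len() as u32).to_le_bytes());
    cell.extend_from_slice(key);
    cell.extend_from_slice(value);
    cell
}

// Decode, returning borrowed slices into the cell bytes.
fn decode_leaf_cell(cell: &[u8]) -> (&[u8], &[u8]) {
    let key_len = u16::from_le_bytes([cell[0], cell[1]]) as usize;
    let value_len =
        u32::from_le_bytes([cell[2], cell[3], cell[4], cell[5]]) as usize;
    let key = &cell[6..6 + key_len];
    let value = &cell[6 + key_len..6 + key_len + value_len];
    (key, value)
}
```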
**Key Design Decisions:**
- Variable-length keys and values
- Leaf pages linked via `right_sibling` for efficient range scans
- Pages split when insert fails due to space
- Cascading splits handled recursively up to root
- New root created when old root splits
**Usage:**
```rust
use featherdb_storage::BTree;
// Create new B-tree
let mut btree = BTree::new(buffer_pool.clone())?;
// Or open existing
let btree = BTree::open(root_page_id, buffer_pool.clone());
// Insert data
btree.insert(b"user:1", b"Alice")?;
btree.insert(b"user:2", b"Bob")?;
// Lookup
if let Some(value) = btree.get(b"user:1")? {
println!("Found: {}", String::from_utf8_lossy(&value));
}
// Range scan
for result in btree.range(b"user:1", b"user:9")? {
let (key, value) = result?;
println!("{}: {}",
String::from_utf8_lossy(&key),
String::from_utf8_lossy(&value));
}
// Delete
if let Some(old) = btree.delete(b"user:1")? {
println!("Deleted: {}", String::from_utf8_lossy(&old));
}
```
### 4. Write-Ahead Log (`wal.rs`)
Ensures durability by logging every change before it is applied to data pages. Supports three sync modes, including group commit for high throughput.
**Sync Modes:**
| Mode | Behavior | Durability | Throughput |
|------|----------|------------|------------|
| `Immediate` | Sync on every commit | Highest | Slowest |
| `GroupCommit` | Batch commits, single fsync | High | 2-5x faster |
| `NoSync` | No sync (OS decides) | Risk of loss | Fastest |
**Record Types:**
- `Begin` - Transaction start
- `Commit` - Transaction committed
- `Abort` - Transaction aborted
- `Write` - Page modification (stores old and new data)
- `Checkpoint` - Checkpoint marker
**WAL Record Format:**
```
┌────────────────────────────────────────────────────────┐
│ LSN (8 bytes) │
│ Transaction ID (8 bytes) │
│ Record Type (1 byte) │
│ Previous LSN (8 bytes) - for undo chain │
│ Page ID (8 bytes) │
│ Old Data Length (4 bytes) │
│ New Data Length (4 bytes) │
│ Old Data (variable) │
│ New Data (variable) │
│ CRC32 Checksum (4 bytes) │
└────────────────────────────────────────────────────────┘
```
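A sketch of serializing a `Write` record in this layout. Field order follows the diagram; little-endian byte order and the exact CRC coverage (everything before the checksum field) are assumptions, and `WalWrite` is an illustrative name. The CRC32 is the standard IEEE polynomial, hand-rolled here to keep the example dependency-free:

```rust
// Standard (reflected) CRC32, polynomial 0xEDB88320, bitwise variant.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

struct WalWrite {
    lsn: u64,
    txn_id: u64,
    record_type: u8,
    prev_lsn: u64,
    page_id: u64,
    old_data: Vec<u8>,
    new_data: Vec<u8>,
}

fn encode_wal_record(r: &WalWrite) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(&r.lsn.to_le_bytes());
    buf.extend_from_slice(&r.txn_id.to_le_bytes());
    buf.push(r.record_type);
    buf.extend_from_slice(&r.prev_lsn.to_le_bytes());
    buf.extend_from_slice(&r.page_id.to_le_bytes());
    buf.extend_from_slice(&(r.old_data.len() as u32).to_le_bytes());
    buf.extend_from_slice(&(r.new_data.len() as u32).to_le_bytes());
    buf.extend_from_slice(&r.old_data);
    buf.extend_from_slice(&r.new_data);
    let checksum = crc32(&buf); // covers everything before the checksum
    buf.extend_from_slice(&checksum.to_le_bytes());
    buf
}
```

On recovery, a record whose stored checksum does not match a recomputed CRC over the preceding bytes marks the end of the usable log (e.g. a torn write).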
**Group Commit Configuration:**
```rust
use featherdb_core::{Config, WalSyncMode, WalGroupCommitConfig};
let wal_config = WalGroupCommitConfig {
sync_mode: WalSyncMode::GroupCommit,
group_commit_interval_ms: 10, // Flush every 10ms
group_commit_max_batch: 100, // Or when 100 records buffered
};
let config = Config::new("mydb.db")
.with_wal_config(wal_config);
```
**Statistics:**
```rust
let stats = wal.stats();
println!("Total records: {}", stats.total_records());
println!("Group commits: {}", stats.group_commits());
println!("Average batch size: {:.1}", stats.average_batch_size());
println!("Fsync count: {}", stats.fsync_count());
```
### 5. Page Format (`page.rs`)
Fixed-size pages with slotted page structure for variable-length records.
**Page Types:**
- `Leaf` - B-tree leaf pages with key-value pairs
- `Branch` - B-tree internal nodes with separator keys and child pointers
- `Overflow` - For large values that don't fit in a single page
- `Free` - Available for allocation
**Page Layout (4KB default):**
```
┌──────────────────────────────────────────────────────────┐
│ Header (64 bytes) │
│ ├─ page_id: u64 (8 bytes) │
│ ├─ page_type: u8 (1 byte) │
│ ├─ flags: u8 (dirty, encrypted, compressed) │
│ ├─ version: u64 (MVCC version) │
│ ├─ checksum: u128 (XXH3-128) │
│ ├─ item_count: u16 │
│ ├─ free_start: u16 (slot array end) │
│ ├─ free_end: u16 (cell data start) │
│ ├─ parent: u64 (parent page ID) │
│ ├─ right_sibling: u64 │
│ └─ reserved: 6 bytes │
├──────────────────────────────────────────────────────────┤
│ Slot Array (grows downward from header) │
│ └─ [offset:u16, len:u16, flags:u8, reserved:u8] (6B) │
├──────────────────────────────────────────────────────────┤
│ Free Space │
├──────────────────────────────────────────────────────────┤
│ Cell Data (grows upward from bottom) │
└──────────────────────────────────────────────────────────┘
```
**Checksum:** XXH3-128 for fast, high-quality integrity checking.
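The header's `free_start`/`free_end` pair makes the free-space check for an insert a simple subtraction; the helper names below are illustrative:

```rust
// Each new record costs its cell bytes plus one 6-byte slot entry.
const SLOT_SIZE: u16 = 6;

struct PageHeader {
    free_start: u16, // first free byte after the slot array
    free_end: u16,   // first byte of the lowest cell
}

// Free space is the gap between the slot array and the cell area.
fn free_space(h: &PageHeader) -> u16 {
    h.free_end - h.free_start
}

fn fits(h: &PageHeader, cell_len: u16) -> bool {
    free_space(h) >= cell_len + SLOT_SIZE
}
```

When `fits` returns false for a B-tree insert, the page is split, as described in the B-Tree section.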
### 6. Compression (`compression.rs`)
Adaptive page compression with two algorithms:
**Compression Types:**
| Type | Behavior | Best For |
|------|----------|----------|
| `None` | No compression | Encrypted data, random data |
| `Lz4` | Fast compression | Default, balanced |
| `Zstd { level }` | High ratio | Cold data, archival |
**ZSTD Levels:**
- Level 1-3: Fast compression
- Level 3 (default): Good balance
- Level 9: High compression
- Level 19-22: Maximum compression
**Compressed Page Header (12 bytes):**
```
┌──────────────────────────────────────┐
│ Magic byte (0xC0) │
│ Compression type (0=none, 1=lz4, 2=zstd) │
│ Original size: u32 │
│ Compressed size: u32 │
│ Compression level: u8 │
│ Reserved: u8 │
└──────────────────────────────────────┘
```
**Adaptive Behavior:**
- Pages smaller than threshold (default 512 bytes) are stored uncompressed
- If compression doesn't achieve at least 10% reduction, original data is stored
- Decompression auto-detects format via magic byte
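The decision logic can be sketched as follows; `should_store_compressed` is an illustrative helper, with the compressed length standing in for the result of the actual LZ4/ZSTD call:

```rust
// Pages below this size are stored uncompressed (default threshold).
const MIN_COMPRESS_SIZE: usize = 512;

fn should_store_compressed(original_len: usize, compressed_len: usize) -> bool {
    if original_len < MIN_COMPRESS_SIZE {
        return false; // too small to be worth the header overhead
    }
    // Require at least a 10% reduction, using integer arithmetic:
    // compressed <= 0.9 * original.
    compressed_len * 10 <= original_len * 9
}
```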
**Configuration:**
```rust
use featherdb_core::Config;
// LZ4 compression (fast)
let config = Config::new("mydb.db").with_lz4_compression();
// ZSTD compression (better ratio)
let config = Config::new("mydb.db").with_zstd_compression();
// ZSTD with custom level
let config = Config::new("mydb.db").with_zstd_compression_level(9);
// Custom threshold
let config = Config::new("mydb.db")
.with_compression_threshold(1024); // Only compress pages > 1KB
```
**Statistics:**
```rust
let stats = file_manager.compression_stats();
println!("Compression ratio: {:.1}%", stats.compression_ratio() * 100.0);
println!("Space savings: {:.1}%", stats.space_savings_percent());
println!("Pages compressed: {}", stats.pages_compressed.load(Ordering::Relaxed));
println!("Pages skipped: {}", stats.pages_skipped.load(Ordering::Relaxed));
```
### 7. File Manager (`file.rs`)
Handles all disk I/O with optional encryption and compression.
**Write Pipeline:** Data -> Compress -> Encrypt -> Disk
**Read Pipeline:** Disk -> Decrypt -> Decompress -> Data
**Superblock (512 bytes):**
```
┌──────────────────────────────────────┐
│ Magic: "FeatherDB\0" (10 bytes) │
│ Format version: u32 │
│ Page size: u32 │
│ Total pages: u64 │
│ Free list head: u64 │
│ Schema root: u64 │
│ Last checkpoint LSN: u64 │
│ God byte: u8 (commit slot selector) │
│ Encryption flags: u8 │
│ Encryption salt: [u8; 32] │
│ Reserved padding... │
└──────────────────────────────────────┘
```
### 8. Free List (`freelist.rs`)
Bitmap-based page allocation tracking.
**Features:**
- Bitmap stored in dedicated page
- Hint pointer for fast allocation (avoids scanning from start)
- Automatic bitmap extension when file grows
- Double-free detection
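A minimal model of such a bitmap allocator, with the hint pointer and double-free check (names are illustrative, not the crate's API):

```rust
// One bit per page; a set bit means the page is allocated.
struct Bitmap {
    bits: Vec<u64>,
    hint: usize, // word index where the last allocation succeeded
}

impl Bitmap {
    fn new(pages: usize) -> Self {
        Bitmap { bits: vec![0; (pages + 63) / 64], hint: 0 }
    }

    /// Find a clear bit starting from the hint, set it, return its index.
    fn allocate(&mut self) -> Option<usize> {
        let n = self.bits.len();
        for step in 0..n {
            let w = (self.hint + step) % n;
            if self.bits[w] != u64::MAX {
                let bit = (!self.bits[w]).trailing_zeros() as usize;
                self.bits[w] |= 1 << bit;
                self.hint = w; // resume near the last success next time
                return Some(w * 64 + bit);
            }
        }
        None // every page is allocated
    }

    /// Clear a bit; reports an error on double-free.
    fn free(&mut self, page: usize) -> Result<(), &'static str> {
        let (w, bit) = (page / 64, page % 64);
        if self.bits[w] & (1 << bit) == 0 {
            return Err("double free");
        }
        self.bits[w] &= !(1 << bit);
        Ok(())
    }
}
```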
## Storage Limits
The storage engine supports configurable size limits to prevent runaway disk usage:
```rust
use featherdb_core::{Config, StorageLimitsConfig};
let config = Config::new("mydb.db")
.with_max_database_size_mb(100) // 100MB database limit
.with_storage_limits(StorageLimitsConfig::new()
.with_max_database_size(100 * 1024 * 1024)
.with_max_wal_size(10 * 1024 * 1024)
.with_safety_margin_percent(5) // Reject at 95% capacity
);
let storage = StorageEngine::open(config)?;
// Monitor storage quota
let quota = storage.storage_quota();
println!("Database: {} / {} bytes", quota.used_bytes, quota.limit_bytes.unwrap_or(0));
println!("WAL: {} / {} bytes", quota.wal_used_bytes, quota.wal_limit_bytes.unwrap_or(0));
if let Some(pct) = quota.usage_percent() {
println!("Usage: {:.1}%", pct);
}
// Check if limit exceeded
if quota.is_exceeded() {
eprintln!("Database size limit exceeded!");
}
```
**StorageQuota API:**
```rust
pub struct StorageQuota {
pub used_bytes: u64,
pub limit_bytes: Option<u64>,
pub wal_used_bytes: u64,
pub wal_limit_bytes: Option<u64>,
pub disk_available_bytes: u64,
}
impl StorageQuota {
pub fn remaining(&self) -> Option<u64>;
pub fn usage_percent(&self) -> Option<f64>;
pub fn is_exceeded(&self) -> bool;
pub fn is_wal_exceeded(&self) -> bool;
}
```
## Configuration Options
```rust
use featherdb_core::{Config, EvictionPolicyType, WalSyncMode, WalGroupCommitConfig};
let config = Config::new("mydb.db")
// Page settings
.with_page_size(4096) // Page size in bytes
// Buffer pool settings
.with_buffer_pool_pages(16384) // 64MB with 4KB pages
.with_eviction_policy(EvictionPolicyType::Lirs)
// WAL settings
.with_wal_config(WalGroupCommitConfig {
sync_mode: WalSyncMode::GroupCommit,
group_commit_interval_ms: 10,
group_commit_max_batch: 100,
})
// Storage limits
.with_max_database_size_mb(100) // 100MB limit
// Compression
.with_lz4_compression()
// or: .with_zstd_compression_level(3)
// Encryption
.with_password("secret_password")
// or: .with_encryption_key(key_bytes)
// File options
.with_create_if_missing(true);
```
## Complete Usage Example
```rust
use featherdb_storage::{StorageEngine, BTree, BufferPool};
use featherdb_core::{Config, EvictionPolicyType, WalSyncMode, WalGroupCommitConfig};
// Configure storage
let config = Config::new("mydb.db")
.with_buffer_pool_pages(1000)
.with_eviction_policy(EvictionPolicyType::Lru2)
.with_wal_config(WalGroupCommitConfig {
sync_mode: WalSyncMode::GroupCommit,
group_commit_interval_ms: 10,
group_commit_max_batch: 50,
})
.with_lz4_compression();
// Open storage engine
let storage = StorageEngine::open(config)?;
// Create a B-tree index
let mut btree = BTree::new(storage.buffer_pool().clone())?;
// Insert data
for i in 0..1000 {
let key = format!("key:{:05}", i);
let value = format!("value:{}", i);
btree.insert(key.as_bytes(), value.as_bytes())?;
}
// Point lookup
if let Some(value) = btree.get(b"key:00042")? {
println!("Found: {}", String::from_utf8_lossy(&value));
}
// Range scan
println!("Keys 100-110:");
for result in btree.range(b"key:00100", b"key:00110")? {
let (key, value) = result?;
println!(" {} = {}",
String::from_utf8_lossy(&key),
String::from_utf8_lossy(&value));
}
// Check buffer pool stats
let stats = storage.buffer_pool().stats_snapshot();
println!("Buffer pool hit ratio: {:.2}%", stats.hit_ratio() * 100.0);
// Checkpoint for durability
let checkpoint_lsn = storage.checkpoint()?;
println!("Checkpoint at LSN: {:?}", checkpoint_lsn);
```
## Performance Characteristics
### Buffer Pool Sizing
- **Rule of thumb:** 10-20% of expected working set
- **Minimum:** At least enough for B-tree depth + active pages
- **Default:** 64MB (16384 pages at 4KB)
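The default works out as plain arithmetic, pages times page size:

```rust
// Buffer pool memory footprint: page count times page size.
fn pool_bytes(pages: u64, page_size: u64) -> u64 {
    pages * page_size
}
```

With the defaults, `pool_bytes(16384, 4096)` is 67,108,864 bytes, i.e. 64MB.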
### Eviction Policy Selection
| Workload | Recommended Policy |
|----------|--------------------|
| General OLTP | Clock (low overhead) |
| Sequential scans mixed with point queries | LRU-2 |
| Complex workloads, adaptive | LIRS |
### WAL Performance
| Sync Mode | Commits/sec (typical) | Durability |
|-----------|-----------------------|------------|
| Immediate | 100-500 | Fully durable |
| GroupCommit | 5,000-20,000 | Durable (batched) |
| NoSync | 50,000+ | Risk of loss on crash |
### Compression Trade-offs
| Type | Speed | Ratio | Best For |
|------|-------|-------|----------|
| None | N/A | 1:1 | Encrypted/random data |
| LZ4 | Very Fast | 2-3x | General use |
| ZSTD(3) | Fast | 3-4x | Balanced |
| ZSTD(9+) | Slow | 4-6x | Cold data, archival |
## File Format
```
Offset 0-511: Superblock
Offset 512: Padding to page boundary
Offset 4096: Page 0 (reserved)
Offset 8192: Page 1 (Schema catalog root)
Offset 12288: Page 2 (Free list bitmap)
Offset 16384+: Data pages (B-tree nodes, etc.)
```
Note: When encryption is enabled, each page on disk includes additional overhead for the authentication tag.
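Given the 4KB page size, the offset of any page follows directly from this layout (the superblock's page is reserved, so page N begins one page in); `page_offset` is an illustrative helper:

```rust
// Page N lives at (N + 1) * page_size: the first page-sized region
// holds the superblock plus padding.
fn page_offset(page_id: u64, page_size: u64) -> u64 {
    (page_id + 1) * page_size
}
```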
## Testing
```bash
# Run all storage tests (using Make)
make test-crate CRATE=featherdb-storage
# Or with cargo directly
cargo test -p featherdb-storage
# Run with logging
RUST_LOG=debug cargo test -p featherdb-storage
# Run specific test
cargo test -p featherdb-storage test_btree_many_inserts
# Run benchmarks
cargo bench -p featherdb-storage
# Run coverage (from project root)
make coverage # or: cargo llvm-cov --workspace
```
## Module Exports
The crate exports the following public items:
```rust
// Core types
pub use page::{Page, PageType, SlotEntry};
pub use btree::BTree;
pub use buffer_pool::{BufferPool, BufferPoolStats, BufferPoolStatsSnapshot, PageGuard};
pub use eviction::{
EvictionPolicy, EvictionPolicyType, EvictionPolicyFactory,
ClockEviction, Lru2Eviction, LirsEviction
};
pub use wal::{Wal, WalRecord, WalRecordType, WalStats};
pub use file::FileManager;
pub use freelist::FreeList;
pub use compression::{
CompressionType, CompressionStats, Compressor, PageCompressor,
CompressedPageHeader, create_compressor,
};
```