featherdb-storage 1.0.0

B-tree storage engine with buffer pool, WAL, and compression for FeatherDB

FeatherDB Storage Engine

The storage engine is the foundation of FeatherDB, providing durable, crash-safe storage with efficient page-based access patterns, multiple eviction policies, group commit WAL, and adaptive page compression.

Architecture Overview

                    ┌─────────────────────────────────────────┐
                    │            StorageEngine                 │
                    │  (Unified interface for storage ops)     │
                    └───────────────────┬─────────────────────┘
                                        │
          ┌─────────────────────────────┼─────────────────────────────┐
          │                             │                             │
          ▼                             ▼                             ▼
┌─────────────────────┐     ┌─────────────────────┐     ┌─────────────────────┐
│     Buffer Pool     │     │       B-Tree        │     │        WAL          │
│   (Page Caching)    │     │     (Indexes)       │     │    (Durability)     │
│                     │     │                     │     │                     │
│ Eviction Policies:  │     │ - Insert/Get/Delete │     │ Sync Modes:         │
│ - Clock (default)   │     │ - Range iteration   │     │ - Immediate         │
│ - LRU-2             │     │ - Page splitting    │     │ - GroupCommit       │
│ - LIRS              │     │ - Leaf linking      │     │ - NoSync            │
└──────────┬──────────┘     └──────────┬──────────┘     └──────────┬──────────┘
           │                           │                           │
           └───────────────────────────┼───────────────────────────┘
                                       ▼
                    ┌─────────────────────────────────────────┐
                    │             FileManager                  │
                    │                                         │
                    │  ┌─────────────┐    ┌───────────────┐   │
                    │  │ Compression │    │  Encryption   │   │
                    │  │ (LZ4/ZSTD)  │    │ (AES-256-GCM) │   │
                    │  └─────────────┘    └───────────────┘   │
                    │                                         │
                    │              Disk I/O                   │
                    └─────────────────────────────────────────┘
                                       │
                                       ▼
                    ┌─────────────────────────────────────────┐
                    │              FreeList                    │
                    │       (Bitmap page allocation)          │
                    └─────────────────────────────────────────┘

Key Components

1. Buffer Pool (buffer_pool.rs)

Page cache with pluggable eviction algorithms. The buffer pool manages an in-memory cache of database pages, reducing disk I/O for frequently accessed data.

Key Features:

  • Configurable capacity (number of pages)
  • Multiple eviction policies (see below)
  • RAII-based page guards for safe access
  • Pin counting to prevent eviction of in-use pages
  • Statistics tracking (hits, misses, evictions, flushes)

Structures:

  • BufferPool - Main cache manager
  • BufferFrame - Individual cached page with metadata (pin count, dirty flag, reference bit)
  • PageGuard - Read-only RAII handle for pinned pages
  • PageGuardMut - Mutable RAII handle for pinned pages
  • BufferPoolStats / BufferPoolStatsSnapshot - Performance metrics

Usage:

use featherdb_storage::{BufferPool, PageType};
use featherdb_core::EvictionPolicyType;

// Create with specific eviction policy
let pool = BufferPool::with_policy(1000, file_manager, EvictionPolicyType::Lru2);

// Allocate a new page
let (page_id, guard) = pool.allocate_page(PageType::Leaf)?;
{
    let mut page = guard.write();
    page.insert_cell(b"my data");
}

// Get existing page for reading
let guard = pool.get_page(page_id)?;
let page = guard.read();

// Get existing page for writing
let guard = pool.get_page_for_write(page_id)?;
let mut page = guard.write();

// Check statistics
let stats = pool.stats_snapshot();
println!("Hit ratio: {:.2}%", stats.hit_ratio() * 100.0);

2. Eviction Policies (eviction.rs)

Three eviction algorithms are implemented, selectable at buffer pool creation time:

Clock (Default)

Simple second-chance algorithm. Each page has a reference bit that is set on access. The clock hand sweeps through pages, evicting the first page with reference bit = 0, clearing bits as it goes.

  • Pros: Low overhead, O(1) average case
  • Cons: Less accurate than LRU for some workloads
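The second-chance sweep described above can be sketched in a few lines. This is an illustrative model, not the crate's actual `eviction.rs` internals; `ClockFrame` and the field names are assumptions.

```rust
// Minimal sketch of clock (second-chance) eviction: the hand sweeps the
// frames, clearing reference bits, and evicts the first unreferenced,
// unpinned frame it finds. Names here are illustrative only.
struct ClockFrame { referenced: bool, pinned: bool }

struct Clock { frames: Vec<ClockFrame>, hand: usize }

impl Clock {
    /// Returns the index of the victim frame, or None if every frame is pinned.
    fn evict(&mut self) -> Option<usize> {
        let n = self.frames.len();
        for _ in 0..2 * n {                  // at most two full sweeps
            let i = self.hand;
            self.hand = (self.hand + 1) % n;
            let f = &mut self.frames[i];
            if f.pinned { continue; }        // in-use pages are never victims
            if f.referenced {
                f.referenced = false;        // second chance: clear and move on
            } else {
                return Some(i);              // victim found
            }
        }
        None
    }
}
```

A frame touched since the last sweep survives one extra pass, which is what gives clock its LRU-like behavior at O(1) cost per access.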

LRU-2

Tracks the last 2 access times for each page. Pages are evicted based on the oldest second-to-last access time.

  • Pros: Excellent scan resistance - pages accessed only once (like during a sequential scan) are evicted before frequently accessed pages
  • Cons: Higher memory overhead per page
  • Correlated Reference Period: Accesses within 10ms are considered correlated and don't count as separate accesses
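The bookkeeping behind LRU-2 can be sketched as follows. This is a simplified model, not the crate's implementation: each page keeps its last two access times, accesses closer together than the correlated reference period merge into one, and the victim is the page whose second-to-last access is oldest (pages seen only once sort first, which is the scan-resistance property).

```rust
use std::collections::HashMap;

/// Accesses within this window are treated as a single reference.
const CORRELATED_PERIOD_MS: u64 = 10;

#[derive(Default)]
struct Lru2 {
    // page id -> (second-to-last access, last access); None = seen only once
    history: HashMap<u64, (Option<u64>, u64)>,
}

impl Lru2 {
    fn touch(&mut self, page: u64, now_ms: u64) {
        match self.history.get_mut(&page) {
            Some((prev, last)) => {
                if now_ms.saturating_sub(*last) >= CORRELATED_PERIOD_MS {
                    *prev = Some(*last);    // a genuinely new reference
                }
                *last = now_ms;             // correlated accesses just slide `last`
            }
            None => { self.history.insert(page, (None, now_ms)); }
        }
    }

    /// Victim = oldest second-to-last access; single-access pages sort first.
    fn victim(&self) -> Option<u64> {
        self.history.iter()
            .min_by_key(|(_, (prev, last))| (prev.unwrap_or(0), *last))
            .map(|(&p, _)| p)
    }
}
```

A page read once by a sequential scan never acquires a second access time, so it is always preferred as a victim over pages the workload keeps coming back to.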

LIRS (Low Inter-reference Recency Set)

Separates pages into two sets based on inter-reference recency:

  • LIR set (~99%): Pages with low inter-reference recency (hot, frequently accessed)
  • HIR set (~1%): Pages with high inter-reference recency (cold)

Only resident HIR pages are eviction candidates. When a HIR page is re-accessed within the recency window defined by the LIR set, it is promoted to LIR.

  • Pros: Best overall performance, adapts to workload changes
  • Cons: Most complex implementation, highest memory overhead

Configuration:

use featherdb_core::{Config, EvictionPolicyType};

let config = Config::new("mydb.db")
    .with_eviction_policy(EvictionPolicyType::Lirs);

3. B-Tree (btree.rs)

B+ tree implementation for ordered key-value storage. Leaf pages are linked for efficient range scans.

Operations:

  • insert(key, value) - Insert or update a key-value pair
  • get(key) - Point lookup, returns Option<Vec<u8>>
  • delete(key) - Remove entry, returns old value if present
  • iter() - Iterate over all entries
  • range(start, end) - Iterate over a key range

Cell Encoding:

  • Leaf cells: [key_len:u16][value_len:u32][key][value]
  • Branch cells: [child_id:u64][key_len:u16][key]
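The leaf-cell layout above is straightforward to encode and decode. A minimal sketch, assuming little-endian length fields (the actual byte order used by btree.rs is not stated here):

```rust
/// Encode a leaf cell as [key_len:u16][value_len:u32][key][value].
fn encode_leaf_cell(key: &[u8], value: &[u8]) -> Vec<u8> {
    let mut cell = Vec::with_capacity(6 + key.len() + value.len());
    cell.extend_from_slice(&(key.len() as u16).to_le_bytes());
    cell.extend_from_slice(&(value.len() as u32).to_le_bytes());
    cell.extend_from_slice(key);
    cell.extend_from_slice(value);
    cell
}

/// Inverse of the above: split a cell back into (key, value) slices.
fn decode_leaf_cell(cell: &[u8]) -> (&[u8], &[u8]) {
    let key_len = u16::from_le_bytes([cell[0], cell[1]]) as usize;
    let val_len = u32::from_le_bytes([cell[2], cell[3], cell[4], cell[5]]) as usize;
    let key = &cell[6..6 + key_len];
    let value = &cell[6 + key_len..6 + key_len + val_len];
    (key, value)
}
```

The 6-byte prefix (2 + 4 bytes) is the only per-cell overhead, which is why short keys with short values pack densely into a 4KB page.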

Key Design Decisions:

  • Variable-length keys and values
  • Leaf pages linked via right_sibling for efficient range scans
  • Pages split when insert fails due to space
  • Cascading splits handled recursively up to root
  • New root created when old root splits
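The split behavior in the list above can be sketched at the leaf level: move the upper half of the cells to a new page and pass the first key of the right page up as the separator. This is a simplified in-memory model (no pages or sibling pointers), not the crate's split code.

```rust
/// Split a sorted run of leaf cells at the midpoint.
/// Returns (left half, right half, separator key to push into the parent).
fn split_leaf(
    cells: Vec<(Vec<u8>, Vec<u8>)>,
) -> (Vec<(Vec<u8>, Vec<u8>)>, Vec<(Vec<u8>, Vec<u8>)>, Vec<u8>) {
    let mid = cells.len() / 2;
    let mut left = cells;
    let right = left.split_off(mid);        // upper half moves to the new page
    let separator = right[0].0.clone();     // smallest key of the right page
    (left, right, separator)
}
```

When the separator's insertion into the parent fails for space in turn, the same split repeats one level up, which is the recursive cascade (and new-root creation) described above.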

Usage:

use featherdb_storage::BTree;

// Create new B-tree
let mut btree = BTree::new(buffer_pool.clone())?;

// Or open existing
let btree = BTree::open(root_page_id, buffer_pool.clone());

// Insert data
btree.insert(b"user:1", b"Alice")?;
btree.insert(b"user:2", b"Bob")?;

// Lookup
if let Some(value) = btree.get(b"user:1")? {
    println!("Found: {}", String::from_utf8_lossy(&value));
}

// Range scan
for result in btree.range(b"user:1", b"user:9")? {
    let (key, value) = result?;
    println!("{}: {}",
        String::from_utf8_lossy(&key),
        String::from_utf8_lossy(&value));
}

// Delete
if let Some(old) = btree.delete(b"user:1")? {
    println!("Deleted: {}", String::from_utf8_lossy(&old));
}

4. Write-Ahead Log (wal.rs)

Ensures durability by logging every change before the corresponding data page is modified. Supports three sync modes, including group commit for high throughput.

Sync Modes:

Mode         Description                   Durability     Performance
Immediate    Sync on every commit          Highest        Slowest
GroupCommit  Batch commits, single fsync   High           2-5x faster
NoSync       No sync (OS decides)          Risk of loss   Fastest
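The throughput gap between Immediate and GroupCommit comes from amortization: one fsync retires a whole batch of buffered commit records instead of a single one. A back-of-the-envelope sketch (illustrative numbers, not measurements):

```rust
/// With group commit, each fsync covers up to `max_batch` buffered commits
/// (or whatever accumulated during the flush interval, if fewer).
fn fsyncs_needed(commits: u64, max_batch: u64) -> u64 {
    (commits + max_batch - 1) / max_batch // ceiling division
}
```

At `group_commit_max_batch: 100`, a burst of 1,000 commits costs 10 fsyncs rather than 1,000, which is where the 2-5x (and often more, under contention) improvement comes from.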

Record Types:

  • Begin - Transaction start
  • Commit - Transaction committed
  • Abort - Transaction aborted
  • Write - Page modification (stores old and new data)
  • Checkpoint - Checkpoint marker

WAL Record Format:

┌────────────────────────────────────────────────────────┐
│ LSN (8 bytes)                                          │
│ Transaction ID (8 bytes)                               │
│ Record Type (1 byte)                                   │
│ Previous LSN (8 bytes) - for undo chain               │
│ Page ID (8 bytes)                                      │
│ Old Data Length (4 bytes)                              │
│ New Data Length (4 bytes)                              │
│ Old Data (variable)                                    │
│ New Data (variable)                                    │
│ CRC32 Checksum (4 bytes)                               │
└────────────────────────────────────────────────────────┘
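Serializing a record per the layout above looks roughly like this. The field order follows the diagram; little-endian encoding and the exact checksum coverage (everything before the CRC field) are assumptions for illustration, and the CRC-32 here is a plain bitwise IEEE implementation rather than whatever wal.rs actually uses.

```rust
/// Bitwise CRC-32 (IEEE polynomial, reflected), for illustration.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0xEDB8_8320 } else { crc >> 1 };
        }
    }
    !crc
}

/// Lay out one WAL record as in the diagram, checksum last.
fn encode_wal_record(
    lsn: u64, txn_id: u64, record_type: u8, prev_lsn: u64,
    page_id: u64, old_data: &[u8], new_data: &[u8],
) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(&lsn.to_le_bytes());
    buf.extend_from_slice(&txn_id.to_le_bytes());
    buf.push(record_type);
    buf.extend_from_slice(&prev_lsn.to_le_bytes());
    buf.extend_from_slice(&page_id.to_le_bytes());
    buf.extend_from_slice(&(old_data.len() as u32).to_le_bytes());
    buf.extend_from_slice(&(new_data.len() as u32).to_le_bytes());
    buf.extend_from_slice(old_data);
    buf.extend_from_slice(new_data);
    let crc = crc32(&buf);                 // checksum covers all prior bytes
    buf.extend_from_slice(&crc.to_le_bytes());
    buf
}
```

On recovery, a record whose stored CRC does not match a recomputation over the preceding bytes marks the torn tail of the log; replay stops there.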

Group Commit Configuration:

use featherdb_core::{Config, WalSyncMode, WalGroupCommitConfig};

let wal_config = WalGroupCommitConfig {
    sync_mode: WalSyncMode::GroupCommit,
    group_commit_interval_ms: 10,   // Flush every 10ms
    group_commit_max_batch: 100,    // Or when 100 records buffered
};

let config = Config::new("mydb.db")
    .with_wal_config(wal_config);

Statistics:

let stats = wal.stats();
println!("Total records: {}", stats.total_records());
println!("Group commits: {}", stats.group_commits());
println!("Average batch size: {:.1}", stats.average_batch_size());
println!("Fsync count: {}", stats.fsync_count());

5. Page Format (page.rs)

Fixed-size pages with slotted page structure for variable-length records.

Page Types:

  • Leaf - B-tree leaf pages with key-value pairs
  • Branch - B-tree internal nodes with separator keys and child pointers
  • Overflow - For large values that don't fit in a single page
  • Free - Available for allocation

Page Layout (4KB default):

┌──────────────────────────────────────────────────────────┐
│ Header (64 bytes)                                         │
│  ├─ page_id: u64 (8 bytes)                               │
│  ├─ page_type: u8 (1 byte)                               │
│  ├─ flags: u8 (dirty, encrypted, compressed)             │
│  ├─ version: u64 (MVCC version)                          │
│  ├─ checksum: u128 (XXH3-128)                            │
│  ├─ item_count: u16                                      │
│  ├─ free_start: u16 (slot array end)                     │
│  ├─ free_end: u16 (cell data start)                      │
│  ├─ parent: u64 (parent page ID)                         │
│  ├─ right_sibling: u64                                   │
│  └─ reserved: 6 bytes                                    │
├──────────────────────────────────────────────────────────┤
│ Slot Array (grows downward from header)                   │
│  └─ [offset:u16, len:u16, flags:u8, reserved:u8] (6B)   │
├──────────────────────────────────────────────────────────┤
│                    Free Space                             │
├──────────────────────────────────────────────────────────┤
│ Cell Data (grows upward from bottom)                      │
└──────────────────────────────────────────────────────────┘

Checksum: XXH3-128 for fast, high-quality integrity checking.
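The slotted-page mechanics above reduce to two cursors: `free_start` (end of the slot array, growing down from the header) and `free_end` (start of cell data, growing up from the page end). A minimal bookkeeping sketch, not page.rs itself:

```rust
const PAGE_SIZE: usize = 4096;
const HEADER_SIZE: usize = 64;
const SLOT_SIZE: usize = 6; // [offset:u16][len:u16][flags:u8][reserved:u8]

struct SlottedPage { data: [u8; PAGE_SIZE], free_start: usize, free_end: usize }

impl SlottedPage {
    fn new() -> Self {
        SlottedPage { data: [0; PAGE_SIZE], free_start: HEADER_SIZE, free_end: PAGE_SIZE }
    }

    fn free_space(&self) -> usize { self.free_end - self.free_start }

    /// Returns the new slot index, or None when the cell no longer fits --
    /// the condition that triggers a page split in the B-tree layer.
    fn insert_cell(&mut self, cell: &[u8]) -> Option<usize> {
        if self.free_space() < cell.len() + SLOT_SIZE { return None; }
        self.free_end -= cell.len();           // cell data fills from the bottom up
        self.data[self.free_end..self.free_end + cell.len()].copy_from_slice(cell);
        let slot = self.free_start;            // slot entry appends from the top down
        self.data[slot..slot + 2].copy_from_slice(&(self.free_end as u16).to_le_bytes());
        self.data[slot + 2..slot + 4].copy_from_slice(&(cell.len() as u16).to_le_bytes());
        // bytes 4..6 (flags, reserved) stay zeroed in this sketch
        self.free_start += SLOT_SIZE;
        Some((self.free_start - HEADER_SIZE) / SLOT_SIZE - 1)
    }
}
```

Because cells are addressed through the slot array, they can be compacted or reordered without changing any slot index that other pages reference.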

6. Compression (compression.rs)

Adaptive page compression with two algorithms (LZ4 and ZSTD) plus an uncompressed pass-through:

Compression Types:

Type            Description        Use Case
None            No compression     Encrypted data, random data
Lz4             Fast compression   Default, balanced
Zstd { level }  High ratio         Cold data, archival

ZSTD Levels:

  • Level 1-3: Fast compression
  • Level 3 (default): Good balance
  • Level 9: High compression
  • Level 19-22: Maximum compression

Compressed Page Header (12 bytes):

┌──────────────────────────────────────┐
│ Magic byte (0xC0)                    │
│ Compression type (0=none, 1=lz4, 2=zstd) │
│ Original size: u32                   │
│ Compressed size: u32                 │
│ Compression level: u8                │
│ Reserved: u8                         │
└──────────────────────────────────────┘

Adaptive Behavior:

  • Pages smaller than threshold (default 512 bytes) are stored uncompressed
  • If compression doesn't achieve at least 10% reduction, original data is stored
  • Decompression auto-detects format via magic byte
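The adaptive decision reduces to two checks. A sketch with a pluggable compressor so the logic is testable without lz4/zstd; the 512-byte threshold and 10% minimum reduction mirror the defaults stated above, and `Stored`/`store_page` are illustrative names, not the crate's API:

```rust
const THRESHOLD: usize = 512; // pages below this size are never compressed

enum Stored { Raw(Vec<u8>), Compressed(Vec<u8>) }

fn store_page(page: &[u8], compress: impl Fn(&[u8]) -> Vec<u8>) -> Stored {
    if page.len() < THRESHOLD {
        return Stored::Raw(page.to_vec());       // too small to be worth it
    }
    let compressed = compress(page);
    // keep the compressed form only if it saves at least 10%
    if compressed.len() * 10 <= page.len() * 9 {
        Stored::Compressed(compressed)
    } else {
        Stored::Raw(page.to_vec())               // incompressible: store original
    }
}
```

The 10% floor matters for encrypted or already-compressed data: such pages compress to roughly their original size, so storing them raw avoids paying decompression cost for no space gain.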

Configuration:

use featherdb_core::Config;

// LZ4 compression (fast)
let config = Config::new("mydb.db").with_lz4_compression();

// ZSTD compression (better ratio)
let config = Config::new("mydb.db").with_zstd_compression();

// ZSTD with custom level
let config = Config::new("mydb.db").with_zstd_compression_level(9);

// Custom threshold
let config = Config::new("mydb.db")
    .with_compression_threshold(1024);  // Only compress pages > 1KB

Statistics:

let stats = file_manager.compression_stats();
println!("Compression ratio: {:.1}%", stats.compression_ratio() * 100.0);
println!("Space savings: {:.1}%", stats.space_savings_percent());
println!("Pages compressed: {}", stats.pages_compressed.load(Ordering::Relaxed));
println!("Pages skipped: {}", stats.pages_skipped.load(Ordering::Relaxed));

7. File Manager (file.rs)

Handles all disk I/O with optional encryption and compression.

Write Pipeline: Data -> Compress -> Encrypt -> Disk
Read Pipeline:  Disk -> Decrypt -> Decompress -> Data

Superblock (512 bytes):

┌──────────────────────────────────────┐
│ Magic: "FeatherDB\0" (10 bytes)      │
│ Format version: u32                  │
│ Page size: u32                       │
│ Total pages: u64                     │
│ Free list head: u64                  │
│ Schema root: u64                     │
│ Last checkpoint LSN: u64             │
│ God byte: u8 (commit slot selector)  │
│ Encryption flags: u8                 │
│ Encryption salt: [u8; 32]            │
│ Reserved padding...                  │
└──────────────────────────────────────┘

8. Free List (freelist.rs)

Bitmap-based page allocation tracking.

Features:

  • Bitmap stored in dedicated page
  • Hint pointer for fast allocation (avoids scanning from start)
  • Automatic bitmap extension when file grows
  • Double-free detection
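The feature list above can be sketched as a bitmap with an allocation hint and a double-free guard. Illustrative only: the real freelist.rs persists its bitmap in a dedicated page and extends it as the file grows.

```rust
/// One bit per page; bit set = page in use.
struct FreeList { bitmap: Vec<u8>, hint: usize }

impl FreeList {
    fn new(pages: usize) -> Self {
        FreeList { bitmap: vec![0; (pages + 7) / 8], hint: 0 }
    }

    fn is_used(&self, p: usize) -> bool {
        self.bitmap[p / 8] & (1 << (p % 8)) != 0
    }

    /// Scan from the hint so consecutive allocations don't rescan from page 0.
    fn allocate(&mut self) -> Option<usize> {
        let n = self.bitmap.len() * 8;
        for i in 0..n {
            let p = (self.hint + i) % n;
            if !self.is_used(p) {
                self.bitmap[p / 8] |= 1 << (p % 8);
                self.hint = p + 1;
                return Some(p);
            }
        }
        None // file full: caller must extend the bitmap
    }

    /// Refuse a double free rather than silently corrupting the map.
    fn free(&mut self, p: usize) -> Result<(), &'static str> {
        if !self.is_used(p) { return Err("double free"); }
        self.bitmap[p / 8] &= !(1 << (p % 8));
        self.hint = self.hint.min(p); // freed page becomes the next candidate
        Ok(())
    }
}
```

Pulling the hint back on free keeps allocations dense at the front of the file, which in turn keeps the bitmap scans short.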

Storage Limits

The storage engine supports configurable size limits to prevent runaway disk usage:

use featherdb_core::{Config, StorageLimitsConfig};

let config = Config::new("mydb.db")
    .with_max_database_size_mb(100)    // 100MB database limit
    .with_storage_limits(StorageLimitsConfig::new()
        .with_max_database_size(100 * 1024 * 1024)
        .with_max_wal_size(10 * 1024 * 1024)
        .with_safety_margin_percent(5)  // Reject at 95% capacity
    );

let storage = StorageEngine::open(config)?;

// Monitor storage quota
let quota = storage.storage_quota();
println!("Database: {} / {} bytes", quota.used_bytes, quota.limit_bytes.unwrap_or(0));
println!("WAL: {} / {} bytes", quota.wal_used_bytes, quota.wal_limit_bytes.unwrap_or(0));

if let Some(pct) = quota.usage_percent() {
    println!("Usage: {:.1}%", pct);
}

// Check if limit exceeded
if quota.is_exceeded() {
    eprintln!("Database size limit exceeded!");
}

StorageQuota API:

pub struct StorageQuota {
    pub used_bytes: u64,
    pub limit_bytes: Option<u64>,
    pub wal_used_bytes: u64,
    pub wal_limit_bytes: Option<u64>,
    pub disk_available_bytes: u64,
}

impl StorageQuota {
    pub fn remaining(&self) -> Option<u64>;
    pub fn usage_percent(&self) -> Option<f64>;
    pub fn is_exceeded(&self) -> bool;
    pub fn is_wal_exceeded(&self) -> bool;
}

Configuration Options

use featherdb_core::{Config, EvictionPolicyType, WalSyncMode, WalGroupCommitConfig};

let config = Config::new("mydb.db")
    // Page settings
    .with_page_size(4096)              // Page size in bytes

    // Buffer pool settings
    .with_buffer_pool_pages(16384)     // 64MB with 4KB pages
    .with_eviction_policy(EvictionPolicyType::Lirs)

    // WAL settings
    .with_wal_config(WalGroupCommitConfig {
        sync_mode: WalSyncMode::GroupCommit,
        group_commit_interval_ms: 10,
        group_commit_max_batch: 100,
    })

    // Storage limits
    .with_max_database_size_mb(100)    // 100MB limit

    // Compression
    .with_lz4_compression()
    // or: .with_zstd_compression_level(3)

    // Encryption
    .with_password("secret_password")
    // or: .with_encryption_key(key_bytes)

    // File options
    .with_create_if_missing(true);

Complete Usage Example

use featherdb_storage::{StorageEngine, BTree, BufferPool};
use featherdb_core::{Config, EvictionPolicyType, WalSyncMode, WalGroupCommitConfig};

// Configure storage
let config = Config::new("mydb.db")
    .with_buffer_pool_pages(1000)
    .with_eviction_policy(EvictionPolicyType::Lru2)
    .with_wal_config(WalGroupCommitConfig {
        sync_mode: WalSyncMode::GroupCommit,
        group_commit_interval_ms: 10,
        group_commit_max_batch: 50,
    })
    .with_lz4_compression();

// Open storage engine
let storage = StorageEngine::open(config)?;

// Create a B-tree index
let mut btree = BTree::new(storage.buffer_pool().clone())?;

// Insert data
for i in 0..1000 {
    let key = format!("key:{:05}", i);
    let value = format!("value:{}", i);
    btree.insert(key.as_bytes(), value.as_bytes())?;
}

// Point lookup
if let Some(value) = btree.get(b"key:00042")? {
    println!("Found: {}", String::from_utf8_lossy(&value));
}

// Range scan
println!("Keys 100-110:");
for result in btree.range(b"key:00100", b"key:00110")? {
    let (key, value) = result?;
    println!("  {} = {}",
        String::from_utf8_lossy(&key),
        String::from_utf8_lossy(&value));
}

// Check buffer pool stats
let stats = storage.buffer_pool().stats_snapshot();
println!("Buffer pool hit ratio: {:.2}%", stats.hit_ratio() * 100.0);

// Checkpoint for durability
let checkpoint_lsn = storage.checkpoint()?;
println!("Checkpoint at LSN: {:?}", checkpoint_lsn);

Performance Characteristics

Buffer Pool Sizing

  • Rule of thumb: 10-20% of expected working set
  • Minimum: At least enough for B-tree depth + active pages
  • Default: 64MB (16384 pages at 4KB)

Eviction Policy Selection

Workload                                     Recommended Policy
General OLTP                                 Clock (low overhead)
Sequential scans mixed with point queries    LRU-2
Complex workloads, adaptive                  LIRS

WAL Performance

Mode         Commits/sec (typical)   Data Safety
Immediate    100-500                 Fully durable
GroupCommit  5,000-20,000            Durable (batched)
NoSync       50,000+                 Risk of loss on crash

Compression Trade-offs

Algorithm   Speed       Ratio   Best For
None        N/A         1:1     Encrypted/random data
LZ4         Very Fast   2-3x    General use
ZSTD(3)     Fast        3-4x    Balanced
ZSTD(9+)    Slow        4-6x    Cold data, archival

File Format

Offset 0-511:      Superblock
Offset 512:        Padding to page boundary
Offset 4096:       Page 0 (reserved)
Offset 8192:       Page 1 (Schema catalog root)
Offset 12288:      Page 2 (Free list bitmap)
Offset 16384+:     Data pages (B-tree nodes, etc.)
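Per the table above, the superblock occupies the first page-sized region (512 bytes of data plus padding) and page N starts one page-size further in. A tiny offset helper, assuming the default 4KB page size and this layout:

```rust
/// Byte offset of page `page_id` in the database file:
/// the first `page_size` bytes hold the superblock plus padding.
fn page_offset(page_id: u64, page_size: u64) -> u64 {
    page_size + page_id * page_size
}
```

So with 4KB pages, page 2 (the free-list bitmap) lives at byte 12288, matching the table.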

Note: When encryption is enabled, each page on disk includes additional overhead for the authentication tag.

Testing

# Run all storage tests (using Make)
make test-crate CRATE=featherdb-storage

# Or with cargo directly
cargo test -p featherdb-storage

# Run with logging
RUST_LOG=debug cargo test -p featherdb-storage

# Run specific test
cargo test -p featherdb-storage test_btree_many_inserts

# Run benchmarks
cargo bench -p featherdb-storage

# Run coverage (from project root)
make coverage  # or: cargo llvm-cov --workspace

Module Exports

The crate exports the following public items:

// Core types
pub use page::{Page, PageType, SlotEntry};
pub use btree::BTree;
pub use buffer_pool::{BufferPool, BufferPoolStats, BufferPoolStatsSnapshot, PageGuard};
pub use eviction::{
    EvictionPolicy, EvictionPolicyType, EvictionPolicyFactory,
    ClockEviction, Lru2Eviction, LirsEviction
};
pub use wal::{Wal, WalRecord, WalRecordType, WalStats};
pub use file::FileManager;
pub use freelist::FreeList;
pub use compression::{
    CompressionType, CompressionStats, Compressor, PageCompressor,
    CompressedPageHeader, create_compressor,
};