featherdb-mvcc 1.0.0

MVCC transaction layer with snapshot isolation for FeatherDB

FeatherDB MVCC Layer

Multi-Version Concurrency Control (MVCC) provides snapshot isolation with single-writer, multiple-reader concurrency.

Overview

MVCC allows multiple transactions to read the database concurrently while a single writer makes modifications, with neither side blocking the other. Each transaction sees a consistent snapshot of the database as of its start time, enabling:

  • Non-blocking reads: Readers never wait for writers
  • Non-blocking writes: Writers never wait for readers
  • Consistent snapshots: Each transaction sees a stable view of data
  • Conflict detection: Write-write conflicts are detected and prevented

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                        TransactionManager                                │
│  - Assigns unique transaction IDs                                        │
│  - Tracks active and aborted transactions                                │
│  - Enforces single-writer semantics via write lock                       │
│  - Manages version store for old versions                                │
│  - Provides GC watermark based on oldest active transaction              │
└────────────────────────┬────────────────────────────────────────────────┘
                         │
        ┌────────────────┼────────────────┬────────────────────┐
        │                │                │                    │
        ▼                ▼                ▼                    ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐
│ Transaction  │  │   Snapshot   │  │ VersionStore │  │  AbortedTxnSet   │
│              │  │              │  │              │  │                  │
│ - id         │  │ - txn_id     │  │ - versions   │  │ Tracks aborted   │
│ - mode       │  │ - hwm        │  │ - next_ptr   │  │ transaction IDs  │
│ - snapshot   │  │ - in_progress│  │              │  │ for visibility   │
│ - manager ref│  │ - ancestors  │  │              │  │ checks           │
└──────────────┘  └──────────────┘  └──────────────┘  └──────────────────┘

Key Components

1. TransactionManager (transaction.rs)

Central coordinator for all transactions in the database.

pub struct TransactionManager {
    next_txn_id: AtomicU64,                    // ID generator
    active_txns: RwLock<HashMap<...>>,         // Active transaction tracking
    write_lock: Mutex<()>,                     // Single-writer enforcement
    oldest_active: AtomicU64,                  // GC watermark
    version_store: RwLock<VersionStore>,       // Old versions storage
    aborted_txns: RwLock<HashSet<TransactionId>>, // Aborted transaction IDs
}

Responsibilities:

  • Assign unique, monotonically increasing transaction IDs
  • Track active transactions for snapshot creation
  • Track aborted transactions for visibility checks
  • Enforce single-writer semantics via a mutex
  • Provide the oldest active transaction ID for garbage collection
  • Coordinate transaction commit and abort (with optional WAL integration)

2. Transaction (transaction.rs)

Represents a unit of work in the database.

pub struct Transaction<'a> {
    id: TransactionId,
    mode: TransactionMode,     // ReadOnly or ReadWrite
    snapshot: Snapshot,
    manager: &'a TransactionManager,
}

Transaction Modes:

  • TransactionMode::ReadOnly - Cannot perform writes; write attempts return an error
  • TransactionMode::ReadWrite - Full read and write capability

Key Features:

  • Auto-abort on drop: If a transaction is dropped without explicit commit, it is automatically aborted
  • WAL integration: Commit and abort can optionally log to the Write-Ahead Log

3. Snapshot (snapshot.rs)

Represents a point-in-time view of the database for consistent reads.

pub struct Snapshot {
    txn_id: TransactionId,                      // This transaction's ID
    high_water_mark: TransactionId,             // All IDs >= this are invisible
    in_progress_at_start: HashSet<TransactionId>, // Transactions active at snapshot time
    visible_ancestors: HashSet<TransactionId>,  // For savepoint/sub-transaction support
}

Visibility Rules:

A transaction can see changes made by created_by if:

fn can_see(&self, created_by: TransactionId) -> bool {
    // 1. Own changes are always visible
    if created_by == self.txn_id { return true; }

    // 2. Ancestor transactions are visible (savepoint support)
    if self.visible_ancestors.contains(&created_by) { return true; }

    // 3. Future transactions are never visible
    if created_by >= self.high_water_mark { return false; }

    // 4. Transactions in-progress at snapshot start are invisible
    !self.in_progress_at_start.contains(&created_by)
}

Row Visibility:

A row is visible if we can see its creation AND cannot see its deletion:

fn is_visible(&self, created_by: TransactionId, deleted_by: Option<TransactionId>) -> bool {
    // Must see the creation
    if !self.can_see(created_by) { return false; }

    // If deleted, must NOT see the deletion
    if let Some(deleted_by) = deleted_by {
        if self.can_see(deleted_by) { return false; }
    }

    true
}
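
The two checks above can be exercised end-to-end with a minimal, self-contained sketch. The Snapshot struct below mirrors the fields shown earlier, but it is an illustrative stand-in, not the crate's actual type:

```rust
use std::collections::HashSet;

type TransactionId = u64;

// Illustrative stand-in for the crate's Snapshot; same fields as shown above.
struct Snapshot {
    txn_id: TransactionId,
    high_water_mark: TransactionId,
    in_progress_at_start: HashSet<TransactionId>,
    visible_ancestors: HashSet<TransactionId>,
}

impl Snapshot {
    fn can_see(&self, created_by: TransactionId) -> bool {
        if created_by == self.txn_id { return true; }
        if self.visible_ancestors.contains(&created_by) { return true; }
        if created_by >= self.high_water_mark { return false; }
        !self.in_progress_at_start.contains(&created_by)
    }

    fn is_visible(&self, created_by: TransactionId, deleted_by: Option<TransactionId>) -> bool {
        if !self.can_see(created_by) { return false; }
        if let Some(d) = deleted_by {
            if self.can_see(d) { return false; }
        }
        true
    }
}

fn main() {
    // Txn 10 took its snapshot while txn 7 was still in progress (HWM = 10).
    let snap = Snapshot {
        txn_id: 10,
        high_water_mark: 10,
        in_progress_at_start: [7].into_iter().collect(),
        visible_ancestors: HashSet::new(),
    };
    assert!(snap.can_see(10));             // rule 1: own changes
    assert!(snap.can_see(5));              // committed before the snapshot
    assert!(!snap.can_see(7));             // rule 4: in progress at start
    assert!(!snap.can_see(12));            // rule 3: future transaction
    assert!(snap.is_visible(5, None));     // live row from an old txn
    assert!(!snap.is_visible(5, Some(6))); // deletion by txn 6 is visible
    assert!(snap.is_visible(5, Some(12))); // deletion only "in the future"
}
```

Note how a row "deleted" by a future transaction (txn 12) is still visible: the deletion falls above the high water mark, so rule 3 hides it.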

4. Version Chain (version.rs)

Stores multiple versions of data for MVCC reads.

VersionedValue - Current (latest) version with MVCC metadata:

pub struct VersionedValue {
    current: Vec<Value>,               // Latest version data
    created_by: TransactionId,         // Transaction that created this version
    deleted_by: Option<TransactionId>, // Transaction that deleted this row (if any)
    version_chain: Option<VersionPtr>, // Link to older versions
}

OldVersion - Historical versions in the version chain:

pub struct OldVersion {
    data: Vec<Value>,
    created_by: TransactionId,
    deleted_by: Option<TransactionId>,
    prev_version: Option<VersionPtr>,  // Link to even older version
}

Version Chain Structure:

Current Version ──→ OldVersion₁ ──→ OldVersion₂ ──→ ... ──→ None
    (newest)                                            (oldest)
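
A chain walk of this shape can be sketched over an index-based store. Treating VersionPtr as a plain Vec index and passing a closure in place of Snapshot::is_visible are simplifying assumptions for illustration, not the crate's actual layout:

```rust
type TransactionId = u64;
type VersionPtr = usize; // assumed here: a plain index into the version store

struct OldVersion {
    data: Vec<i64>,
    created_by: TransactionId,
    deleted_by: Option<TransactionId>,
    prev_version: Option<VersionPtr>,
}

/// Walk the chain newest-to-oldest and return the first version accepted
/// by the predicate (standing in for Snapshot::is_visible).
fn find_visible(
    store: &[OldVersion],
    mut ptr: Option<VersionPtr>,
    is_visible: impl Fn(TransactionId, Option<TransactionId>) -> bool,
) -> Option<&OldVersion> {
    while let Some(p) = ptr {
        let v = &store[p];
        if is_visible(v.created_by, v.deleted_by) {
            return Some(v);
        }
        ptr = v.prev_version; // fall through to the next-older version
    }
    None
}

fn main() {
    // Chain: index 1 (txn 9's version) -> index 0 (txn 3's version,
    // marked deleted by txn 9).
    let store = vec![
        OldVersion { data: vec![1], created_by: 3, deleted_by: Some(9), prev_version: None },
        OldVersion { data: vec![2], created_by: 9, deleted_by: None, prev_version: Some(0) },
    ];
    // A snapshot with high-water mark 9 cannot see txn 9 at all: it skips
    // the newest version and also ignores txn 9's delete on the older one.
    let sees = |c: TransactionId, d: Option<TransactionId>| c < 9 && d.map_or(true, |d| d >= 9);
    let v = find_visible(&store, Some(1), sees).unwrap();
    assert_eq!(v.data, vec![1]);
}
```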

VersionChain Iterator:

The VersionChain struct provides an iterator for walking the version chain and finding visible versions:

let mut chain = VersionChain::new(&version_store, start_ptr);
if let Some(visible_version) = chain.find_visible(&snapshot) {
    // Use the visible version
}

5. Garbage Collection

Removes old versions that no active transaction can see.

GC Rules:

  • A version is garbage if deleted_by < oldest_active_txn
  • This means all active (and future) transactions can see the deletion
  • The old version will never be needed again

pub fn gc(&mut self, oldest_active: Option<TransactionId>) -> GcStats {
    // If no active transactions, collect all deleted versions
    // Otherwise, collect versions where deleted_by < oldest_active
}

GcStats:

pub struct GcStats {
    versions_removed: usize,  // Number of versions collected
    bytes_freed: usize,       // Approximate memory freed
}
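
The collection rule can be sketched as a single retain pass over a flat version list. The Vec-based store and the per-version byte count are illustrative assumptions; the crate's actual store layout is not shown here:

```rust
type TransactionId = u64;

// Mirrors the GcStats struct shown above.
struct GcStats { versions_removed: usize, bytes_freed: usize }

struct OldVersion {
    deleted_by: Option<TransactionId>,
    bytes: usize, // approximate size, for bytes_freed accounting
}

fn gc(versions: &mut Vec<OldVersion>, oldest_active: Option<TransactionId>) -> GcStats {
    let mut stats = GcStats { versions_removed: 0, bytes_freed: 0 };
    versions.retain(|v| {
        let garbage = match (v.deleted_by, oldest_active) {
            (Some(_), None) => true,               // no active txns: every deleted version goes
            (Some(d), Some(oldest)) => d < oldest, // deletion is below the GC watermark
            (None, _) => false,                    // live version: always keep
        };
        if garbage {
            stats.versions_removed += 1;
            stats.bytes_freed += v.bytes;
        }
        !garbage
    });
    stats
}

fn main() {
    let mut versions = vec![
        OldVersion { deleted_by: Some(3), bytes: 64 }, // below watermark 5: collectable
        OldVersion { deleted_by: Some(8), bytes: 64 }, // deletion not yet visible to all: keep
        OldVersion { deleted_by: None, bytes: 64 },    // live: keep
    ];
    let stats = gc(&mut versions, Some(5));
    assert_eq!(stats.versions_removed, 1);
    assert_eq!(versions.len(), 2);
}
```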

Running GC:

let stats = tm.run_gc();
println!("Removed {} versions, freed ~{} bytes",
         stats.versions_removed, stats.bytes_freed);

Isolation Level

The MVCC implementation provides Snapshot Isolation:

Property              Behavior
--------------------  --------------------------------------------------
Dirty Reads           Prevented - uncommitted changes are invisible
Non-repeatable Reads  Prevented - snapshot is fixed at transaction start
Phantom Reads         Prevented - snapshot is fixed at transaction start
Write Skew            Possible - not prevented by snapshot isolation

Note: Snapshot isolation is weaker than Serializable isolation. Write skew anomalies can occur when two transactions read overlapping data and make disjoint writes based on what they read.
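
The classic "doctors on call" example makes the anomaly concrete. This sequential simulation (the scenario and the write_skew helper are illustrative, not part of the crate) shows two transactions reading the same snapshot and making disjoint writes that jointly break an invariant:

```rust
/// Invariant: at least one of two doctors must stay on call.
/// Both transactions read from the SAME snapshot, not each other's writes.
fn write_skew(snapshot: [bool; 2]) -> [bool; 2] {
    // Txn A: "someone else is still on call, so I may go off call".
    let a_goes_off = snapshot.iter().filter(|&&on| on).count() >= 2;
    // Txn B: same reasoning, against ITS OWN snapshot.
    let b_goes_off = snapshot.iter().filter(|&&on| on).count() >= 2;

    // Disjoint writes: A updates slot 0, B updates slot 1. No write-write
    // conflict exists, so snapshot isolation lets both commit.
    let mut state = snapshot;
    if a_goes_off { state[0] = false; }
    if b_goes_off { state[1] = false; }
    state
}

fn main() {
    // Both doctors start on call; each txn sees two on call and goes off.
    // The invariant is violated even though both commits were "conflict-free".
    assert_eq!(write_skew([true, true]), [false, false]);
    // With only one on call, neither txn's condition holds: no anomaly.
    assert_eq!(write_skew([true, false]), [true, false]);
}
```

Serializable isolation would abort one of the two transactions here; snapshot isolation cannot, because neither wrote a row the other read-then-wrote.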

Write-Write Conflict Detection

Write-write conflicts are prevented through single-writer serialization:

// Only one read-write transaction can hold the write lock at a time
pub fn acquire_write_lock(&self) -> impl Drop + '_ {
    self.write_lock.lock()
}

Conflict detection at the row level:

// When attempting to modify a row, check if it was modified after our snapshot
if !self.snapshot.can_see(current_version.created_by) {
    // Another transaction modified this row concurrently with, or after,
    // our snapshot
    return Err(Error::WriteConflict);
}

Aborted Transaction Handling

The implementation tracks aborted transactions to ensure their changes remain invisible:

// Check visibility accounting for aborted transactions
pub fn is_row_visible(
    &self,
    snapshot: &Snapshot,
    created_by: TransactionId,
    deleted_by: Option<TransactionId>,
) -> bool {
    // If creator was aborted, row is invisible
    if self.is_aborted(created_by) { return false; }

    // If deleter was aborted, treat as not deleted
    let effective_deleted_by = match deleted_by {
        Some(del_by) if self.is_aborted(del_by) => None,
        other => other,
    };

    snapshot.is_visible(created_by, effective_deleted_by)
}

Cleanup of old aborted transaction IDs:

// Remove aborted txn IDs older than any active transaction's snapshot
tm.cleanup_aborted_txns(older_than_txn_id);

Transaction Lifecycle

1. BEGIN
   ├── Assign unique TransactionId
   ├── Create Snapshot (capture high_water_mark, in_progress set)
   └── Register in active_txns map

2. READ (can repeat)
   ├── Check current version visibility using snapshot
   ├── If not visible, walk version chain
   └── Return first visible version (or not found)

3. WRITE (can repeat, ReadWrite mode only)
   ├── Acquire write lock (blocks other writers)
   ├── Check for write-write conflict
   ├── Create new version with our txn_id as created_by
   └── Move old version to version chain

4. COMMIT or ABORT
   ├── COMMIT:
   │   ├── Log to WAL (if provided)
   │   ├── Remove from active_txns
   │   └── Update oldest_active watermark
   └── ABORT:
       ├── Log to WAL (if provided)
       ├── Remove from active_txns
       ├── Add to aborted_txns set
       └── Update oldest_active watermark

5. DROP (if not committed/aborted)
   └── Auto-abort the transaction

Usage Examples

Basic Transaction Usage

use featherdb_mvcc::{TransactionManager, TransactionMode};

let tm = TransactionManager::new();

// Start a read-write transaction
let txn = tm.begin(TransactionMode::ReadWrite);
println!("Transaction ID: {:?}", txn.id());

// Check if we can write (will error for read-only transactions)
txn.check_write()?;

// Check visibility of another transaction's changes
if txn.can_see(some_txn_id) {
    // Changes from some_txn_id are visible
}

// Commit the transaction
txn.commit()?;

Read-Only Transaction

let txn = tm.begin_read_only();

// Read operations here...

// Attempting to write will fail
assert!(txn.check_write().is_err());

// Read-only transactions can commit or abort
txn.commit()?;

Snapshot Visibility Check

let snapshot = txn.snapshot();

// Check if a version is visible
if snapshot.is_visible(created_by, deleted_by) {
    // The row is visible to this transaction
}

Running Garbage Collection

// Run GC manually
let stats = tm.run_gc();
println!("GC removed {} versions, ~{} bytes freed",
         stats.versions_removed, stats.bytes_freed);

// Check version store size
println!("Remaining old versions: {}", tm.version_count());
println!("Active transactions: {}", tm.active_count());
println!("Tracked aborted transactions: {}", tm.aborted_count());

WAL Integration

use featherdb_storage::Wal;

let mut wal = Wal::open("data/wal")?;
let txn = tm.begin_read_write();

// ... perform operations ...

// Commit with WAL logging
txn.commit_with_wal(&mut wal)?;

// Or abort with WAL logging
// txn.abort_with_wal(&mut wal)?;

Savepoint/Sub-transaction Support

use std::collections::HashSet;

// Create a snapshot with visible ancestors (for sub-transactions)
let mut ancestors = HashSet::new();
ancestors.insert(parent_txn_id);

let snapshot = Snapshot::with_ancestors(
    child_txn_id,
    high_water_mark,
    in_progress,
    ancestors,
);

// The child snapshot can now see the parent's uncommitted changes
assert!(snapshot.can_see(parent_txn_id));

Concurrency Guarantees

Operation    Behavior
-----------  --------------------------------------------------------------
Read-Read    No blocking - concurrent reads always allowed
Read-Write   No blocking - readers see their snapshot
Write-Read   No blocking - writer's uncommitted changes invisible to readers
Write-Write  Serialized - single writer lock prevents concurrent writes

Configuration Options

Currently, the MVCC layer uses sensible defaults. Configurable options include:

Option         Default            Description
-------------  -----------------  -------------------------------
Initial TxnID  1                  Starting transaction ID
Write Lock     Mutex              Single-writer enforcement
Version Store  In-memory HashMap  Storage for old versions

Performance Considerations

  1. Version Chain Length: Long chains slow down reads as they require traversal. Run GC regularly to keep chains short.

  2. Active Transaction Count: Long-running transactions prevent GC from reclaiming old versions. Keep transactions short.

  3. Write Lock Contention: All writes are serialized through a single lock. This simplifies implementation but can bottleneck write-heavy workloads.

  4. Aborted Transaction Set: The aborted transaction set grows until cleanup. Call cleanup_aborted_txns() periodically.

  5. Snapshot Memory: Each snapshot stores the set of in-progress transactions. Many concurrent transactions increase memory usage.

Testing

# Run all MVCC tests (using Make)
make test-crate CRATE=featherdb-mvcc

# Or with cargo directly
cargo test -p featherdb-mvcc

# Test specific components
cargo test -p featherdb-mvcc snapshot
cargo test -p featherdb-mvcc transaction
cargo test -p featherdb-mvcc version

# Run with output
cargo test -p featherdb-mvcc -- --nocapture

# Run coverage (from project root)
make coverage  # or: cargo llvm-cov --workspace

Transaction Timeout

Transactions can be configured with automatic timeouts to prevent long-running transactions from blocking the system.

Configuration

use featherdb_core::TransactionConfig;
use std::time::Duration;

// Create config with 5 second timeout
let config = TransactionConfig::new()
    .with_timeout(Duration::from_secs(5).as_millis() as u64);

let txn = tm.begin_with_config(config);

Timeout Behavior

When a transaction exceeds its configured timeout:

  • All subsequent operations fail with Error::TransactionTimeout
  • The transaction must be rolled back
  • Timeout is checked automatically on every operation
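
The bookkeeping behind these checks can be sketched with std::time::Instant. The TxnTimer type below mirrors the method names in the API that follows, but it is an illustrative stand-in, not the crate's implementation:

```rust
use std::time::{Duration, Instant};

// Illustrative timeout bookkeeping for a transaction.
struct TxnTimer {
    started: Instant,
    timeout: Option<Duration>, // None = no timeout configured
}

impl TxnTimer {
    fn new(timeout_ms: Option<u64>) -> Self {
        Self { started: Instant::now(), timeout: timeout_ms.map(Duration::from_millis) }
    }
    fn elapsed(&self) -> Duration {
        self.started.elapsed()
    }
    fn is_expired(&self) -> bool {
        // Checked on every operation; a txn with no timeout never expires.
        self.timeout.map_or(false, |t| self.elapsed() >= t)
    }
    fn remaining_time(&self) -> Option<Duration> {
        self.timeout.map(|t| t.saturating_sub(self.elapsed()))
    }
}

fn main() {
    let timer = TxnTimer::new(Some(5_000));
    assert!(!timer.is_expired()); // just started: well under 5s
    assert!(timer.remaining_time().is_some());

    let untimed = TxnTimer::new(None);
    assert!(!untimed.is_expired()); // no timeout configured
    assert_eq!(untimed.remaining_time(), None);
}
```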

Timeout API

// Check if transaction has expired
if txn.is_expired() {
    println!("Transaction timed out!");
}

// Get remaining time before timeout
if let Some(remaining) = txn.remaining_time() {
    println!("{}ms remaining", remaining.as_millis());
}

// Get elapsed time since transaction started
let elapsed = txn.elapsed();
println!("Running for {}ms", elapsed.as_millis());

Error Handling

match txn.execute("UPDATE accounts SET balance = 100") {
    Ok(_) => println!("Success"),
    Err(Error::TransactionTimeout { transaction_id, elapsed_ms, timeout_ms }) => {
        println!("Transaction {} timed out after {}ms (limit: {}ms)",
                 transaction_id, elapsed_ms, timeout_ms);
        txn.rollback()?;
    },
    Err(e) => println!("Other error: {}", e),
}

Deadlock Detection

Deadlock detection prevents circular wait conditions between transactions using cycle detection in a wait-for graph.

How It Works

  1. Wait Graph: Tracks which transactions are waiting for which other transactions
  2. Cycle Detection: Uses DFS to detect cycles in the wait graph
  3. Victim Selection: Aborts the youngest transaction (highest ID) in the cycle
  4. Automatic Resolution: Deadlocks are detected and resolved automatically
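
Steps 1-3 can be sketched as a self-contained wait-for graph with DFS cycle detection and youngest-victim selection. This is an illustrative stand-in for the crate's WaitGraph, not its actual implementation:

```rust
use std::collections::{HashMap, HashSet};

type TxnId = u64;

#[derive(Default)]
struct WaitGraph {
    edges: HashMap<TxnId, HashSet<TxnId>>, // waiter -> set of holders
}

impl WaitGraph {
    fn add_wait(&mut self, waiter: TxnId, holder: TxnId) {
        self.edges.entry(waiter).or_default().insert(holder);
    }
    fn remove_transaction(&mut self, txn: TxnId) {
        self.edges.remove(&txn);
        for holders in self.edges.values_mut() {
            holders.remove(&txn);
        }
    }
    /// DFS from each node; returns the nodes on the first cycle found.
    fn detect_cycle(&self) -> Option<Vec<TxnId>> {
        for &start in self.edges.keys() {
            let (mut path, mut seen) = (Vec::new(), HashSet::new());
            if let Some(c) = self.dfs(start, &mut path, &mut seen) {
                return Some(c);
            }
        }
        None
    }
    fn dfs(&self, node: TxnId, path: &mut Vec<TxnId>, seen: &mut HashSet<TxnId>) -> Option<Vec<TxnId>> {
        if let Some(pos) = path.iter().position(|&n| n == node) {
            return Some(path[pos..].to_vec()); // back-edge onto the stack: a cycle
        }
        if !seen.insert(node) {
            return None; // already fully explored from another branch
        }
        path.push(node);
        for &next in self.edges.get(&node).into_iter().flatten() {
            if let Some(c) = self.dfs(next, path, seen) {
                return Some(c);
            }
        }
        path.pop();
        None
    }
    /// Youngest transaction (highest ID) in the cycle is the victim.
    fn select_victim(&self, cycle: &[TxnId]) -> TxnId {
        *cycle.iter().max().expect("cycle is non-empty")
    }
}

fn main() {
    let mut g = WaitGraph::default();
    g.add_wait(1, 2);
    g.add_wait(2, 3);
    g.add_wait(3, 1); // closes the cycle 1 -> 2 -> 3 -> 1
    let cycle = g.detect_cycle().expect("deadlock expected");
    assert_eq!(g.select_victim(&cycle), 3); // youngest in the cycle
    g.remove_transaction(3); // abort the victim
    assert!(g.detect_cycle().is_none()); // deadlock resolved
}
```

Aborting the youngest transaction is a common heuristic: it has typically done the least work, so rolling it back is cheapest.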

WaitGraph API

use featherdb_mvcc::WaitGraph;

let mut graph = WaitGraph::new();

// Record that txn1 is waiting for txn2
graph.add_wait(txn1, txn2);
graph.add_wait(txn2, txn3);
graph.add_wait(txn3, txn1); // Creates a cycle

// Detect the cycle
if let Some(cycle) = graph.detect_cycle() {
    println!("Deadlock detected: {:?}", cycle);
    let victim = graph.select_victim(&cycle);
    println!("Aborting transaction {}", victim);
}

// Clean up when transaction completes
graph.remove_transaction(txn1);

Integration Example

// Deadlock detection is integrated into the transaction manager
// When a transaction waits for a lock:
wait_graph.add_wait(waiter_txn, holder_txn);

if let Some(cycle) = wait_graph.detect_cycle() {
    let victim = wait_graph.select_victim(&cycle);
    // Abort the victim transaction
    tm.abort(victim)?;
    return Err(Error::Deadlock);
}

Long-Running Transaction Detection

Monitor long-running transactions to identify potential GC blockers and performance issues.

Detection API

use std::time::Duration;

// Find transactions running longer than 5 seconds
let long_running = tm.long_running_transactions(Duration::from_secs(5));

for txn_info in &long_running {
    println!("Transaction {}: running for {:?}", txn_info.id, txn_info.age);
}

GC Status Monitoring

// Check if GC is blocked
let gc_status = tm.gc_status();

if gc_status.blocked {
    println!("GC blocked by transaction {:?}", gc_status.oldest_active_txn);
    println!("Reason: {:?}", gc_status.blocked_reason);
}

TransactionInfo

pub struct TransactionInfo {
    pub id: u64,           // Transaction ID
    pub age: Duration,     // How long it has been running
}

GcStatus

pub struct GcStatus {
    pub oldest_active_txn: Option<u64>,        // Oldest transaction blocking GC
    pub blocked: bool,                         // Whether GC is blocked
    pub blocked_reason: Option<String>,        // Human-readable reason
}

Use Cases

Database Monitoring:

// Periodic health check
let long_txns = tm.long_running_transactions(Duration::from_secs(5));
if !long_txns.is_empty() {
    warn!("Found {} long-running transactions", long_txns.len());
}

GC Health Check:

let status = tm.gc_status();
if status.blocked {
    info!("GC blocked: {}", status.blocked_reason.unwrap());
    // Take action: notify admin, log warning, etc.
}

Performance Analysis:

// Correlate long-running transactions with GC blocking
let long_running = tm.long_running_transactions(Duration::from_secs(1));
let gc_status = tm.gc_status();

if let Some(oldest) = gc_status.oldest_active_txn {
    if long_running.iter().any(|t| t.id == oldest) {
        warn!("Long-running transaction {} is blocking GC!", oldest);
    }
}

Adaptive Garbage Collection

Adaptive GC automatically adjusts collection frequency based on workload patterns and metrics.

Configuration

use featherdb_mvcc::gc::{GcConfig, GcScheduler};

let config = GcConfig {
    min_interval_ms: 1000,           // Minimum 1 second between GC runs
    max_versions_before_gc: 10000,   // Trigger GC at 10k versions
    adaptive: true,                  // Enable adaptive scheduling
};

let mut scheduler = GcScheduler::new(config);

GC Metrics

// Get current metrics
let metrics = scheduler.metrics().snapshot();

println!("Versions created: {}", metrics.versions_created);
println!("Versions cleaned: {}", metrics.versions_cleaned);
println!("GC runs: {}", metrics.gc_runs);
println!("GC time: {}μs", metrics.gc_time_us);
println!("Pending: {}", metrics.versions_pending);

Adaptive Scheduling

// Check if GC should run based on adaptive heuristics
if scheduler.should_run_gc() {
    let start = Instant::now();
    let stats = tm.run_gc();
    scheduler.record_gc_run(start.elapsed());
    scheduler.record_versions_cleaned(stats.versions_removed as u64);
}

// Record version creation
scheduler.metrics().record_version_created();

How Adaptive GC Works

  1. Metrics Tracking: Monitors versions created, cleaned, and pending
  2. Heuristics: Decides when to run GC based on:
    • Time since last GC run (respects min_interval)
    • Number of pending versions (respects max_versions_before_gc)
    • Historical GC efficiency
  3. Performance: Prevents unnecessary GC runs when system is idle
  4. Safety: Ensures GC runs when memory pressure is high
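
The interplay of the interval floor and the version-count ceiling can be sketched as a small scheduler. The GcConfig/GcScheduler shapes mirror the earlier examples, but this is an illustrative sketch under those assumptions, not the crate's implementation:

```rust
use std::time::{Duration, Instant};

struct GcConfig {
    min_interval: Duration,      // never run GC more often than this
    max_versions_before_gc: u64, // run once this many versions are pending
}

struct GcScheduler {
    config: GcConfig,
    last_run: Option<Instant>,
    versions_created: u64,
    versions_cleaned: u64,
}

impl GcScheduler {
    fn new(config: GcConfig) -> Self {
        Self { config, last_run: None, versions_created: 0, versions_cleaned: 0 }
    }
    fn record_version_created(&mut self) {
        self.versions_created += 1;
    }
    fn pending(&self) -> u64 {
        self.versions_created - self.versions_cleaned
    }
    fn should_run_gc(&self) -> bool {
        // Floor: never run twice within the minimum interval...
        if let Some(last) = self.last_run {
            if last.elapsed() < self.config.min_interval {
                return false;
            }
        }
        // ...ceiling: but do run once enough versions have piled up.
        self.pending() >= self.config.max_versions_before_gc
    }
    fn record_gc_run(&mut self, cleaned: u64) {
        self.last_run = Some(Instant::now());
        self.versions_cleaned += cleaned;
    }
}

fn main() {
    let mut s = GcScheduler::new(GcConfig {
        min_interval: Duration::from_millis(0),
        max_versions_before_gc: 2,
    });
    assert!(!s.should_run_gc()); // idle system: nothing pending
    s.record_version_created();
    s.record_version_created();
    assert!(s.should_run_gc()); // threshold reached
    s.record_gc_run(2);
    assert!(!s.should_run_gc()); // backlog cleared: idle again
}
```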

Future Improvements

  • First-committer-wins write-write conflict detection
  • Serializable isolation level (prevent write skew)
  • Multi-version indexes
  • Background GC thread with configurable interval
  • Read-only transaction optimization (skip active tracking)