# FeatherDB MVCC Layer
Multi-Version Concurrency Control (MVCC) provides snapshot isolation with single-writer, multiple-reader concurrency.
## Overview
MVCC allows multiple transactions to read the database concurrently while a single writer makes modifications, without blocking each other. Each transaction sees a consistent snapshot of the database as of its start time, enabling:
- **Non-blocking reads**: Readers never wait for writers
- **Non-blocking writes**: Writers never wait for readers
- **Consistent snapshots**: Each transaction sees a stable view of data
- **Conflict detection**: Write-write conflicts are detected and prevented
## Architecture
```
┌─────────────────────────────────────────────────────────────────────────┐
│ TransactionManager │
│ - Assigns unique transaction IDs │
│ - Tracks active and aborted transactions │
│ - Enforces single-writer semantics via write lock │
│ - Manages version store for old versions │
│ - Provides GC watermark based on oldest active transaction │
└────────────────────────┬───────────────────────────────────────────────-┘
│
┌────────────────┼────────────────┬────────────────────┐
│ │ │ │
▼ ▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐
│ Transaction │ │ Snapshot │ │ VersionStore │ │ AbortedTxnSet │
│ │ │ │ │ │ │ │
│ - id │ │ - txn_id │ │ - versions │ │ Tracks aborted │
│ - mode │ │ - hwm │ │ - next_ptr │ │ transaction IDs │
│ - snapshot │ │ - in_progress│ │ │ │ for visibility │
│ - manager ref│ │ - ancestors │ │ │ │ checks │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────────┘
```
## Key Components
### 1. TransactionManager (`transaction.rs`)
Central coordinator for all transactions in the database.
```rust
pub struct TransactionManager {
next_txn_id: AtomicU64, // ID generator
active_txns: RwLock<HashMap<...>>, // Active transaction tracking
write_lock: Mutex<()>, // Single-writer enforcement
oldest_active: AtomicU64, // GC watermark
version_store: RwLock<VersionStore>, // Old versions storage
aborted_txns: RwLock<HashSet<TransactionId>>, // Aborted transaction IDs
}
```
**Responsibilities:**
- Assign unique, monotonically increasing transaction IDs
- Track active transactions for snapshot creation
- Track aborted transactions for visibility checks
- Enforce single-writer semantics via a mutex
- Provide the oldest active transaction ID for garbage collection
- Coordinate transaction commit and abort (with optional WAL integration)
### 2. Transaction (`transaction.rs`)
Represents a unit of work in the database.
```rust
pub struct Transaction<'a> {
id: TransactionId,
mode: TransactionMode, // ReadOnly or ReadWrite
snapshot: Snapshot,
manager: &'a TransactionManager,
}
```
**Transaction Modes:**
- `TransactionMode::ReadOnly` - Cannot perform writes, returns error on write attempts
- `TransactionMode::ReadWrite` - Full read and write capability
**Key Features:**
- **Auto-abort on drop**: If a transaction is dropped without explicit commit, it is automatically aborted
- **WAL integration**: Commit and abort can optionally log to the Write-Ahead Log
### 3. Snapshot (`snapshot.rs`)
Represents a point-in-time view of the database for consistent reads.
```rust
pub struct Snapshot {
txn_id: TransactionId, // This transaction's ID
high_water_mark: TransactionId, // All IDs >= this are invisible
in_progress_at_start: HashSet<TransactionId>, // Transactions active at snapshot time
visible_ancestors: HashSet<TransactionId>, // For savepoint/sub-transaction support
}
```
**Visibility Rules:**
A transaction can see changes made by `created_by` if:
```rust
fn can_see(&self, created_by: TransactionId) -> bool {
// 1. Own changes are always visible
if created_by == self.txn_id { return true; }
// 2. Ancestor transactions are visible (savepoint support)
if self.visible_ancestors.contains(&created_by) { return true; }
// 3. Future transactions are never visible
if created_by >= self.high_water_mark { return false; }
// 4. Transactions in-progress at snapshot start are invisible
!self.in_progress_at_start.contains(&created_by)
}
```
**Row Visibility:**
A row is visible if we can see its creation AND cannot see its deletion:
```rust
fn is_visible(&self, created_by: TransactionId, deleted_by: Option<TransactionId>) -> bool {
// Must see the creation
if !self.can_see(created_by) { return false; }
// If deleted, must NOT see the deletion
if let Some(deleted_by) = deleted_by {
if self.can_see(deleted_by) { return false; }
}
true
}
```
### 4. Version Chain (`version.rs`)
Stores multiple versions of data for MVCC reads.
**VersionedValue** - Current (latest) version with MVCC metadata:
```rust
pub struct VersionedValue {
current: Vec<Value>, // Latest version data
created_by: TransactionId, // Transaction that created this version
deleted_by: Option<TransactionId>, // Transaction that deleted this row (if any)
version_chain: Option<VersionPtr>, // Link to older versions
}
```
**OldVersion** - Historical versions in the version chain:
```rust
pub struct OldVersion {
data: Vec<Value>,
created_by: TransactionId,
deleted_by: Option<TransactionId>,
prev_version: Option<VersionPtr>, // Link to even older version
}
```
**Version Chain Structure:**
```
Current Version ──→ OldVersion₁ ──→ OldVersion₂ ──→ ... ──→ None
(newest) (oldest)
```
**VersionChain Iterator:**
The `VersionChain` struct provides an iterator for walking the version chain and finding visible versions:
```rust
let mut chain = VersionChain::new(&version_store, start_ptr);
if let Some(visible_version) = chain.find_visible(&snapshot) {
// Use the visible version
}
```
### 5. Garbage Collection
Removes old versions that no active transaction can see.
**GC Rules:**
- A version is garbage if `deleted_by < oldest_active_txn`
- This means all active (and future) transactions can see the deletion
- The old version will never be needed again
```rust
pub fn gc(&mut self, oldest_active: Option<TransactionId>) -> GcStats {
// If no active transactions, collect all deleted versions
// Otherwise, collect versions where deleted_by < oldest_active
}
```
**GcStats:**
```rust
pub struct GcStats {
versions_removed: usize, // Number of versions collected
bytes_freed: usize, // Approximate memory freed
}
```
**Running GC:**
```rust
let stats = tm.run_gc();
println!("Removed {} versions, freed ~{} bytes",
stats.versions_removed, stats.bytes_freed);
```
## Isolation Level
The MVCC implementation provides **Snapshot Isolation**:
| Dirty Reads | Prevented - uncommitted changes are invisible |
| Non-repeatable Reads | Prevented - snapshot is fixed at transaction start |
| Phantom Reads | Prevented - snapshot is fixed at transaction start |
| Write Skew | Possible - not prevented by snapshot isolation |
**Note:** Snapshot isolation is weaker than Serializable isolation. Write skew anomalies can occur when two transactions read overlapping data and make disjoint writes based on what they read.
## Write-Write Conflict Detection
Write-write conflicts are prevented through single-writer serialization:
```rust
// Only one read-write transaction can hold the write lock at a time
pub fn acquire_write_lock(&self) -> impl Drop + '_ {
self.write_lock.lock()
}
```
Conflict detection at the row level:
```rust
// When attempting to modify a row, check if it was modified after our snapshot
if !self.snapshot.can_see(current_version.created_by) {
// Another transaction modified this row after our snapshot started
return Err(Error::WriteConflict);
}
```
## Aborted Transaction Handling
The implementation tracks aborted transactions to ensure their changes remain invisible:
```rust
// Check visibility accounting for aborted transactions
pub fn is_row_visible(
&self,
snapshot: &Snapshot,
created_by: TransactionId,
deleted_by: Option<TransactionId>,
) -> bool {
// If creator was aborted, row is invisible
if self.is_aborted(created_by) { return false; }
// If deleter was aborted, treat as not deleted
let effective_deleted_by = match deleted_by {
Some(del_by) if self.is_aborted(del_by) => None,
other => other,
};
snapshot.is_visible(created_by, effective_deleted_by)
}
```
Cleanup of old aborted transaction IDs:
```rust
// Remove aborted txn IDs older than any active transaction's snapshot
tm.cleanup_aborted_txns(older_than_txn_id);
```
## Transaction Lifecycle
```
1. BEGIN
├── Assign unique TransactionId
├── Create Snapshot (capture high_water_mark, in_progress set)
└── Register in active_txns map
2. READ (can repeat)
├── Check current version visibility using snapshot
├── If not visible, walk version chain
└── Return first visible version (or not found)
3. WRITE (can repeat, ReadWrite mode only)
├── Acquire write lock (blocks other writers)
├── Check for write-write conflict
├── Create new version with our txn_id as created_by
└── Move old version to version chain
4. COMMIT or ABORT
├── COMMIT:
│ ├── Log to WAL (if provided)
│ ├── Remove from active_txns
│ └── Update oldest_active watermark
└── ABORT:
├── Log to WAL (if provided)
├── Remove from active_txns
├── Add to aborted_txns set
└── Update oldest_active watermark
5. DROP (if not committed/aborted)
└── Auto-abort the transaction
```
## Usage Examples
### Basic Transaction Usage
```rust
use featherdb_mvcc::{TransactionManager, TransactionMode};
let tm = TransactionManager::new();
// Start a read-write transaction
let txn = tm.begin(TransactionMode::ReadWrite);
println!("Transaction ID: {:?}", txn.id());
// Check if we can write (will error for read-only transactions)
txn.check_write()?;
// Check visibility of another transaction's changes
if txn.can_see(some_txn_id) {
// Changes from some_txn_id are visible
}
// Commit the transaction
txn.commit()?;
```
### Read-Only Transaction
```rust
let txn = tm.begin_read_only();
// Read operations here...
// Attempting to write will fail
assert!(txn.check_write().is_err());
// Read-only transactions can commit or abort
txn.commit()?;
```
### Snapshot Visibility Check
```rust
let snapshot = txn.snapshot();
// Check if a version is visible
if snapshot.is_visible(created_by, deleted_by) {
// The row is visible to this transaction
}
```
### Running Garbage Collection
```rust
// Run GC manually
let stats = tm.run_gc();
println!("GC removed {} versions, ~{} bytes freed",
stats.versions_removed, stats.bytes_freed);
// Check version store size
println!("Remaining old versions: {}", tm.version_count());
println!("Active transactions: {}", tm.active_count());
println!("Tracked aborted transactions: {}", tm.aborted_count());
```
### WAL Integration
```rust
use featherdb_storage::Wal;
let mut wal = Wal::open("data/wal")?;
let txn = tm.begin_read_write();
// ... perform operations ...
// Commit with WAL logging
txn.commit_with_wal(&mut wal)?;
// Or abort with WAL logging
// txn.abort_with_wal(&mut wal)?;
```
### Savepoint/Sub-transaction Support
```rust
use std::collections::HashSet;
// Create a snapshot with visible ancestors (for sub-transactions)
let mut ancestors = HashSet::new();
ancestors.insert(parent_txn_id);
let snapshot = Snapshot::with_ancestors(
child_txn_id,
high_water_mark,
in_progress,
ancestors,
);
// The child snapshot can now see the parent's uncommitted changes
assert!(snapshot.can_see(parent_txn_id));
```
## Concurrency Guarantees
| Read-Read | No blocking - concurrent reads always allowed |
| Read-Write | No blocking - readers see their snapshot |
| Write-Read | No blocking - writer's uncommitted changes invisible to readers |
| Write-Write | Serialized - single writer lock prevents concurrent writes |
## Configuration Options
Currently, the MVCC layer uses sensible defaults. Configurable options include:
| Initial TxnID | 1 | Starting transaction ID |
| Write Lock | Mutex | Single-writer enforcement |
| Version Store | In-memory HashMap | Storage for old versions |
## Performance Considerations
1. **Version Chain Length**: Long chains slow down reads as they require traversal. Run GC regularly to keep chains short.
2. **Active Transaction Count**: Long-running transactions prevent GC from reclaiming old versions. Keep transactions short.
3. **Write Lock Contention**: All writes are serialized through a single lock. This simplifies implementation but can bottleneck write-heavy workloads.
4. **Aborted Transaction Set**: The aborted transaction set grows until cleanup. Call `cleanup_aborted_txns()` periodically.
5. **Snapshot Memory**: Each snapshot stores the set of in-progress transactions. Many concurrent transactions increase memory usage.
## Testing
```bash
# Run all MVCC tests (using Make)
make test-crate CRATE=featherdb-mvcc
# Or with cargo directly
cargo test -p featherdb-mvcc
# Test specific components
cargo test -p featherdb-mvcc snapshot
cargo test -p featherdb-mvcc transaction
cargo test -p featherdb-mvcc version
# Run with output
cargo test -p featherdb-mvcc -- --nocapture
# Run coverage (from project root)
make coverage # or: cargo llvm-cov --workspace
```
## Transaction Timeout
Transactions can be configured with automatic timeouts to prevent long-running transactions from blocking the system.
### Configuration
```rust
use featherdb_core::TransactionConfig;
use std::time::Duration;
// Create config with 5 second timeout
let config = TransactionConfig::new()
.with_timeout(Duration::from_secs(5).as_millis() as u64);
let txn = tm.begin_with_config(config);
```
### Timeout Behavior
When a transaction exceeds its configured timeout:
- All subsequent operations fail with `Error::TransactionTimeout`
- The transaction must be rolled back
- Timeout is checked automatically on every operation
### Timeout API
```rust
// Check if transaction has expired
if txn.is_expired() {
println!("Transaction timed out!");
}
// Get remaining time before timeout
if let Some(remaining) = txn.remaining_time() {
println!("{}ms remaining", remaining.as_millis());
}
// Get elapsed time since transaction started
let elapsed = txn.elapsed();
println!("Running for {}ms", elapsed.as_millis());
```
### Error Handling
```rust
match txn.execute("UPDATE accounts SET balance = 100") {
Ok(_) => println!("Success"),
Err(Error::TransactionTimeout { transaction_id, elapsed_ms, timeout_ms }) => {
println!("Transaction {} timed out after {}ms (limit: {}ms)",
transaction_id, elapsed_ms, timeout_ms);
txn.rollback()?;
},
Err(e) => println!("Other error: {}", e),
}
```
## Deadlock Detection
Deadlock detection prevents circular wait conditions between transactions using cycle detection in a wait-for graph.
### How It Works
1. **Wait Graph**: Tracks which transactions are waiting for which other transactions
2. **Cycle Detection**: Uses DFS to detect cycles in the wait graph
3. **Victim Selection**: Aborts the youngest transaction (highest ID) in the cycle
4. **Automatic Resolution**: Deadlocks are detected and resolved automatically
### WaitGraph API
```rust
use featherdb_mvcc::WaitGraph;
let mut graph = WaitGraph::new();
// Record that txn1 is waiting for txn2
graph.add_wait(txn1, txn2);
graph.add_wait(txn2, txn3);
graph.add_wait(txn3, txn1); // Creates a cycle
// Detect the cycle
if let Some(cycle) = graph.detect_cycle() {
println!("Deadlock detected: {:?}", cycle);
let victim = graph.select_victim(&cycle);
println!("Aborting transaction {}", victim);
}
// Clean up when transaction completes
graph.remove_transaction(txn1);
```
### Integration Example
```rust
// Deadlock detection is integrated into the transaction manager
// When a transaction waits for a lock:
wait_graph.add_wait(waiter_txn, holder_txn);
if let Some(cycle) = wait_graph.detect_cycle() {
let victim = wait_graph.select_victim(&cycle);
// Abort the victim transaction
tm.abort(victim)?;
return Err(Error::Deadlock);
}
```
## Long-Running Transaction Detection
Monitor long-running transactions to identify potential GC blockers and performance issues.
### Detection API
```rust
use std::time::Duration;
// Find transactions running longer than 5 seconds
let long_running = tm.long_running_transactions(Duration::from_secs(5));
for txn_info in &long_running {
println!("Transaction {}: running for {:?}", txn_info.id, txn_info.age);
}
```
### GC Status Monitoring
```rust
// Check if GC is blocked
let gc_status = tm.gc_status();
if gc_status.blocked {
println!("GC blocked by transaction {:?}", gc_status.oldest_active_txn);
println!("Reason: {:?}", gc_status.blocked_reason);
}
```
### TransactionInfo
```rust
pub struct TransactionInfo {
pub id: u64, // Transaction ID
pub age: Duration, // How long it has been running
}
```
### GcStatus
```rust
pub struct GcStatus {
pub oldest_active_txn: Option<u64>, // Oldest transaction blocking GC
pub blocked: bool, // Whether GC is blocked
pub blocked_reason: Option<String>, // Human-readable reason
}
```
### Use Cases
**Database Monitoring:**
```rust
// Periodic health check
let long_txns = tm.long_running_transactions(Duration::from_secs(5));
if !long_txns.is_empty() {
warn!("Found {} long-running transactions", long_txns.len());
}
```
**GC Health Check:**
```rust
let status = tm.gc_status();
if status.blocked {
info!("GC blocked: {}", status.blocked_reason.unwrap());
// Take action: notify admin, log warning, etc.
}
```
**Performance Analysis:**
```rust
// Correlate long-running transactions with GC blocking
let long_running = tm.long_running_transactions(Duration::from_secs(1));
let gc_status = tm.gc_status();
if let Some(oldest) = gc_status.oldest_active_txn {
if long_running.iter().any(|t| t.id == oldest) {
warn!("Long-running transaction {} is blocking GC!", oldest);
}
}
```
## Adaptive Garbage Collection
Adaptive GC automatically adjusts collection frequency based on workload patterns and metrics.
### Configuration
```rust
use featherdb_mvcc::gc::{GcConfig, GcScheduler};
let config = GcConfig {
min_interval_ms: 1000, // Minimum 1 second between GC runs
max_versions_before_gc: 10000, // Trigger GC at 10k versions
adaptive: true, // Enable adaptive scheduling
};
let mut scheduler = GcScheduler::new(config);
```
### GC Metrics
```rust
// Get current metrics
let metrics = scheduler.metrics().snapshot();
println!("Versions created: {}", metrics.versions_created);
println!("Versions cleaned: {}", metrics.versions_cleaned);
println!("GC runs: {}", metrics.gc_runs);
println!("GC time: {}μs", metrics.gc_time_us);
println!("Pending: {}", metrics.versions_pending);
```
### Adaptive Scheduling
```rust
// Check if GC should run based on adaptive heuristics
if scheduler.should_run_gc() {
let start = Instant::now();
let stats = tm.run_gc();
scheduler.record_gc_run(start.elapsed());
scheduler.record_versions_cleaned(stats.versions_removed as u64);
}
// Record version creation
scheduler.metrics().record_version_created();
```
### How Adaptive GC Works
1. **Metrics Tracking**: Monitors versions created, cleaned, and pending
2. **Heuristics**: Decides when to run GC based on:
- Time since last GC run (respects min_interval)
- Number of pending versions (respects max_versions_before_gc)
- Historical GC efficiency
3. **Performance**: Prevents unnecessary GC runs when system is idle
4. **Safety**: Ensures GC runs when memory pressure is high
## Future Improvements
- [ ] First-committer-wins write-write conflict detection
- [ ] Serializable isolation level (prevent write skew)
- [ ] Multi-version indexes
- [ ] Background GC thread with configurable interval
- [ ] Read-only transaction optimization (skip active tracking)