OxiGDAL High Availability (HA)

High availability, disaster recovery, and automatic failover for OxiGDAL with 99.99% uptime target.

Features

🔄 Active-Active Replication (~1,500 LOC)

Asynchronous replication with batching
Bi-directional sync between nodes
Conflict-free replicated data types (CRDTs)
Vector clocks for causality tracking
Multiple replication topologies (star, mesh, tree)
Bandwidth optimization with compression
Replication lag monitoring

⚡ Automatic Failover (~1,200 LOC)

Sub-second failover (< 1 second target)
Heartbeat-based failure detection
Raft-based leader election
Automatic replica promotion
Client traffic redirection
Graceful degradation
Automatic recovery and failback support

🔀 Conflict Resolution (~800 LOC)

Last-write-wins (LWW) strategy
Vector clock-based resolution
Priority-based resolution
Custom merge functions
Manual resolution support
Conflict audit trail

💾 Point-in-Time Recovery (~1,000 LOC)

WAL-based recovery system
Snapshot management with compression
Incremental recovery
Configurable snapshot intervals
RTO/RPO tracking

📦 Incremental Backups (~800 LOC)

Full, incremental, and differential backups
Backup compression (LZ4, Zstd, Gzip)
Backup verification with checksums
Retention policies
Cloud backup integration ready

🌍 Disaster Recovery (~600 LOC)

Cross-region replication
Automated DR runbooks
DR testing and validation
RTO/RPO measurement
Failover orchestration

🏥 Health Check System (~400 LOC)

Liveness checks
Readiness checks
Dependency health monitoring
Health aggregation
HTTP endpoint ready

Performance Targets

Uptime: 99.99% (52 minutes downtime/year)
Failover Time: < 1 second
RTO (Recovery Time Objective): < 5 minutes
RPO (Recovery Point Objective): < 1 minute

Architecture

┌─────────────────────────────────────────────────────────────┐
│                      OxiGDAL HA System                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐    │
│  │ Replication  │  │   Failover   │  │   Recovery   │    │
│  ├──────────────┤  ├──────────────┤  ├──────────────┤    │
│  │ Active-Active│  │ Detection    │  │ PITR         │    │
│  │ Protocol     │  │ Election     │  │ Snapshot     │    │
│  │ Lag Monitor  │  │ Promotion    │  │ WAL          │    │
│  └──────────────┘  └──────────────┘  └──────────────┘    │
│                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐    │
│  │   Conflict   │  │    Backup    │  │      DR      │    │
│  ├──────────────┤  ├──────────────┤  ├──────────────┤    │
│  │ LWW          │  │ Full         │  │ Orchestration│    │
│  │ Vector Clock │  │ Incremental  │  │ Runbooks     │    │
│  │ Custom Merge │  │ Differential │  │ Testing      │    │
│  └──────────────┘  └──────────────┘  └──────────────┘    │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Usage Examples

Basic Replication Setup

use oxigdal_ha::replication::{
    ActiveActiveReplication, ReplicationConfig, ReplicationManager,
    ReplicaNode, ReplicationState,
};
use uuid::Uuid;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create replication manager
    let node_id = Uuid::new_v4();
    let config = ReplicationConfig::default();
    let replication = ActiveActiveReplication::new(node_id, config);

    // Start replication
    replication.start().await?;

    // Add replica
    let replica = ReplicaNode {
        id: Uuid::new_v4(),
        name: "replica1".to_string(),
        address: "replica1.example.com:5000".to_string(),
        priority: 100,
        state: ReplicationState::Active,
        last_replicated_at: None,
        lag_ms: None,
    };
    replication.add_replica(replica).await?;

    // Replicate data
    let event = ReplicationEvent::new(
        node_id,
        replica.id,
        vec![1, 2, 3, 4, 5],
        1,
    );
    replication.replicate(event).await?;

    Ok(())
}

Automatic Failover

use oxigdal_ha::failover::{
    detection::FailureDetector,
    election::LeaderElection,
    promotion::ReplicaPromotion,
    FailoverConfig, PromotionStrategy,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = FailoverConfig::default();
    
    // Start failure detection
    let detector = FailureDetector::new(config.clone());
    detector.start().await?;

    // Setup leader election
    let node_id = Uuid::new_v4();
    let election = LeaderElection::new(node_id, 100, config.clone());
    
    // On failure, start election
    let result = election.start_election().await?;
    println!("New leader elected: {}", result.winner_id);

    Ok(())
}

Point-in-Time Recovery

use oxigdal_ha::recovery::{
    pitr::PitrManager,
    RecoveryConfig,
    RecoveryTarget,
};
use std::path::PathBuf;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = RecoveryConfig::default();
    let data_dir = PathBuf::from("/var/lib/oxigdal/data");
    
    let manager = PitrManager::new(config, data_dir);
    
    // Recover to latest state
    let result = manager.recover(RecoveryTarget::Latest).await?;
    
    println!(
        "Recovery complete: {} transactions replayed in {}ms",
        result.transactions_replayed,
        result.duration_ms
    );

    Ok(())
}

Disaster Recovery

use oxigdal_ha::dr::{
    orchestration::DrOrchestrator,
    DrConfig,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = DrConfig {
        primary_region: "us-east-1".to_string(),
        dr_region: "us-west-2".to_string(),
        rto_seconds: 300,
        rpo_seconds: 60,
        enable_auto_failover: false,
    };
    
    let orchestrator = DrOrchestrator::new(config);
    
    // Execute DR failover
    let result = orchestrator.execute_failover().await?;
    
    println!(
        "DR failover complete: {} -> {} in {}s",
        result.old_primary,
        result.new_primary,
        result.rto_achieved_seconds
    );

    Ok(())
}

Testing

# Run all tests
cargo test

# Run specific test suites
cargo test --test replication_test
cargo test --test failover_test
cargo test --test recovery_test

# Run benchmarks
cargo bench

Benchmarks

Performance benchmarks for key operations:

cargo bench --bench ha_bench

Benchmark results:

Replication Throughput: 10,000+ events/second
Failover Latency: < 1 second
Recovery Time: Varies by data size

COOLJAPAN Compliance

✅ Pure Rust - No C/Fortran dependencies
✅ No unwrap() - All error handling uses Result types
✅ Files < 2000 lines - All source files are well-structured
✅ Workspace dependencies - Uses workspace-level dependency management

Implementation Statistics

Total Lines of Code: ~5,655 LOC
Core Implementation: ~4,020 LOC
Source Files: 35 Rust files
Test Files: 6 comprehensive test suites
Benchmarks: Performance benchmarks included

Module Breakdown

Module	LOC	Description
Replication	~1,500	Active-active replication
Failover	~1,200	Automatic failover
Conflict	~800	Conflict resolution
Recovery	~1,000	Point-in-time recovery
Backup	~800	Incremental backups
DR	~600	Disaster recovery
Health Check	~400	Health monitoring
Error	~100	Error types
Lib	~50	Library root

License

Apache-2.0

Authors

COOLJAPAN OU (Team Kitasan)

Repository

https://github.com/cool-japan/oxigdal

oxigdal-ha 0.1.3