oxigdal-ha 0.1.3

# OxiGDAL High Availability (HA)

High availability, disaster recovery, and automatic failover for OxiGDAL, with a 99.99% uptime target.

## Features

### 🔄 Active-Active Replication (~1,500 LOC)
- Asynchronous replication with batching
- Bi-directional sync between nodes
- Conflict-free replicated data types (CRDTs)
- Vector clocks for causality tracking
- Multiple replication topologies (star, mesh, tree)
- Bandwidth optimization with compression
- Replication lag monitoring
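
To illustrate how vector clocks track causality between replicas, here is a minimal sketch using only the standard library. The `VectorClock` type and its methods are hypothetical names chosen for this example and are not taken from the `oxigdal-ha` API.

```rust
use std::cmp::Ordering;
use std::collections::BTreeMap;

/// Hypothetical vector clock: one logical counter per node ID.
#[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct VectorClock {
    counters: BTreeMap<String, u64>,
}

impl VectorClock {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record a local event on `node`.
    pub fn tick(&mut self, node: &str) {
        *self.counters.entry(node.to_string()).or_insert(0) += 1;
    }

    /// Take the element-wise maximum with `other` (done when a remote update arrives).
    pub fn merge(&mut self, other: &VectorClock) {
        for (node, &count) in &other.counters {
            let entry = self.counters.entry(node.clone()).or_insert(0);
            *entry = (*entry).max(count);
        }
    }

    /// `Some(Less)` if `self` happened before `other`, `Some(Greater)` if after,
    /// `Some(Equal)` if identical, and `None` if the two clocks are concurrent.
    pub fn compare(&self, other: &VectorClock) -> Option<Ordering> {
        let mut less = false;
        let mut greater = false;
        for node in self.counters.keys().chain(other.counters.keys()) {
            let a = self.counters.get(node).copied().unwrap_or(0);
            let b = other.counters.get(node).copied().unwrap_or(0);
            if a < b {
                less = true;
            }
            if a > b {
                greater = true;
            }
        }
        match (less, greater) {
            (false, false) => Some(Ordering::Equal),
            (true, false) => Some(Ordering::Less),
            (false, true) => Some(Ordering::Greater),
            (true, true) => None,
        }
    }
}
```

A `None` result marks two updates as concurrent, which is the case a conflict resolution strategy has to handle.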

### ⚡ Automatic Failover (~1,200 LOC)
- Sub-second failover (< 1 second target)
- Heartbeat-based failure detection
- Raft-based leader election
- Automatic replica promotion
- Client traffic redirection
- Graceful degradation
- Automatic recovery and failback support
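
Heartbeat-based failure detection can be sketched as a timeout check over the last heartbeat seen from each node. The `HeartbeatMonitor` type below is a hypothetical illustration, not the crate's `FailureDetector`, and the timeout value is left to the caller.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical monitor that suspects a node once its heartbeat is overdue.
pub struct HeartbeatMonitor {
    timeout: Duration,
    last_seen: HashMap<String, Instant>,
}

impl HeartbeatMonitor {
    pub fn new(timeout: Duration) -> Self {
        Self {
            timeout,
            last_seen: HashMap::new(),
        }
    }

    /// Record a heartbeat received from `node` at time `at`.
    pub fn record(&mut self, node: &str, at: Instant) {
        self.last_seen.insert(node.to_string(), at);
    }

    /// Return the nodes whose last heartbeat is older than the timeout, sorted by name.
    pub fn failed_nodes(&self, now: Instant) -> Vec<String> {
        let mut failed: Vec<String> = self
            .last_seen
            .iter()
            .filter(|(_, seen)| now.saturating_duration_since(**seen) > self.timeout)
            .map(|(node, _)| node.clone())
            .collect();
        failed.sort();
        failed
    }
}
```

Passing `now` explicitly keeps the check deterministic in tests. To stay within a sub-second failover target, the timeout would need to be well under one second.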

### 🔀 Conflict Resolution (~800 LOC)
- Last-write-wins (LWW) strategy
- Vector clock-based resolution
- Priority-based resolution
- Custom merge functions
- Manual resolution support
- Conflict audit trail
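
The last-write-wins and priority-based strategies differ only in which field is compared first. The sketch below shows that idea with hypothetical `Versioned` and `Strategy` types; the field names and tie-breaking rules are assumptions for illustration, not the crate's actual behavior.

```rust
/// Hypothetical versioned value carrying the metadata used for resolution.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Versioned {
    pub value: Vec<u8>,
    pub timestamp_ms: u64,
    pub node_priority: u32,
}

/// Hypothetical resolution strategies.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Strategy {
    /// Newest timestamp wins; priority breaks ties.
    LastWriteWins,
    /// Highest priority wins; timestamp breaks ties.
    Priority,
}

/// Pick the winner of two conflicting versions. `a` is kept on a full tie.
pub fn resolve(strategy: Strategy, a: Versioned, b: Versioned) -> Versioned {
    let key = |v: &Versioned| match strategy {
        Strategy::LastWriteWins => (v.timestamp_ms, u64::from(v.node_priority)),
        Strategy::Priority => (u64::from(v.node_priority), v.timestamp_ms),
    };
    if key(&b) > key(&a) {
        b
    } else {
        a
    }
}
```

Wall-clock timestamps can disagree between nodes, so last-write-wins may drop a write that was causally later; vector clock-based resolution avoids that at the cost of extra metadata.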

### 💾 Point-in-Time Recovery (~1,000 LOC)
- WAL-based recovery system
- Snapshot management with compression
- Incremental recovery
- Configurable snapshot intervals
- RTO/RPO tracking
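
WAL-based point-in-time recovery generally means restoring a snapshot and then replaying the log entries written after it, up to the target time. Here is a minimal sketch of that idea. `WalEntry`, `Snapshot`, and `recover_to` are hypothetical names, and the sketch assumes the WAL is ordered by log sequence number with non-decreasing timestamps.

```rust
use std::collections::BTreeMap;

/// Hypothetical WAL record: `value: None` represents a delete.
pub struct WalEntry {
    pub lsn: u64,
    pub timestamp_ms: u64,
    pub key: String,
    pub value: Option<Vec<u8>>,
}

/// Hypothetical snapshot of the key-value state as of `lsn`.
pub struct Snapshot {
    pub lsn: u64,
    pub state: BTreeMap<String, Vec<u8>>,
}

/// Restore the snapshot, then replay WAL entries that are newer than the
/// snapshot and not later than `target_ms`. Returns the recovered state and
/// the number of entries replayed.
pub fn recover_to(
    snapshot: &Snapshot,
    wal: &[WalEntry],
    target_ms: u64,
) -> (BTreeMap<String, Vec<u8>>, usize) {
    let mut state = snapshot.state.clone();
    let mut replayed = 0;
    for entry in wal
        .iter()
        .filter(|e| e.lsn > snapshot.lsn && e.timestamp_ms <= target_ms)
    {
        match &entry.value {
            Some(value) => {
                state.insert(entry.key.clone(), value.clone());
            }
            None => {
                state.remove(&entry.key);
            }
        }
        replayed += 1;
    }
    (state, replayed)
}
```

Shorter snapshot intervals reduce the number of entries to replay, which is the main lever on recovery time in this model.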

### 📦 Incremental Backups (~800 LOC)
- Full, incremental, and differential backups
- Backup compression (LZ4, Zstd, Gzip)
- Backup verification with checksums
- Retention policies
- Cloud backup integration ready
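
The three backup types determine which files a restore needs. The sketch below uses the usual definitions as an assumption: a differential backup holds changes since the last full backup, and an incremental backup holds changes since the previous backup of any kind. `Backup`, `BackupKind`, and `restore_chain` are hypothetical names, not the crate's API.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum BackupKind {
    Full,
    Incremental,
    Differential,
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Backup {
    pub id: u32,
    pub kind: BackupKind,
}

/// Given backups in chronological order, return the IDs needed to restore the
/// latest state: the last full backup, the last differential after it (if any),
/// and every incremental taken after that point.
pub fn restore_chain(backups: &[Backup]) -> Vec<u32> {
    // Without a full backup there is nothing to restore from.
    let full_idx = match backups.iter().rposition(|b| b.kind == BackupKind::Full) {
        Some(idx) => idx,
        None => return Vec::new(),
    };
    let mut chain = vec![backups[full_idx].id];
    let after_full = &backups[full_idx + 1..];

    // The latest differential supersedes everything between it and the full backup.
    let rest = match after_full
        .iter()
        .rposition(|b| b.kind == BackupKind::Differential)
    {
        Some(idx) => {
            chain.push(after_full[idx].id);
            &after_full[idx + 1..]
        }
        None => after_full,
    };
    chain.extend(
        rest.iter()
            .filter(|b| b.kind == BackupKind::Incremental)
            .map(|b| b.id),
    );
    chain
}
```

Differentials shorten the restore chain at the cost of larger backup files, while incrementals do the opposite.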

### 🌍 Disaster Recovery (~600 LOC)
- Cross-region replication
- Automated DR runbooks
- DR testing and validation
- RTO/RPO measurement
- Failover orchestration
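
RTO/RPO measurement comes down to three timestamps from a failover or DR test: the last replicated write, the failure, and the moment service is restored. The sketch below is one way to compute and check them. `DrTargets` and `DrDrill` are hypothetical types, not part of the crate's `dr` module.

```rust
use std::time::Duration;

/// Hypothetical recovery objectives to validate against.
pub struct DrTargets {
    pub rto: Duration,
    pub rpo: Duration,
}

/// Hypothetical timestamps (milliseconds on a shared clock) from a DR test.
pub struct DrDrill {
    pub last_replicated_ms: u64,
    pub failure_ms: u64,
    pub restored_ms: u64,
}

impl DrDrill {
    /// Data-loss window: time between the last replicated write and the failure.
    pub fn rpo_achieved(&self) -> Duration {
        Duration::from_millis(self.failure_ms.saturating_sub(self.last_replicated_ms))
    }

    /// Outage length: time between the failure and service restoration.
    pub fn rto_achieved(&self) -> Duration {
        Duration::from_millis(self.restored_ms.saturating_sub(self.failure_ms))
    }

    /// True when both measured values are within the targets.
    pub fn meets(&self, targets: &DrTargets) -> bool {
        self.rto_achieved() <= targets.rto && self.rpo_achieved() <= targets.rpo
    }
}
```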

### 🏥 Health Check System (~400 LOC)
- Liveness checks
- Readiness checks
- Dependency health monitoring
- Health aggregation
- HTTP endpoint ready
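
Health aggregation can be as simple as reporting the worst status among all individual checks. The sketch below assumes a three-level status and treats an empty check list as healthy. `HealthStatus`, `CheckResult`, and `aggregate` are hypothetical names for illustration.

```rust
/// Hypothetical status levels, ordered from best to worst.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum HealthStatus {
    Healthy,
    Degraded,
    Unhealthy,
}

/// Hypothetical result of one liveness, readiness, or dependency check.
pub struct CheckResult {
    pub name: &'static str,
    pub status: HealthStatus,
}

/// The overall status is the worst individual status.
/// The derived `Ord` follows declaration order, so `max` picks the worst.
pub fn aggregate(checks: &[CheckResult]) -> HealthStatus {
    checks
        .iter()
        .map(|check| check.status)
        .max()
        .unwrap_or(HealthStatus::Healthy)
}
```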

## Performance Targets

- **Uptime**: 99.99% (about 52.6 minutes of downtime per year)
- **Failover Time**: < 1 second
- **RTO (Recovery Time Objective)**: < 5 minutes
- **RPO (Recovery Point Objective)**: < 1 minute
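
The uptime figure translates directly into a yearly downtime budget. As a quick check of the arithmetic, the hypothetical helper below assumes a 365-day year.

```rust
/// Yearly downtime budget in minutes for a given availability percentage.
/// Assumes a 365-day year (525,600 minutes).
pub fn downtime_budget_minutes(availability_percent: f64) -> f64 {
    const MINUTES_PER_YEAR: f64 = 365.0 * 24.0 * 60.0;
    MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)
}
```

For 99.99% this gives roughly 52.56 minutes per year, so a single incident that hits the 5-minute RTO consumes about a tenth of the annual budget.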

## Architecture

```
┌──────────────────────────────────────────────────────────┐
│                    OxiGDAL HA System                     │
├──────────────────────────────────────────────────────────┤
│                                                          │
│   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│   │ Replication  │  │   Failover   │  │   Recovery   │   │
│   ├──────────────┤  ├──────────────┤  ├──────────────┤   │
│   │ Active-Active│  │ Detection    │  │ PITR         │   │
│   │ Protocol     │  │ Election     │  │ Snapshot     │   │
│   │ Lag Monitor  │  │ Promotion    │  │ WAL          │   │
│   └──────────────┘  └──────────────┘  └──────────────┘   │
│                                                          │
│   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│   │   Conflict   │  │    Backup    │  │      DR      │   │
│   ├──────────────┤  ├──────────────┤  ├──────────────┤   │
│   │ LWW          │  │ Full         │  │ Orchestration│   │
│   │ Vector Clock │  │ Incremental  │  │ Runbooks     │   │
│   │ Custom Merge │  │ Differential │  │ Testing      │   │
│   └──────────────┘  └──────────────┘  └──────────────┘   │
│                                                          │
└──────────────────────────────────────────────────────────┘
```

## Usage Examples

### Basic Replication Setup

```rust
// Note: `ReplicationEvent` is assumed to be exported from `oxigdal_ha::replication`.
use oxigdal_ha::replication::{
    ActiveActiveReplication, ReplicaNode, ReplicationConfig, ReplicationEvent,
    ReplicationManager, ReplicationState,
};
use uuid::Uuid;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create the replication manager for this node
    let node_id = Uuid::new_v4();
    let config = ReplicationConfig::default();
    let replication = ActiveActiveReplication::new(node_id, config);

    // Start replication
    replication.start().await?;

    // Add a replica
    let replica = ReplicaNode {
        id: Uuid::new_v4(),
        name: "replica1".to_string(),
        address: "replica1.example.com:5000".to_string(),
        priority: 100,
        state: ReplicationState::Active,
        last_replicated_at: None,
        lag_ms: None,
    };
    // Keep the ID, because `replica` is moved into `add_replica` below
    let replica_id = replica.id;
    replication.add_replica(replica).await?;

    // Replicate data to the new replica
    let event = ReplicationEvent::new(
        node_id,
        replica_id,
        vec![1, 2, 3, 4, 5],
        1,
    );
    replication.replicate(event).await?;

    Ok(())
}
```

### Automatic Failover

```rust
use oxigdal_ha::failover::{
    detection::FailureDetector,
    election::LeaderElection,
    promotion::ReplicaPromotion,
    FailoverConfig, PromotionStrategy,
};
use uuid::Uuid;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = FailoverConfig::default();
    
    // Start failure detection
    let detector = FailureDetector::new(config.clone());
    detector.start().await?;

    // Setup leader election
    let node_id = Uuid::new_v4();
    let election = LeaderElection::new(node_id, 100, config.clone());
    
    // On failure, start election
    let result = election.start_election().await?;
    println!("New leader elected: {}", result.winner_id);

    Ok(())
}
```

### Point-in-Time Recovery

```rust
use oxigdal_ha::recovery::{
    pitr::PitrManager,
    RecoveryConfig,
    RecoveryTarget,
};
use std::path::PathBuf;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = RecoveryConfig::default();
    let data_dir = PathBuf::from("/var/lib/oxigdal/data");
    
    let manager = PitrManager::new(config, data_dir);
    
    // Recover to latest state
    let result = manager.recover(RecoveryTarget::Latest).await?;
    
    println!(
        "Recovery complete: {} transactions replayed in {}ms",
        result.transactions_replayed,
        result.duration_ms
    );

    Ok(())
}
```

### Disaster Recovery

```rust
use oxigdal_ha::dr::{
    orchestration::DrOrchestrator,
    DrConfig,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = DrConfig {
        primary_region: "us-east-1".to_string(),
        dr_region: "us-west-2".to_string(),
        rto_seconds: 300,
        rpo_seconds: 60,
        enable_auto_failover: false,
    };
    
    let orchestrator = DrOrchestrator::new(config);
    
    // Execute DR failover
    let result = orchestrator.execute_failover().await?;
    
    println!(
        "DR failover complete: {} -> {} in {}s",
        result.old_primary,
        result.new_primary,
        result.rto_achieved_seconds
    );

    Ok(())
}
```

## Testing

```bash
# Run all tests
cargo test

# Run specific test suites
cargo test --test replication_test
cargo test --test failover_test
cargo test --test recovery_test

# Run benchmarks
cargo bench
```

## Benchmarks

Performance benchmarks for key operations:

```bash
cargo bench --bench ha_bench
```

Benchmark results:
- **Replication Throughput**: 10,000+ events/second
- **Failover Latency**: < 1 second
- **Recovery Time**: Varies by data size

## COOLJAPAN Compliance

✅ **Pure Rust** - No C/Fortran dependencies  
✅ **No unwrap()** - All error handling uses Result types  
✅ **Files < 2000 lines** - All source files are well-structured  
✅ **Workspace dependencies** - Uses workspace-level dependency management  

## Implementation Statistics

- **Total Lines of Code**: ~5,655 LOC
- **Core Implementation**: ~4,020 LOC
- **Source Files**: 35 Rust files
- **Test Files**: 6 comprehensive test suites
- **Benchmarks**: Performance benchmarks included

## Module Breakdown

| Module | LOC | Description |
|--------|-----|-------------|
| Replication | ~1,500 | Active-active replication |
| Failover | ~1,200 | Automatic failover |
| Conflict | ~800 | Conflict resolution |
| Recovery | ~1,000 | Point-in-time recovery |
| Backup | ~800 | Incremental backups |
| DR | ~600 | Disaster recovery |
| Health Check | ~400 | Health monitoring |
| Error | ~100 | Error types |
| Lib | ~50 | Library root |

## License

Apache-2.0

## Authors

COOLJAPAN OU (Team Kitasan)

## Repository

https://github.com/cool-japan/oxigdal