# OxiRS Cluster Performance Tuning Guide
**Version:** 0.2.3
**Last Updated:** 2026-03-05
## Overview
This guide provides comprehensive recommendations for optimizing OxiRS Cluster performance across different deployment scenarios, from small development clusters to large-scale production environments.
## Table of Contents
1. [Quick Start](#quick-start)
2. [Cloud Storage Optimization](#cloud-storage-optimization)
3. [Disaster Recovery Tuning](#disaster-recovery-tuning)
4. [Elastic Scaling Configuration](#elastic-scaling-configuration)
5. [ML Cost Optimization](#ml-cost-optimization)
6. [Monitoring & Metrics](#monitoring--metrics)
7. [Troubleshooting](#troubleshooting)
8. [Performance Benchmarks](#performance-benchmarks)
9. [Best Practices](#best-practices)
10. [Additional Resources](#additional-resources)
---
## Quick Start
### Development Environment (< 10 nodes)
```toml
[cluster]
min_nodes = 3
max_nodes = 5
target_cpu_utilization = 0.70
[storage]
default_tier = "hot"
encryption_enabled = false # Disable for development
compression_level = 3
[replication]
replication_factor = 1 # Single copy for development
sync_writes = false
```
### Production Environment (10-100 nodes)
```toml
[cluster]
min_nodes = 10
max_nodes = 100
target_cpu_utilization = 0.75
[storage]
default_tier = "hot"
encryption_enabled = true
compression_level = 6
[replication]
replication_factor = 3
sync_writes = true
rto_seconds = 300 # 5 minutes
rpo_seconds = 60 # 1 minute
```
### Large-Scale Production (100+ nodes)
```toml
[cluster]
min_nodes = 100
max_nodes = 1000
target_cpu_utilization = 0.80
[storage]
default_tier = "warm" # Cost optimization
encryption_enabled = true
compression_level = 9
[replication]
replication_factor = 5
sync_writes = false # Async for performance
batch_size = 1000
```
---
## Cloud Storage Optimization
### S3 Backend Tuning
**Upload Performance:**
- Use `StorageTier::Hot` for frequently accessed data
- Enable multipart upload for objects larger than 5MB (S3 requires every part except the last to be at least 5 MiB)
- Set appropriate upload buffer size (default: 8MB)
```rust
let config = CloudStorageConfig {
    provider: CloudProvider::AWS,
    region: "us-east-1".to_string(),
    bucket: "oxirs-production".to_string(),
    default_tier: StorageTier::Hot,
    encryption_enabled: true,
    versioning_enabled: false, // Disable if not needed
    lifecycle_rules: vec![
        LifecycleRule {
            id: "transition_to_cold".to_string(),
            transition_days: 90,
            target_tier: StorageTier::Cold,
            enabled: true,
        },
    ],
};
```
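The multipart threshold and buffer size from the Upload Performance list can be turned into a part plan. The following is a minimal sketch, assuming the 8MB buffer doubles as the part size and that MB is read as MiB; `plan_parts` and both constants are hypothetical names, not part of the OxiRS API:

```rust
/// Size above which the guide recommends multipart upload (5 MB, read as MiB).
const MULTIPART_THRESHOLD: u64 = 5 * 1024 * 1024;
/// Default upload buffer size from the guide (8 MB), assumed here to be the part size.
const PART_SIZE: u64 = 8 * 1024 * 1024;

/// Hypothetical helper: returns the (offset, length) of each part to upload.
/// Objects at or below the threshold are sent as a single part.
fn plan_parts(object_size: u64) -> Vec<(u64, u64)> {
    if object_size <= MULTIPART_THRESHOLD {
        return vec![(0, object_size)];
    }
    let mut parts = Vec::new();
    let mut offset = 0;
    while offset < object_size {
        // The last part may be shorter than PART_SIZE.
        let len = PART_SIZE.min(object_size - offset);
        parts.push((offset, len));
        offset += len;
    }
    parts
}
```

With these constants, a 20 MiB object is split into parts of 8, 8, and 4 MiB, while a 1 MiB object goes out as a single request.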
**Download Performance:**
- Enable connection pooling
- Use parallel downloads for large objects
- Consider CloudFront/CDN for frequently accessed data
**Cost Optimization:**
- Use `StorageTier::Warm` for infrequently accessed data (roughly 70% lower storage cost than hot)
- Use `StorageTier::Cold` for archival data (roughly 90% lower storage cost than hot)
- Implement lifecycle policies to automatically transition data
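To see what the tier savings above mean for a bill, here is a small worked sketch. It assumes the 70% and 90% figures are relative to the hot-tier storage price and ignores retrieval and request fees; the `Tier` enum and the per-GB price passed in are illustrative stand-ins, not OxiRS types or real provider prices:

```rust
/// Local stand-in for the storage tiers discussed in the guide.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tier {
    Hot,
    Warm,
    Cold,
}

/// Savings relative to the hot tier, using the percentages quoted in the guide.
fn savings(tier: Tier) -> f64 {
    match tier {
        Tier::Hot => 0.0,
        Tier::Warm => 0.70,
        Tier::Cold => 0.90,
    }
}

/// Monthly storage cost for `gb` gigabytes at a given hot-tier price per GB.
/// Retrieval and request charges are deliberately left out of this sketch.
fn monthly_cost(gb: f64, hot_price_per_gb: f64, tier: Tier) -> f64 {
    gb * hot_price_per_gb * (1.0 - savings(tier))
}
```

At a hypothetical hot price of $0.02 per GB, 1000 GB costs about $20 per month in hot, $6 in warm, and $2 in cold. Colder tiers usually add retrieval charges, so the access pattern matters as much as the storage price.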
### GCS Backend Tuning
**Regional Optimization:**
```rust
// Choose region closest to compute resources
let config = CloudStorageConfig {
    provider: CloudProvider::GCP,
    region: "us-central1".to_string(), // Same region as compute
    // ... other config
};
```
**Performance Tiers:**
- Standard: Default for hot data
- Nearline: Data accessed < 1/month (40% cheaper)
- Coldline: Data accessed < 1/quarter (70% cheaper)
- Archive: Long-term archival (90% cheaper)
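The access-frequency cutoffs above translate directly into a selection rule. This is a sketch only: `GcsClass` and `pick_class` are hypothetical names, and the "less than once a year" cutoff for Archive is an assumption, since the list only says "long-term archival":

```rust
/// Local stand-in for the four GCS storage classes listed above.
#[derive(Debug, PartialEq)]
enum GcsClass {
    Standard,
    Nearline,
    Coldline,
    Archive,
}

/// Picks a class from the expected number of accesses per year.
/// 12/year = once a month, 4/year = once a quarter.
fn pick_class(accesses_per_year: f64) -> GcsClass {
    if accesses_per_year >= 12.0 {
        GcsClass::Standard
    } else if accesses_per_year >= 4.0 {
        GcsClass::Nearline
    } else if accesses_per_year >= 1.0 {
        GcsClass::Coldline
    } else {
        // Assumed cutoff: data touched less than once a year.
        GcsClass::Archive
    }
}
```

A lifecycle policy effectively applies the same rule by object age instead of measured access counts, which is why the two approaches are often combined.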
### Azure Blob Storage Tuning
**Access Tiers:**
- Hot: Optimized for frequent access
- Cool: Infrequent access (50% storage cost savings)
- Archive: Rare access (95% storage cost savings)
**Performance:**
- Enable Azure CDN for global distribution
- Use premium block blobs for low-latency requirements
---
## Disaster Recovery Tuning
### RTO/RPO Configuration
**Aggressive (Mission-Critical):**
```rust
let config = DisasterRecoveryConfig {
    rto_seconds: 60,                 // 1 minute recovery
    rpo_seconds: 10,                 // 10 seconds data loss max
    auto_failover_enabled: true,
    health_check_interval_secs: 10,
    failover_threshold: 2,           // Fail over after 2 failed checks
    continuous_replication: true,
    replication_batch_size: 100,
    ..Default::default()
};
```
**Balanced (Standard Production):**
```rust
let config = DisasterRecoveryConfig {
    rto_seconds: 300,                // 5 minutes recovery
    rpo_seconds: 60,                 // 1 minute data loss max
    auto_failover_enabled: true,
    health_check_interval_secs: 30,
    failover_threshold: 3,
    continuous_replication: true,
    replication_batch_size: 1000,
    ..Default::default()
};
```
**Cost-Optimized (Non-Critical):**
```rust
let config = DisasterRecoveryConfig {
    rto_seconds: 1800,               // 30 minutes recovery
    rpo_seconds: 300,                // 5 minutes data loss max
    auto_failover_enabled: false,    // Manual failover
    health_check_interval_secs: 120,
    continuous_replication: false,
    replication_batch_size: 10000,
    ..Default::default()
};
```
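When choosing a health check interval and failover threshold, it helps to check how much of the RTO is spent just detecting the failure. The sketch below assumes detection takes roughly `health_check_interval_secs * failover_threshold` seconds (one interval per failed check); `DrTiming` and both functions are hypothetical helpers that mirror the field names above, not the real `DisasterRecoveryConfig`:

```rust
/// The three timing fields from the profiles above, copied into a local struct.
struct DrTiming {
    rto_seconds: u64,
    health_check_interval_secs: u64,
    failover_threshold: u64,
}

/// Approximate worst-case time to declare the primary failed.
fn detection_secs(t: &DrTiming) -> u64 {
    t.health_check_interval_secs * t.failover_threshold
}

/// Seconds left inside the RTO for the actual promotion once the failure
/// has been detected. Returns None if detection alone exceeds the RTO.
fn recovery_budget_secs(t: &DrTiming) -> Option<u64> {
    t.rto_seconds.checked_sub(detection_secs(t))
}
```

Under this assumption the aggressive profile spends 20 of its 60 seconds on detection, leaving 40 for promotion, and the balanced profile spends 90 of 300, leaving 210. A profile with a 60-second RTO but a 30-second interval and a threshold of 3 would already miss its RTO before failover starts.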
### Multi-Cloud Strategy
**Primary-Secondary Pattern:**
```
AWS (Primary) -> GCP (Secondary) -> Azure (Tertiary)
```
**Active-Active Pattern (High Availability):**
```
AWS (Active) <-> GCP (Active)
- Both serving traffic
- Bi-directional replication
- Geographic load balancing
```
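The primary-secondary chain amounts to "serve from the first healthy provider in priority order". A minimal sketch of that rule follows; `pick_active` is a hypothetical function and the health flags are assumed to come from the health checks configured earlier:

```rust
/// Returns the name of the first healthy provider in priority order,
/// or None if every provider in the chain is down.
fn pick_active<'a>(chain: &[(&'a str, bool)]) -> Option<&'a str> {
    chain
        .iter()
        .find(|(_, healthy)| *healthy)
        .map(|(name, _)| *name)
}
```

With the chain `[("aws", false), ("gcp", true), ("azure", true)]` this selects `gcp`. In the active-active pattern there is no single winner, so the equivalent step would be filtering the list down to all healthy providers instead.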
---
## Elastic Scaling Configuration
### Auto-Scaling Thresholds
**Conservative (Cost-Optimized):**
```rust
let config = ElasticScalingConfig {
    min_nodes: 5,
    max_nodes: 20,
    target_cpu_utilization: 0.80,    // Higher threshold
    target_memory_utilization: 0.85,
    scale_up_threshold: 0.85,
    scale_down_threshold: 0.40,      // More aggressive scale-down
    cooldown_seconds: 600,           // 10 minutes
    use_spot_instances: true,
    max_spot_ratio: 0.70,            // 70% spot instances
    ..Default::default()
};
```
**Aggressive (Performance-Optimized):**
```rust
let config = ElasticScalingConfig {
    min_nodes: 10,
    max_nodes: 200,
    target_cpu_utilization: 0.60,    // Lower threshold - more headroom
    target_memory_utilization: 0.65,
    scale_up_threshold: 0.70,
    scale_down_threshold: 0.30,
    cooldown_seconds: 180,           // 3 minutes - faster response
    use_spot_instances: true,
    max_spot_ratio: 0.30,            // Lower ratio for stability
    ..Default::default()
};
```
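How the thresholds, node bounds, and cooldown interact is easier to see as a decision function. This is an illustrative sketch, not the OxiRS scaler: `Policy`, `Decision`, and `decide` are hypothetical, and the sketch looks at CPU only and assumes a single cooldown applies to both directions:

```rust
#[derive(Debug, PartialEq)]
enum Decision {
    ScaleUp,
    ScaleDown,
    Hold,
}

/// Subset of the scaling fields shown above, copied into a local struct.
struct Policy {
    min_nodes: u32,
    max_nodes: u32,
    scale_up_threshold: f64,
    scale_down_threshold: f64,
    cooldown_seconds: u64,
}

/// Decides what to do given the current node count, CPU utilization (0.0-1.0),
/// and the time elapsed since the last scaling action.
fn decide(p: &Policy, nodes: u32, cpu: f64, secs_since_last_change: u64) -> Decision {
    // The cooldown suppresses any change, which prevents flapping.
    if secs_since_last_change < p.cooldown_seconds {
        return Decision::Hold;
    }
    if cpu > p.scale_up_threshold && nodes < p.max_nodes {
        Decision::ScaleUp
    } else if cpu < p.scale_down_threshold && nodes > p.min_nodes {
        Decision::ScaleDown
    } else {
        Decision::Hold
    }
}
```

The gap between `scale_down_threshold` and `scale_up_threshold` acts as a dead band: utilization anywhere between the two leaves the cluster untouched. Narrowing that band or shortening the cooldown makes the cluster react faster at the price of more scaling events.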
### Spot Instance Optimization
**Maximum Cost Savings:**
- Use 70-80% spot instances
- Multiple instance types for availability
- Implement graceful shutdown handlers
**Balanced Performance/Cost:**
- Use 50% spot instances
- Critical nodes on on-demand
- Maintain hot standbys
**High Availability:**
- Use 20-30% spot instances
- All critical paths on on-demand
- Spot for batch/background jobs only
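The spot percentages above map to concrete node counts as follows. The sketch uses a whole-number percentage and rounds the spot count down so the cap is never exceeded; `split_nodes` is a hypothetical helper, not an OxiRS function:

```rust
/// Splits `total` nodes into (spot, on_demand) for a spot cap given in percent.
/// Rounds down, so the spot share never exceeds the cap.
fn split_nodes(total: u32, spot_percent: u32) -> (u32, u32) {
    let pct = u64::from(spot_percent.min(100));
    // Widen to u64 so total * pct cannot overflow.
    let spot = (u64::from(total) * pct / 100) as u32;
    (spot, total - spot)
}
```

For a 20-node cluster, a 70% cap yields 14 spot and 6 on-demand nodes; a 30% cap yields 6 spot and 14 on-demand.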
---
## ML Cost Optimization
### Training Data Collection
```rust
let optimizer = MLCostOptimizer::new();
// Collect comprehensive training data
for metric in historical_metrics {
    optimizer.add_training_data(CostTrainingData {
        instance_type: metric.instance_type.clone(),
        cpu_utilization: metric.cpu,
        memory_utilization: metric.memory,
        queries_per_second: metric.qps,
        actual_cost: metric.hourly_cost,
        is_spot: metric.is_spot_instance,
        timestamp: metric.timestamp,
    }).await;
}
// Train model (requires >= 100 samples)
optimizer.train_model().await?;
```
### Cost Predictions
```rust
// Get cost prediction with confidence
let prediction = optimizer.predict_cost(&current_metrics, &config).await;
if prediction.confidence > 0.8 {
    println!("High confidence prediction:");
    println!(" Hourly cost: ${:.2}", prediction.predicted_hourly_cost);
    println!(" Monthly savings: ${:.2}", prediction.estimated_monthly_savings);
    println!(" Recommended instance: {}", prediction.recommended_instance_type);
    println!(" Recommended spot ratio: {:.1}%", prediction.recommended_spot_ratio * 100.0);
}
```
### Cost Optimization Recommendations
```rust
let recommendations = optimizer.get_recommendations(&status, &cost_optimization).await;
for rec in recommendations {
    if rec.ml_based && rec.confidence > 0.75 {
        println!("[{}] {}", rec.impact, rec.action);
        println!(" Potential savings: ${:.2}/month", rec.predicted_savings);
        println!(" {}", rec.description);
    }
}
```
---
## Monitoring & Metrics
### Key Performance Indicators
**Cluster Health:**
- Node availability: > 99.9%
- Replication lag: < 1 second
- Query latency p99: < 100ms
- Error rate: < 0.1%
**Storage Metrics:**
- Upload throughput: Monitor for degradation
- Download latency: p95 < 50ms
- Compression ratio: Track over time
- Storage costs: $ per GB per month
**Disaster Recovery:**
- Failover time: Should not exceed the configured RTO
- Data loss: Should not exceed the configured RPO
- Health check success rate: > 99%
- Replication throughput: Bytes/sec
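The latency targets above are percentile targets, so they need to be computed from raw samples rather than averages. A minimal sketch using the nearest-rank method is shown here; `percentile_ms` and `meets_latency_targets` are hypothetical helpers, and a production setup would more likely rely on histogram metrics from the monitoring system:

```rust
/// Nearest-rank percentile over latency samples in milliseconds.
/// `p` is a whole-number percentile in 0..=100; returns None for empty input.
fn percentile_ms(samples: &[u64], p: u32) -> Option<u64> {
    if samples.is_empty() || p > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Nearest rank = ceil(p/100 * n), computed in integers, at least 1.
    let rank = (p as usize * sorted.len() + 99) / 100;
    Some(sorted[rank.max(1) - 1])
}

/// Checks the two latency KPIs from the guide:
/// query p99 < 100ms and download p95 < 50ms.
fn meets_latency_targets(query_ms: &[u64], download_ms: &[u64]) -> bool {
    let p99_ok = percentile_ms(query_ms, 99).map_or(false, |v| v < 100);
    let p95_ok = percentile_ms(download_ms, 95).map_or(false, |v| v < 50);
    p99_ok && p95_ok
}
```

Note that an empty sample set is treated as a failed check here, on the assumption that missing data should raise attention rather than pass silently.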
### Prometheus Integration
```rust
let profiler = CloudOperationProfiler::new();
// Operations bracketed by start/stop calls are tracked automatically
profiler.start_operation("s3_upload");
// ... perform upload ...
profiler.stop_operation("s3_upload", bytes_uploaded, success);
// Export for Prometheus
let metrics = profiler.export_prometheus();
```
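For orientation, Prometheus scrapes metrics in a plain-text exposition format: a `# HELP` line, a `# TYPE` line, then one line per labelled series. The sketch below renders a single counter in that format. It is not the output of `export_prometheus()`, whose exact shape is not shown in this guide; `render_counter` and the `provider` label are hypothetical, and label-value escaping is omitted for brevity:

```rust
/// Renders one counter in the Prometheus text exposition format.
/// Label values are assumed to contain no quotes, backslashes, or newlines.
fn render_counter(name: &str, help: &str, labels: &[(&str, &str)], value: u64) -> String {
    let label_str = labels
        .iter()
        .map(|(k, v)| format!("{k}=\"{v}\""))
        .collect::<Vec<_>>()
        .join(",");
    // Series without labels are written as the bare metric name.
    let series = if label_str.is_empty() {
        name.to_string()
    } else {
        format!("{name}{{{label_str}}}")
    };
    format!("# HELP {name} {help}\n# TYPE {name} counter\n{series} {value}\n")
}
```

Calling `render_counter("oxirs_uploads_total", "Total uploads", &[("provider", "s3")], 42)` produces three lines ending in `oxirs_uploads_total{provider="s3"} 42`, which is the shape a scrape endpoint would serve.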
### Custom Metrics
```rust
let summary = backend.get_metrics_summary();
// Track custom metrics
metrics_registry.record("oxirs_uploads_total", summary.total_uploads);
metrics_registry.record("oxirs_upload_bytes", summary.total_upload_bytes);
metrics_registry.record("oxirs_avg_latency_ms", summary.avg_latency_ms);
metrics_registry.record("oxirs_compression_ratio", summary.compression_ratio);
```
---
## Troubleshooting
### High Latency
**Symptoms:** Query latency > 100ms p99
**Diagnosis:**
1. Check replication lag: `cluster.get_replication_lag()`
2. Monitor CPU/memory: Should be < 80%
3. Check network latency between regions
4. Review slow query logs
**Solutions:**
- Scale up if CPU/memory high
- Add read replicas for read-heavy workloads
- Use closer cloud regions
- Enable query result caching
### High Costs
**Symptoms:** Monthly cloud bill exceeding budget
**Diagnosis:**
1. Review storage tier distribution
2. Check spot instance ratio
3. Analyze data transfer costs
4. Review unused resources
**Solutions:**
```rust
// Get cost optimization recommendations
let recommendations = manager.get_cost_optimization().await;
for rec in recommendations.recommendations {
    println!("{}: {}", rec.priority, rec.recommendation);
}
```
- Increase spot instance usage to 50-70%
- Implement lifecycle policies for cold storage
- Use regional endpoints to reduce data transfer
- Right-size instance types based on actual usage
### Replication Lag
**Symptoms:** RPO violations, stale reads
**Diagnosis:**
1. Check network bandwidth between regions
2. Monitor replication queue depth
3. Review batch sizes
4. Check for network partitions
**Solutions:**
- Increase replication batch size
- Add more network bandwidth
- Use async replication for non-critical data
- Enable compression for replication traffic
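Why a larger batch size helps, and where it stops helping, can be shown with a simple model in which each batch costs one network round trip plus its transfer time. This is a back-of-the-envelope sketch with made-up inputs; `replication_throughput` is a hypothetical function and the model ignores pipelining and compression:

```rust
/// Estimated records per second when batches are sent one at a time.
/// Each batch costs one round trip plus the time to push its bytes.
fn replication_throughput(
    batch_size: u64,
    record_bytes: u64,
    rtt_ms: f64,
    bandwidth_bytes_per_sec: f64,
) -> f64 {
    let transfer_secs = (batch_size * record_bytes) as f64 / bandwidth_bytes_per_sec;
    let batch_secs = rtt_ms / 1000.0 + transfer_secs;
    batch_size as f64 / batch_secs
}
```

With 1000-byte records, a 100ms round trip, and 10 MB/s of bandwidth, batches of 100, 1000, and 10000 records give roughly 909, 5000, and 9091 records per second. The gains flatten as throughput approaches the bandwidth ceiling of 10000 records per second, at which point more bandwidth or compression is the remaining lever.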
### Failover Issues
**Symptoms:** Failover takes longer than RTO
**Diagnosis:**
1. Check health check frequency
2. Review failover threshold
3. Monitor provider availability
4. Check DNS propagation time
**Solutions:**
- Decrease health check interval
- Lower failover threshold
- Pre-warm secondary providers
- Use global load balancer for faster DNS updates
---
## Performance Benchmarks
### Cloud Storage Operations
Typical performance (measured with criterion):
```
S3 Upload (1MB): ~50ms (20 MB/s)
S3 Download (1MB): ~30ms (33 MB/s)
GCS Upload (1MB): ~55ms (18 MB/s)
Azure Upload (1MB): ~60ms (17 MB/s)
GPU Compression (1MB): ~10ms (100 MB/s)
CPU Compression (1MB): ~40ms (25 MB/s)
```
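The throughput figures in parentheses follow directly from the latencies: payload size divided by elapsed time. As a quick check, with `throughput_mb_per_s` being a hypothetical helper:

```rust
/// Converts a payload size in MB and a latency in milliseconds into MB/s.
fn throughput_mb_per_s(size_mb: f64, latency_ms: f64) -> f64 {
    size_mb * 1000.0 / latency_ms
}
```

For the 1MB S3 upload, 1 MB in 50 ms works out to 20 MB/s. These are single-request figures, so parallel uploads or downloads can exceed them in aggregate.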
### Scaling Operations
```
Disaster Recovery Failover: <30 seconds
Elastic Scaling Decision: <1 second
ML Cost Prediction: <500ms
Health Check (3 providers): <2 seconds
```
---
## Best Practices
1. **Start with conservative settings** and tune based on metrics
2. **Enable monitoring** from day one
3. **Test disaster recovery** regularly (quarterly minimum)
4. **Review costs weekly** and adjust spot ratios
5. **Train ML models** with >= 1000 samples for best predictions (training requires at least 100)
6. **Use property-based testing** for configuration validation
7. **Monitor compression ratios** and adjust levels based on data types
8. **Implement gradual rollouts** for configuration changes
9. **Keep RTO/RPO** aligned with business requirements
10. **Document all tuning changes** and their impact
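As an illustration of configuration validation, the sketch below checks the invariants that the scaling examples in this guide rely on. `ScalingSettings` and `validate` are hypothetical and mirror only a few field names; the actual OxiRS validation rules may differ:

```rust
/// A few scaling fields copied into a local struct for validation.
struct ScalingSettings {
    min_nodes: u32,
    max_nodes: u32,
    scale_up_threshold: f64,
    scale_down_threshold: f64,
    max_spot_ratio: f64,
}

/// Rejects settings that would make the scaler misbehave.
fn validate(s: &ScalingSettings) -> Result<(), String> {
    if s.min_nodes == 0 || s.min_nodes > s.max_nodes {
        return Err(format!(
            "node bounds invalid: min={} max={}",
            s.min_nodes, s.max_nodes
        ));
    }
    // NaN fails this check as well, since it is not contained in any range.
    let in_unit = |x: f64| (0.0..=1.0).contains(&x);
    if !in_unit(s.scale_up_threshold)
        || !in_unit(s.scale_down_threshold)
        || !in_unit(s.max_spot_ratio)
    {
        return Err("thresholds and ratios must lie in [0.0, 1.0]".to_string());
    }
    // Without a gap between the two thresholds the cluster would flap.
    if s.scale_down_threshold >= s.scale_up_threshold {
        return Err("scale_down_threshold must be below scale_up_threshold".to_string());
    }
    Ok(())
}
```

A property-based test would then assert, over many generated settings, that `validate` succeeds exactly when these invariants hold. The `proptest` crate is a common choice for generating such inputs in Rust.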
---
## Additional Resources
- [OxiRS Cluster README](../README.md)
- [Cloud Integration API](../src/cloud_integration.rs)
- [ML Optimization Guide](../src/ml_optimization.rs)
- [Property-Based Tests](../tests/property_based_tests.rs)
For questions or issues, please file an issue at:
https://github.com/cool-japan/oxirs/issues