# Raft Throughput Optimization Guide for Tonic gRPC in Rust
This guide provides empirically-tuned strategies to improve Raft performance using [tonic](https://docs.rs/tonic/latest/tonic/) with connection type isolation and configurable persistence. These optimizations address critical bottlenecks in consensus systems where network transport and disk I/O impact throughput and latency.
## Connection Type Strategy
We implement three distinct connection types to prevent head-of-line blocking:
| Type | Role | Operations | Characteristics |
|------|------|------------|-----------------|
| `Control` | Consensus operations | Heartbeats, Votes, Config changes | Latency-sensitive (sub-ms) |
| `Data` | Log replication | AppendEntries, Log writes | Throughput-optimized |
| `Bulk` | Large transfers | Snapshot streaming | Bandwidth-intensive |
```rust,ignore
pub(crate) enum ConnectionType {
    Control, // Elections/heartbeats
    Data,    // Log replication
    Bulk,    // Snapshot transmission
}
```
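A minimal sketch of the per-type pooling this enables, assuming `ConnectionType` also derives `Hash`, `Eq`, and `Copy` and peers are identified by `u64` ids (the cache layout is illustrative, not d-engine's internals):
```rust,ignore
use std::collections::HashMap;
use tonic::transport::{Channel, Endpoint};

// Hypothetical per-type channel pool: control traffic never shares an
// HTTP/2 connection with data or bulk traffic.
pub(crate) struct PeerChannels {
    channels: HashMap<(u64, ConnectionType), Channel>,
}

impl PeerChannels {
    pub(crate) async fn get_peer_channel(
        &mut self,
        peer_id: u64,
        conn_type: ConnectionType,
        addr: String,
    ) -> Result<Channel, tonic::transport::Error> {
        if let Some(ch) = self.channels.get(&(peer_id, conn_type)) {
            return Ok(ch.clone()); // Channel clones share one connection
        }
        let ch = Endpoint::from_shared(addr)?.connect().await?;
        self.channels.insert((peer_id, conn_type), ch.clone());
        Ok(ch)
    }
}
```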
## Persistence Strategy & Throughput/Latency Trade-offs
The choice of persistence strategy (`DiskFirst` vs. `MemFirst`) and `FlushPolicy` is the primary mechanism for balancing write throughput against write latency and durability guarantees.
### Strategy Configuration
Configure the strategy in your `config.toml`:
```toml
[raft.persistence]
# Choose based on your priority: durability (DiskFirst) or low-latency writes (MemFirst)
strategy = "DiskFirst" # Default in v0.2.3+ (Raft protocol compliance)
# For benchmark/development: strategy = "MemFirst"
# Flush policy works in conjunction with the chosen strategy.
flush_policy = { Batch = { threshold = 1000, interval_ms = 100 } }
# flush_policy = "Immediate"
```
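For reference, this TOML shape matches serde's default externally tagged enum encoding, so the config types could look like the following sketch (type and field names are assumed, not d-engine's actual definitions):
```rust,ignore
use serde::Deserialize;

// `strategy = "DiskFirst"` maps to a unit variant;
// `flush_policy = { Batch = { threshold = 1000, interval_ms = 100 } }`
// maps to a struct variant under serde's externally tagged encoding.
#[derive(Debug, Deserialize)]
pub enum PersistenceStrategy {
    DiskFirst,
    MemFirst,
}

#[derive(Debug, Deserialize)]
pub enum FlushPolicy {
    Immediate,
    Batch { threshold: usize, interval_ms: u64 },
}

#[derive(Debug, Deserialize)]
pub struct PersistenceConfig {
    pub strategy: PersistenceStrategy,
    pub flush_policy: FlushPolicy,
}
```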
### `DiskFirst` Strategy (High Durability)
- **Write Path**: An entry is written to disk _before_ being acknowledged and added to the in-memory store.
- **Latency/Throughput**: Higher, more consistent write latency per operation. Maximum throughput may be limited by disk write speed.
- **Durability**: Strong. Acknowledged entries are always on disk. No data loss on process crash.
- **Use Case**: Systems where data integrity is the absolute highest priority, even at the cost of higher write latency.
### `MemFirst` Strategy (High Throughput / Low Latency)
- **Write Path**: An entry is immediately added to the in-memory store and acknowledged. Disk persistence happens asynchronously.
- **Latency/Throughput**: Very low write latency and high throughput, as writes are batched for disk I/O.
- **Durability**: Eventual. A process crash can lose the most recently acknowledged entries that haven't been flushed yet.
- **Use Case**: Systems requiring high write performance and low latency, where a small window of potential data loss is acceptable.
### `FlushPolicy` Tuning
The flush policy controls how asynchronous (`MemFirst`) writes are persisted to disk.
- **`Immediate`**: Flush every write synchronously. Use with `MemFirst` to approximate `DiskFirst` durability (but with different failure semantics) or with `DiskFirst` for the strongest guarantees. Highest durability, lowest performance.
- **`Batch { threshold, interval_ms }`**:
- `threshold`: Flush when this many entries are pending. Larger values increase throughput but also the window of potential data loss.
- `interval_ms`: Flush after this much time has passed, even if the threshold isn't met. Limits the maximum staleness of data on disk.
**Recommendation**: For most production use cases, `MemFirst` with a `Batch` policy offers the best balance of performance and durability. Tune the `threshold` based on your acceptable data loss window (RPO) and the `interval_ms` based on your crash recovery tolerance.
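To make these semantics concrete, here is a minimal sketch of a `MemFirst` flush loop on a tokio runtime; `LogEntry` and `flush_to_disk` are placeholders for the real storage layer, not d-engine APIs:
```rust,ignore
use std::time::Duration;
use tokio::sync::mpsc;

// Entries are acknowledged once buffered; they reach disk when either
// `threshold` entries are pending (size trigger) or `interval_ms` has
// elapsed (time trigger), whichever comes first.
async fn flush_task(mut rx: mpsc::Receiver<LogEntry>, threshold: usize, interval_ms: u64) {
    let mut pending: Vec<LogEntry> = Vec::with_capacity(threshold);
    let mut ticker = tokio::time::interval(Duration::from_millis(interval_ms));
    loop {
        tokio::select! {
            maybe_entry = rx.recv() => match maybe_entry {
                Some(entry) => {
                    pending.push(entry);
                    if pending.len() >= threshold {
                        flush_to_disk(&mut pending).await; // size-triggered flush
                    }
                }
                None => break, // writer side closed; shut down
            },
            _ = ticker.tick() => {
                if !pending.is_empty() {
                    flush_to_disk(&mut pending).await; // time-triggered flush
                }
            }
        }
    }
}
```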
## Configuration Tuning
### Control Plane (`[network.control]`)
```toml
concurrency_limit = 8192
max_concurrent_streams = 500
connection_window_size = 2_097_152 # 2MB
http2_keep_alive_timeout_in_secs = 20 # Aggressive timeout
```
**Why it matters:**
- Ensures election timeouts aren't missed during load
- Prevents heartbeat delays that cause unnecessary leader changes
- Uses smaller windows for faster roundtrips
### Data Plane (`[network.data]`)
```toml
max_concurrent_streams = 500
connection_window_size = 6_291_456 # 6MB
request_timeout_in_ms = 200 # Batch-friendly
```
**Optimization rationale:**
- Larger windows accommodate log batches
- Stream count aligns with typical pipelining depth
- Timeout tuned for batch processing, not individual entries
### Bulk Plane (`[network.bulk]` - Recommended)
```toml
# SNAPSHOT-SPECIFIC SETTINGS (EXAMPLE)
connect_timeout_in_ms = 1000 # Slow-start connections
request_timeout_in_ms = 30000 # 30s for large transfers
concurrency_limit = 4 # Few concurrent streams
connection_window_size = 33_554_432 # 32MB window
```
**Snapshot considerations:**
- Requires 10-100x larger windows than data plane
- Higher timeouts for GB-range transfers
- Compression essential (Gzip enabled in implementation)
## Critical Code Implementation
### Connection Type Routing
```rust,ignore
// Control operations
membership.get_peer_channel(peer_id, ConnectionType::Control)
// Data operations
membership.get_peer_channel(peer_id, ConnectionType::Data)
// Snapshot transfers
membership.get_peer_channel(leader_id, ConnectionType::Bulk)
```
### gRPC Server Tuning
```rust,ignore
tonic::transport::Server::builder()
    .timeout(Duration::from_millis(control_config.request_timeout_in_ms))
    .max_concurrent_streams(control_config.max_concurrent_streams)
    .max_frame_size(Some(data_config.max_frame_size))
    .initial_connection_window_size(data_config.connection_window_size)
```
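The same knobs exist on the client side. A sketch of building a bulk-plane channel with tonic's `Endpoint`, where `bulk_config` stands in for however the `[network.bulk]` section is loaded:
```rust,ignore
use std::time::Duration;
use tonic::transport::Endpoint;

// One Endpoint per ConnectionType, tuned from the matching [network.*]
// section, so bulk settings never leak onto the control plane.
let bulk_channel = Endpoint::from_shared(peer_addr)?
    .connect_timeout(Duration::from_millis(bulk_config.connect_timeout_in_ms))
    .timeout(Duration::from_millis(bulk_config.request_timeout_in_ms))
    .concurrency_limit(bulk_config.concurrency_limit)
    .initial_connection_window_size(bulk_config.connection_window_size)
    .connect()
    .await?;
```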
## Performance Results (Optimization Impact)
| Metric | Before | After | Change |
|--------|--------|-------|--------|
| Throughput | 368 ops/sec | 373 ops/sec | +1.3% |
| p99 Latency | 5543 µs | 4703 µs | **-15.2%** |
| p99.9 Latency | 14015 µs | 11279 µs | -19.5% |
> **Key improvement**: 15% reduction in tail latency - critical for consensus stability
> **Note**: These metrics show the impact of connection pooling optimization. These results can be further improved by tuning the PersistenceStrategy for your specific workload.
>
> For absolute performance benchmarks, see [v0.2.3 Performance Report](https://github.com/deventlab/d-engine/tree/main/benches/reports/v0.2.3/bench_report_v0.2.3.md)
## Operational Recommendations
1. **Pre-warm connections** during node initialization
2. **Monitor connection types separately**:
```bash
# count ESTABLISHED connections per listener; ports 9081/9082 are
# examples, substitute the ports each connection type actually binds
netstat -an | grep ":9081" | grep ESTABLISHED | wc -l
netstat -an | grep ":9082" | grep ESTABLISHED
```
3. **Size bulk windows** for snapshot sizes:
```rust,ignore
connection_window_size = max_snapshot_size * 1.2
```
4. **Compress snapshots**:
```rust,ignore
.send_compressed(CompressionEncoding::Gzip)
.accept_compressed(CompressionEncoding::Gzip)
```
5. **Monitor Flush Lag**: When using `MemFirst`, monitor the difference between `last_log_index` and `durable_index`. A growing gap indicates the disk is not keeping up with writes, increasing potential data loss.
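A minimal sketch of that gauge, assuming both indexes are exported as atomics (field names here are illustrative, not d-engine's):
```rust,ignore
use std::sync::atomic::{AtomicU64, Ordering};

// Flush lag = acknowledged-but-not-yet-durable entries. A persistently
// growing value means disk flushing is falling behind the write rate.
fn flush_lag(last_log_index: &AtomicU64, durable_index: &AtomicU64) -> u64 {
    last_log_index
        .load(Ordering::Relaxed)
        .saturating_sub(durable_index.load(Ordering::Relaxed))
}

// Example alert rule: more than one full batch stuck behind the disk.
fn flush_lag_alert(lag: u64, batch_threshold: u64) -> bool {
    lag > batch_threshold
}
```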
## Anti-Patterns to Avoid
```rust,ignore
// DON'T: use the same connection for control and data
let channel = get_peer_channel(peer_id, ConnectionType::Data).await?;
client.request_vote(...); // Control operation on a data channel

// DO: strict separation
let channel = get_peer_channel(peer_id, ConnectionType::Control).await?;
client.request_vote(...);
```
```toml
# DON'T: MemFirst with an Immediate flush in high-throughput scenarios.
# It eliminates the performance benefit of MemFirst:
#   strategy = "MemFirst"
#   flush_policy = "Immediate"

# DO: use a Batch policy to amortize disk I/O cost.
[raft.persistence]
strategy = "MemFirst"
flush_policy = { Batch = { threshold = 1000, interval_ms = 100 } }
```
## Why Connection Isolation and Strategy Choice Matters
1. **Prevents head-of-line blocking**
Large snapshots won't delay heartbeats
2. **Enables targeted tuning**
Control: Low latency ↔ Data: High throughput ↔ Bulk: Bandwidth
3. **Improves fault containment**
Connection issues affect only one operation type
4. **Decouples Performance from Durability**
The `PersistenceStrategy` allows you to choose the right trade-off between fast writes (`MemFirst`) and guaranteed durability (`DiskFirst`) for your use case.
## Reference Deployment Configurations
Below are example configurations for different deployment scenarios.
Adjust values based on snapshot size, log append rate, and cluster size.
### 1. Single Node (Local Dev / Testing)
- CPU: 4 cores
- Memory: 8 GB
- Network: Localhost
```toml
[raft.persistence]
strategy = "DiskFirst" # Default in v0.2.3+
# For benchmark/development: strategy = "MemFirst"
flush_policy = { Batch = { threshold = 500, interval_ms = 50 } }
[network.control]
concurrency_limit = 10
max_concurrent_streams = 64
connection_window_size = 1_048_576 # 1MB
[network.data]
concurrency_limit = 20
max_concurrent_streams = 128
connection_window_size = 2_097_152 # 2MB
[network.bulk]
concurrency_limit = 2
connection_window_size = 8_388_608 # 8MB
request_timeout_in_ms = 10_000 # 10s
```
**Tip**: Single-node setups focus on low resource usage; bulk window size can be smaller since snapshots are local. For developer iteration speed, consider using `strategy = "MemFirst"`.
### 2. 3-Node Public Cloud Cluster (Medium Durability)
- Instance Type: 4 vCPU / 16 GB RAM (e.g., AWS m6i.xlarge, GCP n2-standard-4)
- Network: 10 Gbps
- Priority: Balanced throughput and durability
```toml
[raft.persistence]
strategy = "DiskFirst" # Default in v0.2.3+
# For high-throughput scenarios: strategy = "MemFirst"
flush_policy = { Batch = { threshold = 2000, interval_ms = 200 } }
[network.control]
concurrency_limit = 20
max_concurrent_streams = 128
connection_window_size = 2_097_152 # 2MB
[network.data]
concurrency_limit = 30
max_concurrent_streams = 256
connection_window_size = 4_194_304 # 4MB
[network.bulk]
concurrency_limit = 4
connection_window_size = 33_554_432 # 32MB
request_timeout_in_ms = 30_000 # 30s for multi-GB snapshots
```
**Tip**: For public cloud, moderate concurrency and 32MB bulk windows ensure stable snapshot streaming without affecting heartbeats. The batch policy is tuned for high throughput with a reasonable data loss window.
### 3. 5-Node High-Durability Cluster (Production)
- Instance Type: 8 vCPU / 32 GB RAM (e.g., AWS m6i.2xlarge)
- Network: 25 Gbps
- Priority: Data Integrity over Write Latency
```toml
[raft.persistence]
strategy = "DiskFirst" # Highest durability guarantee
# flush_policy is less critical for DiskFirst but can help with batching internal operations.
[network.control]
concurrency_limit = 50
max_concurrent_streams = 256
connection_window_size = 4_194_304 # 4MB
[network.data]
concurrency_limit = 80
max_concurrent_streams = 512
connection_window_size = 8_388_608 # 8MB
[network.bulk]
concurrency_limit = 8
connection_window_size = 67_108_864 # 64MB
request_timeout_in_ms = 60_000 # 60s for large snapshots
```
**Tip**: Use `DiskFirst` for systems where data integrity is non-negotiable (e.g., financial systems). Be aware that write throughput will be bound by disk I/O latency.
## Network Environment Tuning Recommendations
These parameters are primarily **network-dependent**, not CPU/memory dependent.
Adjust them based on latency, packet loss, and connection stability.
| Environment | Keep-alive interval (s) | Keep-alive timeout (s) | Connect timeout (s) | max_frame_size | Notes |
|---|---|---|---|---|---|
| **Local / In-Cluster (LAN)** | 60 | 10 | 5 | 16_777_215 (16MB) | Low latency & stable; defaults are fine |
| **Cross-Region / Stable WAN** | 60 | 15 | 8 | 16_777_215 (16MB) | Slightly longer keep-alive to avoid false disconnects |
| **Public Cloud / Moderate Loss** | 60 | 20 | 10 | 33_554_432 (32MB) | Higher interval & timeout for lossy links; larger frame helps batch logs |
| **High Latency / Unstable WAN** | 120 | 30 | 15 | 33_554_432 (32MB) | Longer timeouts prevent spurious drops |
**Guidelines:**
1. Keep-alive timeout ≈ 1/3 of the keep-alive interval (e.g., 20s timeout for a 60s interval).
2. Increase `max_frame_size` only if batch logs or snapshots exceed 16MB.
3. High-latency WAN: favor fewer reconnects over aggressive failure detection.
4. These settings are independent of CPU and memory; focus on network RTT and stability.
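As an example, here is a sketch applying the "Public Cloud / Moderate Loss" row to a tonic `Endpoint`; the `Endpoint` methods are tonic's, while the mapping of table columns to specific knobs is illustrative:
```rust,ignore
use std::time::Duration;
use tonic::transport::Endpoint;

// "Public Cloud / Moderate Loss" profile from the table above.
fn public_cloud_profile(endpoint: Endpoint) -> Endpoint {
    endpoint
        .http2_keep_alive_interval(Duration::from_secs(60))
        .keep_alive_timeout(Duration::from_secs(20))
        .connect_timeout(Duration::from_secs(10))
        .keep_alive_while_idle(true) // probe idle control connections too
}
```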
### RPC Timeout Guidance
`connect_timeout_in_ms` and `request_timeout_in_ms` depend on **network latency and I/O**, not CPU or memory.
| Environment | `connect_timeout_in_ms` | `request_timeout_in_ms` | Notes |
|---|---|---|---|
| **Local / In-Cluster (LAN)** | 50–100 | 100–300 | Very low RTT; fast retries |
| **Cross-Region / Stable WAN** | 200–500 | 300–1000 | Higher RTT, moderate batch sizes |
| **Public Cloud / Moderate Loss** | 500–1000 | 1000–5000 | Compensate for packet loss and I/O latency |
| **High Latency / Unstable WAN** | 1000+ | 5000+ | Favor fewer reconnects; allow large batch replication |
**Tips:**
1. `connect_timeout_in_ms` covers TCP+TLS+gRPC handshake; increase for high-latency links.
2. `request_timeout_in_ms` should accommodate log batches and disk write delays on followers.
3. Timeouts mainly depend on **network RTT** and **disk I/O**, not hardware compute.
## Optimizing gRPC Compression for Performance
The d-engine now supports granular control of gRPC compression settings per service type, allowing you to fine-tune your deployment for optimal performance based on your specific environment.
### Granular Compression Control
```toml
# Example configuration for AWS VPC environment
[raft.rpc_compression]
replication_response = false # High-frequency, disable for CPU optimization
election_response = true # Low-frequency, minimal CPU impact
snapshot_response = true # Large data volume, benefits from compression
cluster_response = true # Configuration data, benefits from compression
client_response = false # Improves client read/write performance
```
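A sketch of wiring these flags into client construction; `SnapshotClient` stands in for a tonic-generated client, and `config.rpc_compression` mirrors the TOML above:
```rust,ignore
use tonic::codec::CompressionEncoding;

// Always accept compressed responses for compatibility; only *send*
// compressed when the per-service flag says it pays off.
let mut client = SnapshotClient::new(channel)
    .accept_compressed(CompressionEncoding::Gzip);
if config.rpc_compression.snapshot_response {
    client = client.send_compressed(CompressionEncoding::Gzip);
}
```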
### Performance Impact
Our benchmarks show that disabling compression for high-frequency operations (replication and client requests) can yield significant performance improvements in low-latency environments:
| Environment | Throughput Improvement | Latency Improvement |
|-------------|------------------------|---------------------|
| Same-AZ VPC | 15-20% | 30-40% |
| Cross-AZ VPC | 5-10% | 10-15% |
| Cross-Region | -5% to -10% | -20% to -30% |
Note: In cross-region deployments, enabling compression for all traffic types is generally beneficial due to bandwidth constraints.
### Read Consistency and Compression
The ReadConsistencyPolicy (`LeaseRead`, `LinearizableRead`, `EventualConsistency`) works in conjunction with compression settings. For maximum performance:
1. Use `EventualConsistency` when possible for non-critical reads
2. Combine with `client_response = false` for lowest latency
3. Longer `lease_duration_ms` with `LeaseRead` reduces network round-trips
Example configuration for high-throughput read operations:
```toml
[raft.read_consistency]
default_policy = "LeaseRead"
lease_duration_ms = 500 # Longer lease duration = better performance
allow_client_override = true
[raft.rpc_compression]
client_response = false # Optimize client read performance
```
This combination provides strong consistency with minimal network overhead and no compression CPU penalty.
---
### Linearizable Read Batching Configuration
LinearizableRead batches requests to amortize consensus overhead. The key principle: **`size_threshold` should trigger quickly under load; `time_threshold` is the safety net for low concurrency.**
#### Configuration Parameters
```toml
[raft.read_consistency.read_batching]
size_threshold = 100 # Trigger point for high concurrency
time_threshold_ms = 50 # Fallback timeout for low concurrency
```
#### Core Tuning Principle
**`size_threshold` = "Large enough to batch effectively, small enough to trigger frequently"** (see the sketch after this list)
- **Too small (10-50)**: Batches too often → wasted overhead
- **Sweet spot (100-200)**: Triggers within 1-2ms under load → optimal
- **Too large (300+)**: Rarely reached → all batches wait for timeout → performance collapse
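A sketch of the decision both thresholds feed into (names are illustrative; the real batcher also tracks when the current batch was opened):
```rust,ignore
use std::time::{Duration, Instant};

// size_threshold should fire first under load; time_threshold_ms is the
// low-concurrency fallback. `opened_at` marks when the current batch
// received its first read.
fn should_flush(
    pending_reads: usize,
    opened_at: Instant,
    size_threshold: usize,
    time_threshold: Duration,
) -> bool {
    pending_reads >= size_threshold || opened_at.elapsed() >= time_threshold
}
```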
#### Quick Start Guide
**Step 1: Set `time_threshold_ms` based on network latency**
| Network Environment | Typical RTT | Recommended `time_threshold_ms` |
|---------------------|-------------|---------------------------------|
| Local/LAN | <1ms | 10-20 |
| AWS Same-Region VPC | 1-2ms | 40-60 |
| Cross-Region/WAN | 5-10ms | 80-120 |
**Step 2: Keep `size_threshold` in safe range**
```toml
size_threshold = 100 # 100-150 works for most workloads
```
**Step 3: Validate with benchmarks**
✅ **Good configuration**: Most batches flush via `size_threshold`, not timeout
❌ **Bad configuration**: P99 latency approaches `time_threshold_ms` (all batches timing out)
#### Validated Configurations
**AWS EC2 c5.2xlarge, 3-node cluster, 1000 concurrent clients:**
| Config (`size_threshold`/`time_threshold_ms`) | Throughput | P99 Latency | Verdict |
|---|---|---|---|
| `100/50ms` | 141K ops/sec | 1.09ms | ✅ Optimal |
| `100/10ms` | 138K ops/sec | 1.15ms | ✅ Good for low-latency networks |
| `120/50ms` | 1.9K ops/sec | 52ms | ❌ Threshold too high, all timeouts |
| `10/5ms` | 24K ops/sec | 5.4ms | ❌ Window too short |
#### Troubleshooting
**Throughput drops to <10K ops/sec**
→ `size_threshold` too high, reduce to 100-150
**P99 latency ≈ `time_threshold_ms`**
→ Batches timing out instead of filling, increase timeout or decrease threshold
---
### Implementation Recommendations
1. **Same-VPC AWS deployments**: Disable compression for `replication_response` and `client_response`, as in the example above. This is optimal where network latency is negligible and the CPU savings dominate.
2. **Accept Compressed vs Send Compressed**: Always keep `accept_compressed` enabled for all services to maintain compatibility with clients. This has minimal performance impact when no compressed data is received.
3. **Granular Control**: The design allows you to optimize based on actual data patterns - snapshot transfers benefit from compression regardless of environment, while high-frequency operations like replication and client responses perform better without compression in low-latency networks.
Together, these options provide a clean, configurable way to match compression behavior to each deployment's network profile.