# Raft Throughput Optimization Guide for Tonic gRPC in Rust
This guide provides empirically-tuned strategies to improve Raft performance using [tonic](https://docs.rs/tonic/latest/tonic/) with connection type isolation and configurable persistence. These optimizations address critical bottlenecks in consensus systems where network transport and disk I/O impact throughput and latency.
## Connection Type Strategy
We implement three distinct connection types to prevent head-of-line blocking:
| Type | Purpose | Operations | Characteristics |
|------|---------|------------|-----------------|
| `Control` | Consensus operations | Heartbeats, Votes, Config changes | Latency-sensitive (sub-ms) |
| `Data` | Log replication | AppendEntries, Log writes | Throughput-optimized |
| `Bulk` | Large transfers | Snapshot streaming | Bandwidth-intensive |
```rust,ignore
pub(crate) enum ConnectionType {
Control, // Elections/heartbeats
Data, // Log replication
Bulk, // Snapshot transmission
}
```
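
The mapping from RPC to connection type in the table above can be sketched as a single `match`. This is a minimal illustration, not d-engine's routing code: `RpcKind` and `connection_type_for` are hypothetical names, and `ConnectionType` is redeclared here only so the snippet compiles on its own.

```rust
/// Mirrors the three connection types described above (redeclared for a standalone sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnectionType {
    Control,
    Data,
    Bulk,
}

/// Hypothetical RPC categories, named here only for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RpcKind {
    Heartbeat,
    RequestVote,
    ConfigChange,
    AppendEntries,
    SnapshotChunk,
}

/// Picks the connection type for an RPC, following the table above.
pub fn connection_type_for(kind: RpcKind) -> ConnectionType {
    match kind {
        // Small, latency-sensitive consensus traffic
        RpcKind::Heartbeat | RpcKind::RequestVote | RpcKind::ConfigChange => {
            ConnectionType::Control
        }
        // Log replication batches
        RpcKind::AppendEntries => ConnectionType::Data,
        // Snapshot streaming
        RpcKind::SnapshotChunk => ConnectionType::Bulk,
    }
}
```

Keeping the mapping in one function makes it harder to send a control RPC over a data channel by accident.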
## Persistence Strategy & Throughput/Latency Trade-offs
`MemFirst` is the only persistence strategy in v0.2.4+. It writes to OS page cache (process-crash safe, not power-loss safe) and flushes asynchronously.
### Strategy Configuration
```toml
[raft.persistence]
strategy = "MemFirst"
flush_policy = { Batch = { idle_flush_interval_ms = 1000 } }
```
### `MemFirst` Strategy
- **Write Path**: Entry written to OS page cache immediately; IO thread flushes asynchronously.
- **Durability**: Process crash safe (OS page cache survives). Power loss may drop the most recent unflushed entries.
- **Throughput**: High. Writes batch naturally; no per-write fsync.
### `FlushPolicy` Tuning
- **`Batch { idle_flush_interval_ms }`**: Flush after this many milliseconds of idle time.
- Lower values reduce data loss window but increase IO pressure.
- Default `1000` ms is suitable for most workloads.
> **Note**: `DiskFirst` strategy was removed in v0.2.4. `MemFirst` writes to OS page cache and is process-crash safe but **not power-loss safe** — `idle_flush_interval_ms` controls flush frequency but does not provide fsync-level durability.
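
To make the idle-flush behaviour concrete, the sketch below models one plausible reading of `idle_flush_interval_ms`: flush once unflushed data exists and no new write has arrived for the whole interval. `IdleFlushTracker` and its methods are hypothetical names; the real IO thread may be structured differently.

```rust
use std::time::{Duration, Instant};

/// Hypothetical tracker for an idle-based flush policy (not d-engine's actual type).
pub struct IdleFlushTracker {
    idle_flush_interval: Duration,
    /// Time of the most recent unflushed write; `None` when nothing is pending.
    last_write: Option<Instant>,
}

impl IdleFlushTracker {
    pub fn new(idle_flush_interval_ms: u64) -> Self {
        Self {
            idle_flush_interval: Duration::from_millis(idle_flush_interval_ms),
            last_write: None,
        }
    }

    /// Record a write; this restarts the idle timer.
    pub fn on_write(&mut self, now: Instant) {
        self.last_write = Some(now);
    }

    /// True when data is pending and no write has arrived for the full interval.
    pub fn should_flush(&self, now: Instant) -> bool {
        match self.last_write {
            Some(t) => now.duration_since(t) >= self.idle_flush_interval,
            None => false,
        }
    }

    /// Call once the flush has completed.
    pub fn on_flushed(&mut self) {
        self.last_write = None;
    }
}
```

Under this model a steady stream of writes keeps postponing the flush, which is why a lower interval narrows the loss window at the cost of more IO.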
## Batching Configuration
The `max_batch_size` controls how many commands are drained per Raft loop iteration.
```toml
[raft.batching]
max_batch_size = 200 # default, suitable for most deployments
```
| Scenario | `max_batch_size` | Rationale |
|----------|------------------|-----------|
| Embedded 3-node | **200** | Matches typical concurrent client counts; higher values yield diminishing returns |
| Standalone 3-node | **200** | Network RTT dominates; batch size has limited impact |
| High concurrency (500+ clients) | **500** | Increase if HC Write throughput plateaus |
> **Rule of thumb**: For embedded mode, optimal `max_batch_size ≈ concurrent_client_count`. For standalone, keep at 200 unless profiling shows `cmd_rx` consistently saturated.
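
The per-iteration drain that `max_batch_size` bounds can be sketched as follows. This is an assumption-laden illustration: it uses a `std::sync::mpsc` channel as a stand-in for the command queue, and `drain_batch` is a hypothetical helper rather than a d-engine function.

```rust
use std::sync::mpsc::Receiver;

/// Drains up to `max_batch_size` queued commands without blocking.
/// Returns fewer items (possibly none) when the queue runs dry first.
pub fn drain_batch<T>(rx: &Receiver<T>, max_batch_size: usize) -> Vec<T> {
    let mut batch = Vec::with_capacity(max_batch_size);
    while batch.len() < max_batch_size {
        match rx.try_recv() {
            Ok(cmd) => batch.push(cmd),
            // Queue is empty or every sender has been dropped: stop here.
            Err(_) => break,
        }
    }
    batch
}
```

If every drain returns exactly `max_batch_size` items, the queue is saturated and a larger batch size may be worth testing.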
## Configuration Tuning
### Control Plane (`[network.control]`)
```toml
concurrency_limit = 8192
max_concurrent_streams = 500
connection_window_size = 2_097_152 # 2MB
http2_keep_alive_timeout_in_secs = 20 # Aggressive timeout
```
**Why it matters:**
- Ensures election timeouts aren't missed during load
- Prevents heartbeat delays that cause unnecessary leader changes
- Uses smaller windows for faster roundtrips
### Data Plane (`[network.data]`)
```toml
max_concurrent_streams = 500
connection_window_size = 6_291_456 # 6MB
request_timeout_in_ms = 200 # Batch-friendly
```
**Optimization rationale:**
- Larger windows accommodate log batches
- Stream count aligns with typical pipelining depth
- Timeout tuned for batch processing, not individual entries
### Bulk Plane (`[network.bulk]` - Recommended)
```toml
# SNAPSHOT-SPECIFIC SETTINGS (EXAMPLE)
connect_timeout_in_ms = 1000 # Slow-start connections
request_timeout_in_ms = 30000 # 30s for large transfers
concurrency_limit = 4 # Few concurrent streams
connection_window_size = 33_554_432 # 32MB window
```
**Snapshot considerations:**
- Needs a much larger window than the data plane (32MB in this example, versus 6MB)
- Higher timeouts for GB-range transfers
- Compression essential (Gzip enabled in implementation)
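
As a rough sanity check on window sizes, a common networking rule of thumb (not taken from this guide's benchmarks) is that a flow-control window should be at least the bandwidth-delay product of the link. The helper below is a hypothetical sketch of that arithmetic.

```rust
/// Bandwidth-delay product in bytes: the data that must be in flight
/// to keep a link of the given bandwidth busy for one round trip.
pub fn bandwidth_delay_product_bytes(bandwidth_bits_per_sec: u64, rtt_ms: u64) -> u64 {
    // bits/s -> bytes/s, then scale by the round-trip time in seconds.
    (bandwidth_bits_per_sec / 8) * rtt_ms / 1000
}
```

For a 10 Gbps link with 1 ms RTT this gives 1_250_000 bytes, which fits inside the 2MB control window shown earlier. Links with higher latency need proportionally larger windows.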
## Critical Code Implementation
### Connection Type Routing
```rust,ignore
// Control operations
membership.get_peer_channel(peer_id, ConnectionType::Control)
// Data operations
membership.get_peer_channel(peer_id, ConnectionType::Data)
// Snapshot transfers
membership.get_peer_channel(leader_id, ConnectionType::Bulk)
```
### gRPC Server Tuning
```rust,ignore
tonic::transport::Server::builder()
.timeout(Duration::from_millis(control_config.request_timeout_in_ms))
.max_concurrent_streams(control_config.max_concurrent_streams)
.max_frame_size(Some(data_config.max_frame_size))
.initial_connection_window_size(data_config.connection_window_size)
```
## Performance Results (Optimization Impact)
| Metric | Before | After | Change |
|--------|--------|-------|--------|
| Throughput | 368 ops/sec | 373 ops/sec | +1.3% |
| p99 Latency | 5543 µs | 4703 µs | **-15.2%** |
| p99.9 Latency | 14015 µs | 11279 µs | -19.5% |
> **Key improvement**: 15% reduction in tail latency - critical for consensus stability
> **Note**: These metrics show the impact of the connection pooling optimization alone. Tuning `flush_policy` for your specific workload can improve them further.
>
> For absolute performance benchmarks, see [v0.2.4 Performance Report](https://github.com/deventlab/d-engine/tree/main/benches/reports/v0.2.4/bench_report_v0.2.4.md)
## Operational Recommendations
1. **Pre-warm connections** during node initialization
2. **Monitor connection types separately**:
```bash
netstat -an | grep ":9081" | grep ESTABLISHED | wc -l
netstat -an | grep ":9082" | grep ESTABLISHED
```
3. **Size bulk windows** for snapshot sizes:
```rust,ignore
connection_window_size = max_snapshot_size * 1.2
```
4. **Compress snapshots**:
```rust,ignore
.send_compressed(CompressionEncoding::Gzip)
.accept_compressed(CompressionEncoding::Gzip)
```
5. **Monitor Flush Lag**: When using `MemFirst`, monitor the difference between `last_log_index` and `durable_index`. A growing gap indicates the disk is not keeping up with writes, increasing potential data loss.
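
The flush-lag check in item 5 can be sketched as a small sampler that raises an alert when the gap grows for several consecutive samples. `FlushLagMonitor` and `alert_after` are hypothetical names; how `last_log_index` and `durable_index` are obtained depends on your metrics setup.

```rust
/// Hypothetical monitor for the gap between the last appended and last durable log index.
pub struct FlushLagMonitor {
    previous_lag: Option<u64>,
    consecutive_increases: u32,
    /// Number of consecutive growing samples that triggers an alert.
    alert_after: u32,
}

impl FlushLagMonitor {
    pub fn new(alert_after: u32) -> Self {
        Self {
            previous_lag: None,
            consecutive_increases: 0,
            alert_after,
        }
    }

    /// Records one sample and returns `(lag, alert)`.
    pub fn observe(&mut self, last_log_index: u64, durable_index: u64) -> (u64, bool) {
        let lag = last_log_index.saturating_sub(durable_index);
        match self.previous_lag {
            // The gap grew since the previous sample.
            Some(prev) if lag > prev => self.consecutive_increases += 1,
            // First sample, or the gap held steady or shrank: reset the streak.
            _ => self.consecutive_increases = 0,
        }
        self.previous_lag = Some(lag);
        (lag, self.consecutive_increases >= self.alert_after)
    }
}
```

Alerting on a sustained trend rather than a single large value avoids noise from short write bursts.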
## Anti-Patterns to Avoid
```rust,ignore
// DON'T: Use same connection for control and data
get_peer_channel(peer_id, ConnectionType::Data).await?;
client.request_vote(...) // Control operation on data channel
// DO: Strict separation
get_peer_channel(peer_id, ConnectionType::Control).await?;
client.request_vote(...)
// The two cases below are TOML settings, shown as comments inside this Rust block.
// DON'T: Set idle_flush_interval_ms too low; it defeats batching.
// [raft.persistence]
// strategy = "MemFirst"
// flush_policy = { Batch = { idle_flush_interval_ms = 1 } }    # Near-synchronous; low throughput
// DO: Use a generous idle interval to amortize disk I/O cost.
// [raft.persistence]
// strategy = "MemFirst"
// flush_policy = { Batch = { idle_flush_interval_ms = 1000 } }
```
## Why Connection Isolation and Strategy Choice Matter
1. **Prevents head-of-line blocking**
Large snapshots won't delay heartbeats
2. **Enables targeted tuning**
Control: Low latency ↔ Data: High throughput ↔ Bulk: Bandwidth
3. **Improves fault containment**
Connection issues affect only one operation type
4. **Decouples Performance from Durability**
`MemFirst` with tunable `idle_flush_interval_ms` lets you balance write throughput against flush frequency.
## Reference Deployment Configurations
Below are example configurations for different deployment scenarios.
Adjust values based on snapshot size, log append rate, and cluster size.
### 1. Single Node (Local Dev / Testing)
- CPU: 4 cores
- Memory: 8 GB
- Network: Localhost
```toml
[raft.persistence]
strategy = "MemFirst"
flush_policy = { Batch = { idle_flush_interval_ms = 1000 } }
[network.control]
concurrency_limit = 10
max_concurrent_streams = 64
connection_window_size = 1_048_576 # 1MB
[network.data]
concurrency_limit = 20
max_concurrent_streams = 128
connection_window_size = 2_097_152 # 2MB
[network.bulk]
concurrency_limit = 2
connection_window_size = 8_388_608 # 8MB
request_timeout_in_ms = 10_000 # 10s
```
**Tip**: Single-node setups focus on low resource usage; bulk window size can be smaller since snapshots are local.
### 2. 3-Node Public Cloud Cluster (Medium Durability)
- Instance Type: 4 vCPU / 16 GB RAM (e.g., AWS m6i.xlarge, GCP n2-standard-4)
- Network: 10 Gbps
- Priority: Balanced throughput and durability
```toml
[raft.persistence]
strategy = "MemFirst"
flush_policy = { Batch = { idle_flush_interval_ms = 1000 } }
[network.control]
concurrency_limit = 20
max_concurrent_streams = 128
connection_window_size = 2_097_152 # 2MB
[network.data]
concurrency_limit = 30
max_concurrent_streams = 256
connection_window_size = 4_194_304 # 4MB
[network.bulk]
concurrency_limit = 4
connection_window_size = 33_554_432 # 32MB
request_timeout_in_ms = 30_000 # 30s for multi-GB snapshots
```
**Tip**: For public cloud, moderate concurrency and 32MB bulk windows ensure stable snapshot streaming without affecting heartbeats. The batch policy is tuned for high throughput with a reasonable data loss window.
### 3. 5-Node High-Durability Cluster (Production)
- Instance Type: 8 vCPU / 32 GB RAM (e.g., AWS m6i.2xlarge)
- Network: 25 Gbps
- Priority: Data Integrity over Write Latency
```toml
[raft.persistence]
strategy = "MemFirst"
flush_policy = { Batch = { idle_flush_interval_ms = 100 } } # More frequent flush for durability
[network.control]
concurrency_limit = 50
max_concurrent_streams = 256
connection_window_size = 4_194_304 # 4MB
[network.data]
concurrency_limit = 80
max_concurrent_streams = 512
connection_window_size = 8_388_608 # 8MB
[network.bulk]
concurrency_limit = 8
connection_window_size = 67_108_864 # 64MB
request_timeout_in_ms = 60_000 # 60s for large snapshots
```
**Tip**: For higher write persistence within a process lifecycle, lower `idle_flush_interval_ms` (e.g., 100ms). Note: `MemFirst` is not power-loss safe regardless of flush interval.
## Network Environment Tuning Recommendations
These parameters are primarily **network-dependent**, not CPU/memory dependent.
Adjust them based on latency, packet loss, and connection stability.
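| Environment | TCP keep-alive (s) | HTTP/2 keep-alive interval (s) | HTTP/2 keep-alive timeout (s) | `max_frame_size` (bytes) | Notes |
|-------------|--------------------|--------------------------------|-------------------------------|--------------------------|-------|
| **Local / In-Cluster (LAN)** | 60 | 10 | 5 | 16_777_215 (16MB) | Low latency & stable; defaults are fine |
| **Cross-Region / Stable WAN** | 60 | 15 | 8 | 16_777_215 (16MB) | Slightly longer keep-alive to avoid false disconnects |
| **Public Cloud / Moderate Loss** | 60 | 20 | 10 | 33_554_432 (32MB) | Higher interval & timeout for lossy links; larger frame helps batch logs |
| **High Latency / Unstable WAN** | 120 | 30 | 15 | 33_554_432 (32MB) | Longer timeouts prevent spurious drops |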
**Guidelines:**
1. Keep-alive interval ≈ 1/3 of timeout.
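2. Increase `max_frame_size` only if batch logs or snapshots exceed 16MB. HTTP/2 caps the frame size at 16_777_215 bytes (2^24 - 1), so confirm that your transport accepts a larger value before applying a 32MB setting.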
3. High-latency WAN: favor fewer reconnects over aggressive failure detection.
4. These settings are independent of CPU and memory; focus on network RTT and stability.
### RPC Timeout Guidance
`connect_timeout_in_ms` and `request_timeout_in_ms` depend on **network latency and I/O**, not CPU or memory.
| Environment | `connect_timeout_in_ms` | `request_timeout_in_ms` | Notes |
|-------------|-------------------------|-------------------------|-------|
| **Local / In-Cluster (LAN)** | 50–100 | 100–300 | Very low RTT; fast retries |
| **Cross-Region / Stable WAN** | 200–500 | 300–1000 | Higher RTT, moderate batch sizes |
| **Public Cloud / Moderate Loss** | 500–1000 | 1000–5000 | Compensate for packet loss and I/O latency |
| **High Latency / Unstable WAN** | 1000+ | 5000+ | Favor fewer reconnects; allow large batch replication |
**Tips:**
1. `connect_timeout_in_ms` covers TCP+TLS+gRPC handshake; increase for high-latency links.
2. `request_timeout_in_ms` should accommodate log batches and disk write delays on followers.
3. Timeouts mainly depend on **network RTT** and **disk I/O**, not hardware compute.
## Optimizing gRPC Compression for Performance
d-engine supports granular control of gRPC compression per service type, so you can tune each deployment for its own network environment.
### Granular Compression Control
```toml
# Example configuration for AWS VPC environment
[raft.rpc_compression]
replication_response = false # High-frequency, disable for CPU optimization
election_response = true # Low-frequency, minimal CPU impact
snapshot_response = true # Large data volume, benefits from compression
cluster_response = true # Configuration data, benefits from compression
client_response = false # Improves client read/write performance
```
### Performance Impact
Our benchmarks show that disabling compression for high-frequency operations (replication and client requests) can yield significant performance improvements in low-latency environments:
| Environment | | |
|-------------|---|---|
| Same-AZ VPC | 15-20% | 30-40% |
| Cross-AZ VPC | 5-10% | 10-15% |
| Cross-Region | -5% to -10% | -20% to -30% |
Note: In cross-region deployments, enabling compression for all traffic types is generally beneficial due to bandwidth constraints.
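
The guidance above can be condensed into a small selection function. This is a sketch under stated assumptions: `NetworkEnvironment`, `RpcCompression`, and `suggested_compression` are hypothetical names, and only the five field names come from the `[raft.rpc_compression]` example.

```rust
/// Hypothetical deployment categories matching the environments discussed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NetworkEnvironment {
    SameAzVpc,
    CrossAzVpc,
    CrossRegion,
}

/// Field names follow the `[raft.rpc_compression]` example; the struct itself is illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RpcCompression {
    pub replication_response: bool,
    pub election_response: bool,
    pub snapshot_response: bool,
    pub cluster_response: bool,
    pub client_response: bool,
}

/// Suggests compression flags for an environment, following the guidance above.
pub fn suggested_compression(env: NetworkEnvironment) -> RpcCompression {
    // High-frequency traffic is compressed only when bandwidth is the constraint.
    let compress_hot_path = matches!(env, NetworkEnvironment::CrossRegion);
    RpcCompression {
        replication_response: compress_hot_path,
        election_response: true,
        snapshot_response: true,
        cluster_response: true,
        client_response: compress_hot_path,
    }
}
```

Treat the output as a starting point and confirm it with your own benchmarks, since the break-even point depends on payload size and CPU headroom.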
### Read Consistency and Compression
The ReadConsistencyPolicy (`LeaseRead`, `LinearizableRead`, `EventualConsistency`) works in conjunction with compression settings. For maximum performance:
1. Use `EventualConsistency` when possible for non-critical reads
2. Combine with `client_response = false` for lowest latency
3. Longer `lease_duration_ms` with `LeaseRead` reduces network round-trips
Example configuration for high-throughput read operations:
```toml
[raft.read_consistency]
default_policy = "LeaseRead"
lease_duration_ms = 500 # Longer lease duration = better performance
allow_client_override = true
[raft.rpc_compression]
client_response = false # Optimize client read performance
```
This combination provides strong consistency with minimal network overhead and no compression CPU penalty.
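
To show why a longer lease saves round-trips, the sketch below models the general lease-read idea: while the lease is valid, the leader answers reads locally. `ReadLease` and its methods are hypothetical, and the sketch ignores clock drift, which a real implementation has to account for.

```rust
use std::time::{Duration, Instant};

/// Hypothetical leader-side lease state (not d-engine's actual type).
pub struct ReadLease {
    /// When a majority last confirmed this node's leadership.
    granted_at: Instant,
    duration: Duration,
}

impl ReadLease {
    pub fn new(granted_at: Instant, lease_duration_ms: u64) -> Self {
        Self {
            granted_at,
            duration: Duration::from_millis(lease_duration_ms),
        }
    }

    /// True while the lease is valid, so the read needs no extra round-trip.
    pub fn can_serve_locally(&self, now: Instant) -> bool {
        now.duration_since(self.granted_at) < self.duration
    }
}
```

With `lease_duration_ms = 500`, every read arriving within 500 ms of the last confirmation is served locally under this model.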
---
### Linearizable Read Batching Configuration
LinearizableRead batches requests to amortize consensus overhead. The key principle: **`size_threshold` should trigger quickly under load; `time_threshold` is the safety net for low concurrency.**
#### Configuration Parameters
```toml
[raft.read_consistency.read_batching]
size_threshold = 100 # Trigger point for high concurrency
time_threshold_ms = 50 # Fallback timeout for low concurrency
```
#### Core Tuning Principle
**`size_threshold` = "Large enough to batch effectively, small enough to trigger frequently"**
- **Too small (10-50)**: Batches too often → wasted overhead
- **Sweet spot (100-200)**: Triggers within 1-2ms under load → optimal
- **Too large (300+)**: Rarely reached → all batches wait for timeout → performance collapse
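
The interaction between the two thresholds can be sketched as a batcher with a size trigger and a timeout fallback. `ReadBatcher` is a hypothetical type written for illustration; it is not the engine's implementation.

```rust
use std::time::{Duration, Instant};

/// Hypothetical batcher: flushes on size first, on age as a fallback.
pub struct ReadBatcher<T> {
    size_threshold: usize,
    time_threshold: Duration,
    pending: Vec<T>,
    /// Arrival time of the oldest pending request.
    oldest: Option<Instant>,
}

impl<T> ReadBatcher<T> {
    pub fn new(size_threshold: usize, time_threshold_ms: u64) -> Self {
        Self {
            size_threshold,
            time_threshold: Duration::from_millis(time_threshold_ms),
            pending: Vec::new(),
            oldest: None,
        }
    }

    /// Queues a request; returns the batch once `size_threshold` is reached.
    pub fn push(&mut self, request: T, now: Instant) -> Option<Vec<T>> {
        if self.pending.is_empty() {
            self.oldest = Some(now);
        }
        self.pending.push(request);
        if self.pending.len() >= self.size_threshold {
            Some(self.take())
        } else {
            None
        }
    }

    /// Called from a timer; returns the batch once the oldest request
    /// has waited for the full time threshold.
    pub fn poll_timeout(&mut self, now: Instant) -> Option<Vec<T>> {
        match self.oldest {
            Some(t) if now.duration_since(t) >= self.time_threshold => Some(self.take()),
            _ => None,
        }
    }

    fn take(&mut self) -> Vec<T> {
        self.oldest = None;
        std::mem::take(&mut self.pending)
    }
}
```

Under load, `push` returns batches directly and `poll_timeout` rarely fires. When the size threshold is out of reach, every batch waits for the timer instead.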
#### Quick Start Guide
**Step 1: Set `time_threshold_ms` based on network latency**
| Network | Typical RTT | `time_threshold_ms` |
|---------|-------------|---------------------|
| Local/LAN | <1ms | 10-20 |
| AWS Same-Region VPC | 1-2ms | 40-60 |
| Cross-Region/WAN | 5-10ms | 80-120 |
**Step 2: Keep `size_threshold` in safe range**
```toml
size_threshold = 100 # Values in the 100-150 range work for most workloads
```
**Step 3: Validate with benchmarks**
- ✅ **Good configuration**: Most batches flush via `size_threshold`, not timeout
- ❌ **Bad configuration**: P99 latency approaches `time_threshold_ms` (all batches timing out)
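
Before running a benchmark, a quick estimate can tell you which trigger is likely to fire. The sketch assumes a steady arrival rate, which real traffic only approximates; both function names are hypothetical.

```rust
/// Milliseconds needed to collect `size_threshold` requests at a steady arrival rate.
pub fn batch_fill_time_ms(size_threshold: u32, arrivals_per_sec: f64) -> f64 {
    f64::from(size_threshold) / arrivals_per_sec * 1000.0
}

/// True when batches are expected to fill before the timeout fires.
pub fn fills_before_timeout(
    size_threshold: u32,
    time_threshold_ms: u64,
    arrivals_per_sec: f64,
) -> bool {
    batch_fill_time_ms(size_threshold, arrivals_per_sec) < time_threshold_ms as f64
}
```

At 100_000 requests per second, a threshold of 100 fills in about 1 ms, well inside a 50 ms timeout. At 1_000 requests per second the same threshold takes about 100 ms to fill, so the timeout fires first.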
#### Validated Configurations
**AWS EC2 c5.2xlarge, 3-node cluster, 1000 concurrent clients:**
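| Config (`size_threshold`/`time_threshold_ms`) | Throughput | Latency | Assessment |
|-----------------------------------------------|------------|---------|------------|
| `100/50ms` | 141K ops/sec | 1.09ms | Optimal |
| `100/10ms` | 138K ops/sec | 1.15ms | Good for low-latency networks |
| `120/50ms` | 1.9K ops/sec | 52ms | ❌ Threshold too high, all timeouts |
| `10/5ms` | 24K ops/sec | 5.4ms | ❌ Window too short |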
#### Troubleshooting
- **Throughput drops to <10K ops/sec**: `size_threshold` is too high; reduce it to 100-150.
- **P99 latency ≈ `time_threshold_ms`**: Batches are timing out instead of filling; increase the timeout or decrease the threshold.
---
### Implementation Recommendations
1. **Same-VPC AWS deployments**: Disable compression for `replication_response` and `client_response`. This is optimal where network latency is negligible.
2. **Accept Compressed vs Send Compressed**: Always keep `accept_compressed` enabled for all services to maintain compatibility with clients. This has minimal performance impact when no compressed data is received.
3. **Granular Control**: The design allows you to optimize based on actual data patterns - snapshot transfers benefit from compression regardless of environment, while high-frequency operations like replication and client responses perform better without compression in low-latency networks.