Module throughput_optimization_guide

Expand description

§Raft Throughput Optimization Guide for Tonic gRPC in Rust

This guide provides empirically-tuned strategies to improve Raft performance using tonic with connection type isolation and configurable persistence. These optimizations address critical bottlenecks in consensus systems where network transport and disk I/O impact throughput and latency.

§Connection Type Strategy

We implement three distinct connection types to prevent head-of-line blocking:

Type	Purpose	Critical Operations	Performance Profile
`Control`	Consensus operations	Heartbeats, Votes, Config changes	Latency-sensitive (sub-ms)
`Data`	Log replication	AppendEntries, Log writes	Throughput-optimized
`Bulk`	Large transfers	Snapshot streaming	Bandwidth-intensive

pub(crate) enum ConnectionType {
    Control,  // Elections/heartbeats
    Data,     // Log replication
    Bulk,     // Snapshot transmission
}

§Persistence Strategy & Throughput/Latency Trade-offs

MemFirst is the only persistence strategy in v0.2.4+. It writes to OS page cache (process-crash safe, not power-loss safe) and flushes asynchronously.

§Strategy Configuration

[raft.persistence]
strategy = "MemFirst"
flush_policy = { Batch = { idle_flush_interval_ms = 1000 } }

§`MemFirst` Strategy

Write Path: Entry written to OS page cache immediately; IO thread flushes asynchronously.
Durability: Process crash safe (OS page cache survives). Power loss may drop the most recent unflushed entries.
Throughput: High. Writes batch naturally; no per-write fsync.

§`FlushPolicy` Tuning

Batch { idle_flush_interval_ms }: Flush after this many milliseconds of idle time.
- Lower values reduce data loss window but increase IO pressure.
- Default 1000 ms is suitable for most workloads.

Note: DiskFirst strategy was removed in v0.2.4. MemFirst writes to OS page cache and is process-crash safe but not power-loss safe — idle_flush_interval_ms controls flush frequency but does not provide fsync-level durability.

§Batching Configuration

The max_batch_size controls how many commands are drained per Raft loop iteration.

[raft.batching]
max_batch_size = 200  # default, suitable for most deployments

Deployment	Recommended	Rationale
Embedded 3-node	200	Matches typical concurrent client counts; higher values yield diminishing returns
Standalone 3-node	200	Network RTT dominates; batch size has limited impact
High concurrency (500+ clients)	500	Increase if HC Write throughput plateaus

Rule of thumb: For embedded mode, optimal max_batch_size ≈ concurrent_client_count. For standalone, keep at 200 unless profiling shows cmd_rx consistently saturated.

§Configuration Tuning

§Control Plane (`[network.control]`)

concurrency_limit = 8192
max_concurrent_streams = 500
connection_window_size = 2_097_152  # 2MB
http2_keep_alive_timeout_in_secs = 20  # Aggressive timeout

Why it matters:

Ensures election timeouts aren’t missed during load
Prevents heartbeat delays that cause unnecessary leader changes
Uses smaller windows for faster roundtrips

§Data Plane (`[network.data]`)

max_concurrent_streams = 500
connection_window_size = 6_291_456  # 6MB
request_timeout_in_ms = 200          # Batch-friendly

Optimization rationale:

Larger windows accommodate log batches
Stream count aligns with typical pipelining depth
Timeout tuned for batch processing, not individual entries

§Bulk Plane (`[network.bulk]` - Recommended)

# SNAPSHOT-SPECIFIC SETTINGS (EXAMPLE)
connect_timeout_in_ms = 1000         # Slow-start connections
request_timeout_in_ms = 30000        # 30s for large transfers
concurrency_limit = 4                # Few concurrent streams
connection_window_size = 33_554_432  # 32MB window

Snapshot considerations:

Requires 10-100x larger windows than data plane
Higher timeouts for GB-range transfers
Compression essential (Gzip enabled in implementation)

§Critical Code Implementation

§Connection Type Routing

// Control operations
membership.get_peer_channel(peer_id, ConnectionType::Control)

// Data operations
membership.get_peer_channel(peer_id, ConnectionType::Data)

// Snapshot transfers
membership.get_peer_channel(leader_id, ConnectionType::Bulk)

§gRPC Server Tuning

tonic::transport::Server::builder()
    .timeout(Duration::from_millis(control_config.request_timeout_in_ms))
    .max_concurrent_streams(control_config.max_concurrent_streams)
    .max_frame_size(Some(data_config.max_frame_size))
    .initial_connection_window_size(data_config.connection_window_size)

§Performance Results (Optimization Impact)

Metric	Before	After	Delta
Throughput	368 ops/sec	373 ops/sec	+1.3%
p99 Latency	5543 µs	4703 µs	-15.2%
p99.9 Latency	14015 µs	11279 µs	-19.5%

Key improvement: 15% reduction in tail latency - critical for consensus stability
Note: These metrics show the impact of connection pooling optimization. These results can be further improved by tuning the PersistenceStrategy for your specific workload.

For absolute performance benchmarks, see v0.2.4 Performance Report

§Operational Recommendations

Pre-warm connections during node initialization

Monitor connection types separately:

# Control plane
netstat -an | grep ":9081" | grep ESTABLISHED | wc -l

# Data plane
netstat -an | grep ":9082" | grep ESTABLISHED

Size bulk windows for snapshot sizes:

connection_window_size = max_snapshot_size * 1.2

Compress snapshots:

.send_compressed(CompressionEncoding::Gzip)
.accept_compressed(CompressionEncoding::Gzip)

Monitor Flush Lag: When using MemFirst, monitor the difference between last_log_index and durable_index. A growing gap indicates the disk is not keeping up with writes, increasing potential data loss.

§Anti-Patterns to Avoid

// DON'T: Use same connection for control and data
get_peer_channel(peer_id, ConnectionType::Data).await?;
client.request_vote(...)  // Control operation on data channel

// DO: Strict separation
get_peer_channel(peer_id, ConnectionType::Control).await?;
client.request_vote(...)

// DON'T: Set idle_flush_interval_ms too low — defeats batching.
[strategy = "MemFirst"]
flush_policy = { Batch = { idle_flush_interval_ms = 1 } } // Near-synchronous; low throughput

// DO: Use a generous idle interval to amortize disk I/O cost.
[strategy = "MemFirst"]
flush_policy = { Batch = { idle_flush_interval_ms = 1000 } }

§Why Connection Isolation and Strategy Choice Matters

Prevents head-of-line blocking Large snapshots won’t delay heartbeats
Enables targeted tuning Control: Low latency ↔ Data: High throughput ↔ Bulk: Bandwidth
Improves fault containment Connection issues affect only one operation type
Decouples Performance from Durability MemFirst with tunable idle_flush_interval_ms lets you balance write throughput against flush frequency.

§Reference Deployment Configurations

Below are example configurations for different deployment scenarios. Adjust values based on snapshot size, log append rate, and cluster size.

§1. Single Node (Local Dev / Testing)

CPU: 4 cores • Memory: 8 GB • Network: Localhost

[raft.persistence]
strategy = "MemFirst"
flush_policy = { Batch = { idle_flush_interval_ms = 1000 } }

[network.control]
concurrency_limit = 10
max_concurrent_streams = 64
connection_window_size = 1_048_576   # 1MB

[network.data]
concurrency_limit = 20
max_concurrent_streams = 128
connection_window_size = 2_097_152   # 2MB

[network.bulk]
concurrency_limit = 2
connection_window_size = 8_388_608   # 8MB
request_timeout_in_ms = 10_000       # 10s

Tip: Single-node setups focus on low resource usage; bulk window size can be smaller since snapshots are local.

§2. 3-Node Public Cloud Cluster (Medium Durability)

Instance Type: 4 vCPU / 16 GB RAM (e.g., AWS m6i.large, GCP n2-standard-4) • Network: 10 Gbps • Priority: Balanced throughput and durability

[raft.persistence]
strategy = "MemFirst"
flush_policy = { Batch = { idle_flush_interval_ms = 1000 } }

[network.control]
concurrency_limit = 20
max_concurrent_streams = 128
connection_window_size = 2_097_152   # 2MB

[network.data]
concurrency_limit = 30
max_concurrent_streams = 256
connection_window_size = 4_194_304   # 4MB

[network.bulk]
concurrency_limit = 4
connection_window_size = 33_554_432  # 32MB
request_timeout_in_ms = 30_000       # 30s for multi-GB snapshots

Tip: For public cloud, moderate concurrency and 32MB bulk windows ensure stable snapshot streaming without affecting heartbeats. The batch policy is tuned for high throughput with a reasonable data loss window.

§3. 5-Node High-Durability Cluster (Production)

Instance Type: 8 vCPU / 32 GB RAM (e.g., AWS m6i.xlarge) • Network: 25 Gbps • Priority: Data Integrity over Write Latency

[raft.persistence]
strategy = "MemFirst"
flush_policy = { Batch = { idle_flush_interval_ms = 100 } }  # More frequent flush for durability

[network.control]
concurrency_limit = 50
max_concurrent_streams = 256
connection_window_size = 4_194_304   # 4MB

[network.data]
concurrency_limit = 80
max_concurrent_streams = 512
connection_window_size = 8_388_608   # 8MB

[network.bulk]
concurrency_limit = 8
connection_window_size = 67_108_864  # 64MB
request_timeout_in_ms = 60_000       # 60s for large snapshots

Tip: For higher write persistence within a process lifecycle, lower idle_flush_interval_ms (e.g., 100ms). Note: MemFirst is not power-loss safe regardless of flush interval.

§Network Environment Tuning Recommendations

These parameters are primarily network-dependent, not CPU/memory dependent.

Adjust them based on latency, packet loss, and connection stability.

Environment	tcp_keepalive_in_secs	http2_keep_alive_interval_in_secs	http2_keep_alive_timeout_in_secs	max_frame_size	Notes
Local / In-Cluster (LAN)	60	10	5	16_777_215 (16MB)	Low latency & stable; defaults are fine
Cross-Region / Stable WAN	60	15	8	16_777_215 (16MB)	Slightly longer keep-alive to avoid false disconnects
Public Cloud / Moderate Loss	60	20	10	33_554_432 (32MB)	Higher interval & timeout for lossy links; larger frame helps batch logs
High Latency / Unstable WAN	120	30	15	33_554_432 (32MB)	Longer timeouts prevent spurious drops

Guidelines:

Keep-alive interval ≈ 1/3 of timeout.
Increase max_frame_size only if batch logs or snapshots exceed 16MB.
High-latency WAN: favor fewer reconnects over aggressive failure detection.
These settings are independent of CPU and memory; focus on network RTT and stability.

§RPC Timeout Guidance

connect_timeout_in_ms and request_timeout_in_ms depend on network latency and I/O, not CPU or memory.

Environment	connect_timeout_in_ms	request_timeout_in_ms	Notes
Local / In-Cluster (LAN)	50–100	100–300	Very low RTT; fast retries
Cross-Region / Stable WAN	200–500	300–1000	Higher RTT, moderate batch sizes
Public Cloud / Moderate Loss	500–1000	1000–5000	Compensate for packet loss and I/O latency
High Latency / Unstable WAN	1000+	5000+	Favor fewer reconnects; allow large batch replication

Tips:

connect_timeout_in_ms covers TCP+TLS+gRPC handshake; increase for high-latency links.
request_timeout_in_ms should accommodate log batches and disk write delays on followers.
Timeouts mainly depend on network RTT and disk I/O, not hardware compute.

§Optimizing gRPC Compression for Performance

The d-engine now supports granular control of gRPC compression settings per service type, allowing you to fine-tune your deployment for optimal performance based on your specific environment.

§Granular Compression Control

# Example configuration for AWS VPC environment
[raft.rpc_compression]
replication_response = false  # High-frequency, disable for CPU optimization
election_response = true      # Low-frequency, minimal CPU impact
snapshot_response = true      # Large data volume, benefits from compression
cluster_response = true       # Configuration data, benefits from compression
client_response = false       # Improves client read/write performance

§Performance Impact

Our benchmarks show that disabling compression for high-frequency operations (replication and client requests) can yield significant performance improvements in low-latency environments:

Scenario	CPU Savings	Throughput Improvement
Same-AZ VPC	15-20%	30-40%
Cross-AZ VPC	5-10%	10-15%
Cross-Region	-5% to -10%	-20% to -30%

Note: In cross-region deployments, enabling compression for all traffic types is generally beneficial due to bandwidth constraints.

§Read Consistency and Compression

The ReadConsistencyPolicy (LeaseRead, LinearizableRead, EventualConsistency) works in conjunction with compression settings. For maximum performance:

Use EventualConsistency when possible for non-critical reads
Combine with client_response = false for lowest latency
Longer lease_duration_ms with LeaseRead reduces network round-trips

Example configuration for high-throughput read operations:

[raft.read_consistency]
default_policy = "LeaseRead"
lease_duration_ms = 500  # Longer lease duration = better performance
allow_client_override = true

[raft.rpc_compression]
client_response = false  # Optimize client read performance

This combination provides strong consistency with minimal network overhead and no compression CPU penalty.

§Linearizable Read Batching Configuration

LinearizableRead batches requests to amortize consensus overhead. The key principle: size_threshold should trigger quickly under load; time_threshold is the safety net for low concurrency.

§Configuration Parameters

[raft.read_consistency.read_batching]
size_threshold = 100      # Trigger point for high concurrency
time_threshold_ms = 50    # Fallback timeout for low concurrency

§Core Tuning Principle

size_threshold = “Large enough to batch effectively, small enough to trigger frequently”

Too small (10-50): Batches too often → wasted overhead
Sweet spot (100-200): Triggers within 1-2ms under load → optimal
Too large (300+): Rarely reached → all batches wait for timeout → performance collapse

§Quick Start Guide

Step 1: Set time_threshold_ms based on network latency

Environment	RTT	Recommended `time_threshold_ms`
Local/LAN	<1ms	10-20
AWS Same-Region VPC	1-2ms	40-60
Cross-Region/WAN	5-10ms	80-120

Step 2: Keep size_threshold in safe range

size_threshold = 100-150  # Works for most workloads

Step 3: Validate with benchmarks

Good configuration: Most batches flush via size_threshold, not timeout
❌ Bad configuration: P99 latency approaches time_threshold_ms (all batches timing out)

§Validated Configurations

AWS EC2 c5.2xlarge, 3-node cluster, 1000 concurrent clients:

Config	Throughput	P99 Latency	Result
`100/50ms`	141K ops/sec	1.09ms	Optimal
`100/10ms`	138K ops/sec	1.15ms	Good for low-latency networks
`120/50ms`	1.9K ops/sec	52ms	❌ Threshold too high, all timeouts
`10/5ms`	24K ops/sec	5.4ms	❌ Window too short

§Troubleshooting

Throughput drops to <10K ops/sec
→ size_threshold too high, reduce to 100-150

P99 latency ≈ time_threshold_ms
→ Batches timing out instead of filling, increase timeout or decrease threshold

§Implementation Recommendations

For your specific AWS VPC environment: I recommend disabling compression for replication_response and client_response as you’ve already done. This is optimal for same-VPC deployments where network latency is negligible.
Accept Compressed vs Send Compressed: Always keep accept_compressed enabled for all services to maintain compatibility with clients. This has minimal performance impact when no compressed data is received.
Granular Control: The design allows you to optimize based on actual data patterns - snapshot transfers benefit from compression regardless of environment, while high-frequency operations like replication and client responses perform better without compression in low-latency networks.

This implementation provides a clean, configurable approach that follows Rust best practices and provides clear documentation for users.