d-engine 0.2.3

Lightweight Raft consensus engine - recommended entry point for most users
# Raft Throughput Optimization Guide for Tonic gRPC in Rust

This guide provides empirically tuned strategies to improve Raft performance using [tonic](https://docs.rs/tonic/latest/tonic/) with connection type isolation and configurable persistence. These optimizations address the critical bottlenecks in consensus systems, where network transport and disk I/O dominate throughput and latency.

## Connection Type Strategy

We implement three distinct connection types to prevent head-of-line blocking:

| **Type**  | **Purpose**          | **Critical Operations**           | **Performance Profile**    |
| --------- | -------------------- | --------------------------------- | -------------------------- |
| `Control` | Consensus operations | Heartbeats, Votes, Config changes | Latency-sensitive (sub-ms) |
| `Data`    | Log replication      | AppendEntries, Log writes         | Throughput-optimized       |
| `Bulk`    | Large transfers      | Snapshot streaming                | Bandwidth-intensive        |

```rust,ignore
pub(crate) enum ConnectionType {
    Control,  // Elections/heartbeats
    Data,     // Log replication
    Bulk,     // Snapshot transmission
}

```

## Persistence Strategy & Throughput/Latency Trade-offs

The choice of persistence strategy (`DiskFirst` vs. `MemFirst`) and `FlushPolicy` is the primary mechanism for balancing write throughput against write latency and durability guarantees.

### Strategy Configuration

Configure the strategy in your `config.toml`:

```toml
[raft.persistence]
# Choose based on your priority: durability (DiskFirst) or low-latency writes (MemFirst).
# Default in v0.2.3+: DiskFirst (Raft protocol compliance).
strategy = "DiskFirst"
# For benchmark/development: strategy = "MemFirst"

# Flush policy works in conjunction with the chosen strategy.
flush_policy = { Batch = { threshold = 1000, interval_ms = 100 } }
# flush_policy = "Immediate"

```

### `DiskFirst` Strategy (High Durability)

- **Write Path**: An entry is written to disk _before_ being acknowledged and added to the in-memory store.
- **Latency/Throughput**: Higher, more consistent write latency per operation. Maximum throughput may be limited by disk write speed.
- **Durability**: Strong. Acknowledged entries are always on disk. No data loss on process crash.
- **Use Case**: Systems where data integrity is the absolute highest priority, even at the cost of higher write latency.

### `MemFirst` Strategy (High Throughput / Low Latency)

- **Write Path**: An entry is immediately added to the in-memory store and acknowledged. Disk persistence happens asynchronously (sketched below).
- **Latency/Throughput**: Very low write latency and high throughput, as writes are batched for disk I/O.
- **Durability**: Eventual. A process crash can lose the most recently acknowledged entries that haven't been flushed yet.
- **Use Case**: Systems requiring high write performance and low latency, where a small window of potential data loss is acceptable.
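
A minimal sketch of that write path (the types, fields, and channel wiring are illustrative, not d-engine's actual internals):

```rust,ignore
// Hypothetical MemFirst append: acknowledge from memory, flush in the background.
async fn append_mem_first(store: &LogStore, entry: LogEntry) -> u64 {
    // 1. Append to the in-memory log; the caller is acknowledged immediately.
    let index = store.memory.append(entry.clone());
    // 2. Hand the entry to a background flusher. Disk I/O happens off the
    //    write path, so a crash before the flush loses this entry.
    store.flush_tx.send(entry).await.expect("flusher task alive");
    index
}
```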

### `FlushPolicy` Tuning

The flush policy controls how asynchronous (`MemFirst`) writes are persisted to disk.

- **`Immediate`**: Flush every write synchronously. Use with `MemFirst` to approximate `DiskFirst` durability (but with different failure semantics) or with `DiskFirst` for the strongest guarantees. Highest durability, lowest performance.
- **`Batch { threshold, interval_ms }`**:
  - `threshold`: Flush when this many entries are pending. Larger values increase throughput but also the window of potential data loss.
  - `interval_ms`: Flush after this much time has passed, even if the threshold isn't met. Limits the maximum staleness of data on disk.

**Recommendation**: For most production use cases, `MemFirst` with a `Batch` policy offers the best balance of performance and durability. Tune the `threshold` based on your acceptable data loss window (RPO) and the `interval_ms` based on your crash recovery tolerance.
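
The `Batch` policy reduces to an either/or trigger: flush on count or on elapsed time, whichever fires first. A minimal sketch (names are illustrative):

```rust,ignore
use std::time::{Duration, Instant};

// Returns true when the pending batch should be flushed to disk.
fn should_flush(pending: usize, last_flush: Instant, threshold: usize, interval: Duration) -> bool {
    // Size trigger: enough entries accumulated (throughput path).
    // Time trigger: bounds how stale the on-disk state can get (RPO path).
    pending >= threshold || last_flush.elapsed() >= interval
}
```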

## Configuration Tuning

### Control Plane (`[network.control]`)

```toml
concurrency_limit = 8192
max_concurrent_streams = 500
connection_window_size = 2_097_152  # 2MB
http2_keep_alive_timeout_in_secs = 20  # Aggressive timeout

```

**Why it matters:**

- Ensures election timeouts aren't missed during load
- Prevents heartbeat delays that cause unnecessary leader changes
- Uses smaller windows for faster roundtrips
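
On the client side, these values map onto tonic's `Endpoint` builder. A sketch of a control-plane channel under the settings above (the address and keep-alive interval are illustrative):

```rust,ignore
use std::time::Duration;
use tonic::transport::Endpoint;

// Control plane: small window, aggressive keep-alive, high concurrency limit.
let channel = Endpoint::from_static("http://10.0.0.2:9081")
    .concurrency_limit(8192)
    .initial_connection_window_size(2 * 1024 * 1024)    // 2MB
    .http2_keep_alive_interval(Duration::from_secs(10))
    .keep_alive_timeout(Duration::from_secs(20))
    .connect()
    .await?;
```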

### Data Plane (`[network.data]`)

```toml
max_concurrent_streams = 500
connection_window_size = 6_291_456  # 6MB
request_timeout_in_ms = 200          # Batch-friendly

```

**Optimization rationale:**

- Larger windows accommodate log batches
- Stream count aligns with typical pipelining depth
- Timeout tuned for batch processing, not individual entries

### Bulk Plane (`[network.bulk]` - Recommended)

```toml
# SNAPSHOT-SPECIFIC SETTINGS (EXAMPLE)
connect_timeout_in_ms = 1000         # Slow-start connections
request_timeout_in_ms = 30000        # 30s for large transfers
concurrency_limit = 4                # Few concurrent streams
connection_window_size = 33_554_432  # 32MB window

```

**Snapshot considerations:**

- Requires 10-100x larger windows than data plane
- Higher timeouts for GB-range transfers
- Compression essential (Gzip enabled in implementation)
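
Putting the bulk settings together: a sketch of a bulk channel with gzip enabled on the generated client (`SnapshotClient` stands in for your generated gRPC client; the address is illustrative):

```rust,ignore
use std::time::Duration;
use tonic::codec::CompressionEncoding;
use tonic::transport::Endpoint;

// Bulk plane: large window, generous timeouts, compression both ways.
let channel = Endpoint::from_static("http://10.0.0.2:9083")
    .connect_timeout(Duration::from_millis(1000))
    .timeout(Duration::from_secs(30))
    .initial_connection_window_size(32 * 1024 * 1024)   // 32MB
    .connect()
    .await?;

let client = SnapshotClient::new(channel)
    .send_compressed(CompressionEncoding::Gzip)
    .accept_compressed(CompressionEncoding::Gzip);
```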

## Critical Code Implementation

### Connection Type Routing

```rust,ignore
// Control operations
membership.get_peer_channel(peer_id, ConnectionType::Control)

// Data operations
membership.get_peer_channel(peer_id, ConnectionType::Data)

// Snapshot transfers
membership.get_peer_channel(leader_id, ConnectionType::Bulk)

```

### gRPC Server Tuning

```rust,ignore
// Each plane runs its own server instance; the relevant fields are shown here.
tonic::transport::Server::builder()
    // Per-request deadline (control plane: heartbeats, votes)
    .timeout(Duration::from_millis(control_config.request_timeout_in_ms))
    // Cap on concurrent HTTP/2 streams per connection
    .max_concurrent_streams(control_config.max_concurrent_streams)
    // Larger frames let AppendEntries batches travel in fewer frames (data plane)
    .max_frame_size(Some(data_config.max_frame_size))
    // Per-connection flow-control window (data plane)
    .initial_connection_window_size(data_config.connection_window_size)

```

## Performance Results (Optimization Impact)

| **Metric**    | **Before**  | **After**   | **Delta**  |
| ------------- | ----------- | ----------- | ---------- |
| Throughput    | 368 ops/sec | 373 ops/sec | +1.3%      |
| p99 Latency   | 5543 µs     | 4703 µs     | **-15.2%** |
| p99.9 Latency | 14015 µs    | 11279 µs    | -19.5%     |

> **Key improvement**: 15% reduction in tail latency - critical for consensus stability  
> **Note**: These metrics show the impact of the connection pooling optimization alone; results can be improved further by tuning the `PersistenceStrategy` for your workload.
>
> For absolute performance benchmarks, see the [v0.2.3 Performance Report](https://github.com/deventlab/d-engine/tree/main/benches/reports/v0.2.3/bench_report_v0.2.3.md)

## Operational Recommendations

1. **Pre-warm connections** during node initialization
2. **Monitor connection types separately**:

   ```bash
   # Control plane
   netstat -an | grep ":9081" | grep ESTABLISHED | wc -l

   # Data plane
   netstat -an | grep ":9082" | grep ESTABLISHED | wc -l

   ```

3. **Size bulk windows** for snapshot sizes:

   ```rust,ignore
   // Rule of thumb (pseudocode): size the bulk window ~20% above the
   // largest snapshot you expect to stream.
   connection_window_size = max_snapshot_size * 1.2

   ```

4. **Compress snapshots**:

   ```rust,ignore
   .send_compressed(CompressionEncoding::Gzip)
   .accept_compressed(CompressionEncoding::Gzip)

   ```

5. **Monitor Flush Lag**: When using `MemFirst`, monitor the difference between `last_log_index` and `durable_index`. A growing gap indicates the disk is not keeping up with writes, increasing potential data loss.
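
   A sketch of such a check (the `last_log_index`/`durable_index` accessors and the alert threshold are assumptions, not d-engine's exact API):

   ```rust,ignore
   // Flush lag = entries acknowledged in memory but not yet durable on disk.
   let flush_lag = store.last_log_index().saturating_sub(store.durable_index());
   if flush_lag > 10_000 {
       warn!(flush_lag, "disk is falling behind; acknowledged entries at risk");
   }
   ```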

## Anti-Patterns to Avoid

```rust,ignore
// DON'T: use the same connection type for control and data
let channel = get_peer_channel(peer_id, ConnectionType::Data).await?;
client.request_vote(...)  // Control operation on a data channel

// DO: strict separation
let channel = get_peer_channel(peer_id, ConnectionType::Control).await?;
client.request_vote(...)

// DON'T (config): MemFirst with an Immediate flush eliminates MemFirst's
// performance benefit:
//   [raft.persistence]
//   strategy = "MemFirst"
//   flush_policy = "Immediate"   # Anti-pattern
//
// DO (config): use a Batch policy to amortize disk I/O cost:
//   [raft.persistence]
//   strategy = "MemFirst"
//   flush_policy = { Batch = { threshold = 1000, interval_ms = 100 } }

```

## Why Connection Isolation and Strategy Choice Matters

1. **Prevents head-of-line blocking**
   Large snapshots won't delay heartbeats
2. **Enables targeted tuning**
   Control: Low latency ↔ Data: High throughput ↔ Bulk: Bandwidth
3. **Improves fault containment**
   Connection issues affect only one operation type
4. **Decouples Performance from Durability**
   The `PersistenceStrategy` allows you to choose the right trade-off between fast writes (`MemFirst`) and guaranteed durability (`DiskFirst`) for your use case.

## Reference Deployment Configurations

Below are example configurations for different deployment scenarios.
Adjust values based on snapshot size, log append rate, and cluster size.

### 1. Single Node (Local Dev / Testing)

- CPU: 4 cores
- Memory: 8 GB
- Network: localhost

```toml
[raft.persistence]
strategy = "DiskFirst"  # Default in v0.2.3+
# For benchmark/development: strategy = "MemFirst"
flush_policy = { Batch = { threshold = 500, interval_ms = 50 } }

[network.control]
concurrency_limit = 10
max_concurrent_streams = 64
connection_window_size = 1_048_576   # 1MB

[network.data]
concurrency_limit = 20
max_concurrent_streams = 128
connection_window_size = 2_097_152   # 2MB

[network.bulk]
concurrency_limit = 2
connection_window_size = 8_388_608   # 8MB
request_timeout_in_ms = 10_000       # 10s

```

**Tip**: Single-node setups focus on low resource usage; bulk window size can be smaller since snapshots are local. For developer iteration speed, consider using `strategy = "MemFirst"`.

### 2. 3-Node Public Cloud Cluster (Medium Durability)

- Instance Type: 4 vCPU / 16 GB RAM (e.g., AWS m6i.xlarge, GCP n2-standard-4)
- Network: 10 Gbps
- Priority: Balanced throughput and durability

```toml
[raft.persistence]
strategy = "DiskFirst"  # Default in v0.2.3+
# For high-throughput scenarios: strategy = "MemFirst"
flush_policy = { Batch = { threshold = 2000, interval_ms = 200 } }

[network.control]
concurrency_limit = 20
max_concurrent_streams = 128
connection_window_size = 2_097_152   # 2MB

[network.data]
concurrency_limit = 30
max_concurrent_streams = 256
connection_window_size = 4_194_304   # 4MB

[network.bulk]
concurrency_limit = 4
connection_window_size = 33_554_432  # 32MB
request_timeout_in_ms = 30_000       # 30s for multi-GB snapshots

```

**Tip**: For public cloud, moderate concurrency and 32MB bulk windows ensure stable snapshot streaming without affecting heartbeats. The batch policy is tuned for high throughput with a reasonable data loss window.

### 3. 5-Node High-Durability Cluster (Production)

- Instance Type: 8 vCPU / 32 GB RAM (e.g., AWS m6i.2xlarge)
- Network: 25 Gbps
- Priority: Data integrity over write latency

```toml
[raft.persistence]
strategy = "DiskFirst" # Highest durability guarantee
# flush_policy is less critical for DiskFirst but can help with batching internal operations.

[network.control]
concurrency_limit = 50
max_concurrent_streams = 256
connection_window_size = 4_194_304   # 4MB

[network.data]
concurrency_limit = 80
max_concurrent_streams = 512
connection_window_size = 8_388_608   # 8MB

[network.bulk]
concurrency_limit = 8
connection_window_size = 67_108_864  # 64MB
request_timeout_in_ms = 60_000       # 60s for large snapshots

```

**Tip**: Use `DiskFirst` for systems where data integrity is non-negotiable (e.g., financial systems). Be aware that write throughput will be bound by disk I/O latency.

## Network Environment Tuning Recommendations

These parameters are primarily **network-dependent**, not CPU/memory dependent.

Adjust them based on latency, packet loss, and connection stability.

| **Environment**                  | **tcp_keepalive_in_secs** | **http2_keep_alive_interval_in_secs** | **http2_keep_alive_timeout_in_secs** | **max_frame_size** | **Notes**                                                                |
| -------------------------------- | ------------------------- | ------------------------------------- | ------------------------------------ | ------------------ | ------------------------------------------------------------------------ |
| **Local / In-Cluster (LAN)**     | 60                        | 10                                    | 5                                    | 16_777_215 (16MB)  | Low latency & stable; defaults are fine                                  |
| **Cross-Region / Stable WAN**    | 60                        | 15                                    | 8                                    | 16_777_215 (16MB)  | Slightly longer keep-alive to avoid false disconnects                    |
| **Public Cloud / Moderate Loss** | 60                        | 20                                    | 10                                   | 33_554_432 (32MB)  | Higher interval & timeout for lossy links; larger frame helps batch logs |
| **High Latency / Unstable WAN**  | 120                       | 30                                    | 15                                   | 33_554_432 (32MB)  | Longer timeouts prevent spurious drops                                   |

**Guidelines:**

1. Set the keep-alive timeout to roughly half the keep-alive interval (the rows above follow this ratio).
2. Increase `max_frame_size` only if batch logs or snapshots exceed 16MB.
3. High-latency WAN: favor fewer reconnects over aggressive failure detection.
4. These settings are independent of CPU and memory; focus on network RTT and stability.
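
In tonic, these knobs live on the `Endpoint` builder. A sketch using the "Public Cloud / Moderate Loss" row from the table (the address is illustrative):

```rust,ignore
use std::time::Duration;
use tonic::transport::Endpoint;

let endpoint = Endpoint::from_static("http://10.0.0.2:9082")
    .tcp_keepalive(Some(Duration::from_secs(60)))        // TCP-level probe
    .http2_keep_alive_interval(Duration::from_secs(20))  // HTTP/2 PING cadence
    .keep_alive_timeout(Duration::from_secs(10));        // PING ack deadline
```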

### RPC Timeout Guidance

`connect_timeout_in_ms` and `request_timeout_in_ms` depend on **network latency and I/O**, not CPU or memory.

| **Environment**                  | **connect_timeout_in_ms** | **request_timeout_in_ms** | **Notes**                                             |
| -------------------------------- | ------------------------- | ------------------------- | ----------------------------------------------------- |
| **Local / In-Cluster (LAN)**     | 50–100                    | 100–300                   | Very low RTT; fast retries                            |
| **Cross-Region / Stable WAN**    | 200–500                   | 300–1000                  | Higher RTT, moderate batch sizes                      |
| **Public Cloud / Moderate Loss** | 500–1000                  | 1000–5000                 | Compensate for packet loss and I/O latency            |
| **High Latency / Unstable WAN**  | 1000+                     | 5000+                     | Favor fewer reconnects; allow large batch replication |

**Tips:**

1. `connect_timeout_in_ms` covers TCP+TLS+gRPC handshake; increase for high-latency links.
2. `request_timeout_in_ms` should accommodate log batches and disk write delays on followers.
3. Timeouts mainly depend on **network RTT** and **disk I/O**, not hardware compute.
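
A sketch of both timeouts on a tonic `Endpoint`, using mid-range cross-region values from the table (the address is illustrative):

```rust,ignore
use std::time::Duration;
use tonic::transport::Endpoint;

let endpoint = Endpoint::from_static("http://peer.example.com:9082")
    .connect_timeout(Duration::from_millis(500)) // TCP + TLS + gRPC handshake budget
    .timeout(Duration::from_millis(1000));       // per-request deadline
```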

## Optimizing gRPC Compression for Performance

d-engine supports granular control of gRPC compression per service type, allowing you to fine-tune a deployment for its specific environment.

### Granular Compression Control

```toml
# Example configuration for AWS VPC environment
[raft.rpc_compression]
replication_response = false  # High-frequency, disable for CPU optimization
election_response = true      # Low-frequency, minimal CPU impact
snapshot_response = true      # Large data volume, benefits from compression
cluster_response = true       # Configuration data, benefits from compression
client_response = false       # Improves client read/write performance
```
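
A hypothetical sketch of applying these flags when building the generated service wrapper (`RaftServiceServer` and `cfg` are illustrative names; `send_compressed`/`accept_compressed` are tonic's standard codegen methods):

```rust,ignore
use tonic::codec::CompressionEncoding;

// Always accept compressed requests for client compatibility; compress
// responses only where the config says the CPU trade-off is worth it.
let mut svc = RaftServiceServer::new(raft_service)
    .accept_compressed(CompressionEncoding::Gzip);
if cfg.snapshot_response {
    svc = svc.send_compressed(CompressionEncoding::Gzip); // compress responses
}
```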

### Performance Impact

Our benchmarks show that disabling compression for high-frequency operations (replication and client requests) can yield significant performance improvements in low-latency environments:

| Scenario     | CPU Savings | Throughput Improvement |
| ------------ | ----------- | ---------------------- |
| Same-AZ VPC  | 15-20%      | 30-40%                 |
| Cross-AZ VPC | 5-10%       | 10-15%                 |
| Cross-Region | -5% to -10% | -20% to -30%           |

Note: In cross-region deployments, enabling compression for all traffic types is generally beneficial due to bandwidth constraints.

### Read Consistency and Compression

The ReadConsistencyPolicy (`LeaseRead`, `LinearizableRead`, `EventualConsistency`) works in conjunction with compression settings. For maximum performance:

1. Use `EventualConsistency` when possible for non-critical reads
2. Combine with `client_response = false` for lowest latency
3. Longer `lease_duration_ms` with `LeaseRead` reduces network round-trips

Example configuration for high-throughput read operations:

```toml
[raft.read_consistency]
default_policy = "LeaseRead"
lease_duration_ms = 500  # Longer lease duration = better performance
allow_client_override = true

[raft.rpc_compression]
client_response = false  # Optimize client read performance

```

This combination provides strong consistency with minimal network overhead and no compression CPU penalty.

---

### Linearizable Read Batching Configuration

LinearizableRead batches requests to amortize consensus overhead. The key principle: **`size_threshold` should trigger quickly under load; `time_threshold` is the safety net for low concurrency.**

#### Configuration Parameters

```toml
[raft.read_consistency.read_batching]
size_threshold = 100      # Trigger point for high concurrency
time_threshold_ms = 50    # Fallback timeout for low concurrency
```

#### Core Tuning Principle

**`size_threshold` = "Large enough to batch effectively, small enough to trigger frequently"**

- **Too small (10-50)**: Batches too often → wasted overhead
- **Sweet spot (100-200)**: Triggers within 1-2ms under load → optimal
- **Too large (300+)**: Rarely reached → all batches wait for timeout → performance collapse
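
Conceptually, the batcher is one loop that flushes on whichever trigger fires first. A sketch with tokio (`ReadRequest`, `flush`, and the channel wiring are illustrative, not d-engine's internals):

```rust,ignore
use std::time::Duration;
use tokio::sync::mpsc;

async fn batch_reads(mut rx: mpsc::Receiver<ReadRequest>, size_threshold: usize, time_threshold: Duration) {
    let mut batch = Vec::new();
    let mut ticker = tokio::time::interval(time_threshold);
    loop {
        tokio::select! {
            Some(req) = rx.recv() => {
                batch.push(req);
                if batch.len() >= size_threshold {
                    flush(std::mem::take(&mut batch)).await; // size trigger (hot path)
                }
            }
            _ = ticker.tick() => {
                if !batch.is_empty() {
                    flush(std::mem::take(&mut batch)).await; // time fallback (low load)
                }
            }
        }
    }
}
```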

#### Quick Start Guide

**Step 1: Set `time_threshold_ms` based on network latency**

| Environment         | RTT    | Recommended `time_threshold_ms` |
| ------------------- | ------ | ------------------------------- |
| Local/LAN           | <1ms   | 10-20                           |
| AWS Same-Region VPC | 1-2ms  | 40-60                           |
| Cross-Region/WAN    | 5-10ms | 80-120                          |

**Step 2: Keep `size_threshold` in safe range**

```toml
size_threshold = 100   # safe range: 100-150 for most workloads
```

**Step 3: Validate with benchmarks**

✅ **Good configuration**: Most batches flush via `size_threshold`, not timeout  
❌ **Bad configuration**: P99 latency approaches `time_threshold_ms` (all batches timing out)

#### Validated Configurations

**AWS EC2 c5.2xlarge, 3-node cluster, 1000 concurrent clients:**

| Config     | Throughput   | P99 Latency | Result                              |
| ---------- | ------------ | ----------- | ----------------------------------- |
| `100/50ms` | 141K ops/sec | 1.09ms      | Optimal                             |
| `100/10ms` | 138K ops/sec | 1.15ms      | Good for low-latency networks       |
| `120/50ms` | 1.9K ops/sec | 52ms        | ❌ Threshold too high, all timeouts |
| `10/5ms`   | 24K ops/sec  | 5.4ms       | ❌ Window too short                 |

#### Troubleshooting

**Throughput drops to <10K ops/sec**  
→ `size_threshold` too high, reduce to 100-150

**P99 latency ≈ `time_threshold_ms`**  
→ Batches timing out instead of filling, increase timeout or decrease threshold

---

### Implementation Recommendations

1. **Same-VPC AWS deployments**: Disable compression for `replication_response` and `client_response`, as in the example configuration above. This is optimal where network latency is negligible.

2. **Accept Compressed vs Send Compressed**: Always keep `accept_compressed` enabled for all services to maintain compatibility with clients. This has minimal performance impact when no compressed data is received.

3. **Granular Control**: The design allows you to optimize based on actual data patterns - snapshot transfers benefit from compression regardless of environment, while high-frequency operations like replication and client responses perform better without compression in low-latency networks.

Together, these options give you a clean, configurable way to match compression behavior to your actual traffic patterns and network environment.