d-engine 0.2.3

Lightweight Raft consensus engine - recommended entry point for most users
# Raft Throughput Optimization Guide for Tonic gRPC in Rust

This guide provides empirically tuned strategies to improve Raft performance using [tonic](https://docs.rs/tonic/latest/tonic/) with connection type isolation and configurable persistence. These optimizations address the critical bottlenecks in consensus systems, where network transport and disk I/O dominate throughput and latency.

## Connection Type Strategy

We implement three distinct connection types to prevent head-of-line blocking:

| **Type**  | **Purpose**          | **Critical Operations**           | **Performance Profile**    |
| --------- | -------------------- | --------------------------------- | -------------------------- |
| `Control` | Consensus operations | Heartbeats, Votes, Config changes | Latency-sensitive (sub-ms) |
| `Data`    | Log replication      | AppendEntries, Log writes         | Throughput-optimized       |
| `Bulk`    | Large transfers      | Snapshot streaming                | Bandwidth-intensive        |

```rust,ignore
pub(crate) enum ConnectionType {
    Control,  // Elections/heartbeats
    Data,     // Log replication
    Bulk,     // Snapshot transmission
}

```

## Persistence Strategy & Throughput/Latency Trade-offs

The choice of persistence strategy (`DiskFirst` vs. `MemFirst`) and `FlushPolicy` is the primary mechanism for balancing write throughput against write latency and durability guarantees.

### Strategy Configuration

Configure the strategy in your `config.toml`:

```toml
[raft.persistence]
# Choose based on your priority: durability (DiskFirst) or low-latency writes (MemFirst).
# Default in v0.2.3+: DiskFirst (Raft protocol compliance).
strategy = "DiskFirst"
# For benchmark/development: strategy = "MemFirst"

# Flush policy works in conjunction with the chosen strategy.
flush_policy = { Batch = { threshold = 1000, interval_ms = 100 } }
# flush_policy = "Immediate"

```

### `DiskFirst` Strategy (High Durability)

- **Write Path**: An entry is written to disk _before_ being acknowledged and added to the in-memory store.
- **Latency/Throughput**: Higher, more consistent write latency per operation. Maximum throughput may be limited by disk write speed.
- **Durability**: Strong. Acknowledged entries are always on disk. No data loss on process crash.
- **Use Case**: Systems where data integrity is the absolute highest priority, even at the cost of higher write latency.

### `MemFirst` Strategy (High Throughput / Low Latency)

- **Write Path**: An entry is immediately added to the in-memory store and acknowledged. Disk persistence happens asynchronously (sketched below).
- **Latency/Throughput**: Very low write latency and high throughput, as writes are batched for disk I/O.
- **Durability**: Eventual. A process crash can lose the most recently acknowledged entries that haven't been flushed yet.
- **Use Case**: Systems requiring high write performance and low latency, where a small window of potential data loss is acceptable.
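
A minimal sketch of that write path (the types, fields, and channel wiring are illustrative, not d-engine's actual internals):

```rust,ignore
// Hypothetical MemFirst append: acknowledge from memory, flush in the background.
async fn append_mem_first(store: &LogStore, entry: LogEntry) -> u64 {
    // 1. Append to the in-memory log; the caller is acknowledged immediately.
    let index = store.memory.append(entry.clone());
    // 2. Hand the entry to a background flusher. Disk I/O happens off the
    //    write path, so a crash before the flush loses this entry.
    store.flush_tx.send(entry).await.expect("flusher task alive");
    index
}
```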

### `FlushPolicy` Tuning

The flush policy controls how asynchronous (`MemFirst`) writes are persisted to disk.

- **`Immediate`**: Flush every write synchronously. Use with `MemFirst` to approximate `DiskFirst` durability (but with different failure semantics) or with `DiskFirst` for the strongest guarantees. Highest durability, lowest performance.
- **`Batch { threshold, interval_ms }`**:
  - `threshold`: Flush when this many entries are pending. Larger values increase throughput but also the window of potential data loss.
  - `interval_ms`: Flush after this much time has passed, even if the threshold isn't met. Limits the maximum staleness of data on disk.

**Recommendation**: For most production use cases, `MemFirst` with a `Batch` policy offers the best balance of performance and durability. Tune the `threshold` based on your acceptable data loss window (RPO) and the `interval_ms` based on your crash recovery tolerance.
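
The `Batch` policy reduces to an either/or trigger: flush on count or on elapsed time, whichever fires first. A minimal sketch (names are illustrative):

```rust,ignore
use std::time::{Duration, Instant};

// Returns true when the pending batch should be flushed to disk.
fn should_flush(pending: usize, last_flush: Instant, threshold: usize, interval: Duration) -> bool {
    // Size trigger: enough entries accumulated (throughput path).
    // Time trigger: bounds how stale the on-disk state can get (RPO path).
    pending >= threshold || last_flush.elapsed() >= interval
}
```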

## Configuration Tuning

### Control Plane (`[network.control]`)

```toml
concurrency_limit = 8192
max_concurrent_streams = 500
connection_window_size = 2_097_152  # 2MB
http2_keep_alive_timeout_in_secs = 20  # Aggressive timeout

```

**Why it matters:**

- Ensures election timeouts aren't missed during load
- Prevents heartbeat delays that cause unnecessary leader changes
- Uses smaller windows for faster roundtrips
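
On the client side, these values map onto tonic's `Endpoint` builder. A sketch of a control-plane channel under the settings above (the address and keep-alive interval are illustrative):

```rust,ignore
use std::time::Duration;
use tonic::transport::Endpoint;

// Control plane: small window, aggressive keep-alive, high concurrency limit.
let channel = Endpoint::from_static("http://10.0.0.2:9081")
    .concurrency_limit(8192)
    .initial_connection_window_size(2 * 1024 * 1024)    // 2MB
    .http2_keep_alive_interval(Duration::from_secs(10))
    .keep_alive_timeout(Duration::from_secs(20))
    .connect()
    .await?;
```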

### Data Plane (`[network.data]`)

```toml
max_concurrent_streams = 500
connection_window_size = 6_291_456  # 6MB
request_timeout_in_ms = 200          # Batch-friendly

```

**Optimization rationale:**

- Larger windows accommodate log batches
- Stream count aligns with typical pipelining depth
- Timeout tuned for batch processing, not individual entries

### Bulk Plane (`[network.bulk]` - Recommended)

```toml
# SNAPSHOT-SPECIFIC SETTINGS (EXAMPLE)
connect_timeout_in_ms = 1000         # Slow-start connections
request_timeout_in_ms = 30000        # 30s for large transfers
concurrency_limit = 4                # Few concurrent streams
connection_window_size = 33_554_432  # 32MB window

```

**Snapshot considerations:**

- Requires 10-100x larger windows than data plane
- Higher timeouts for GB-range transfers
- Compression essential (Gzip enabled in implementation)
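
Putting the bulk settings together: a sketch of a bulk channel with gzip enabled on the generated client (`SnapshotClient` stands in for your generated gRPC client; the address is illustrative):

```rust,ignore
use std::time::Duration;
use tonic::codec::CompressionEncoding;
use tonic::transport::Endpoint;

// Bulk plane: large window, generous timeouts, compression both ways.
let channel = Endpoint::from_static("http://10.0.0.2:9083")
    .connect_timeout(Duration::from_millis(1000))
    .timeout(Duration::from_secs(30))
    .initial_connection_window_size(32 * 1024 * 1024)   // 32MB
    .connect()
    .await?;

let client = SnapshotClient::new(channel)
    .send_compressed(CompressionEncoding::Gzip)
    .accept_compressed(CompressionEncoding::Gzip);
```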

## Critical Code Implementation

### Connection Type Routing

```rust,ignore
// Control operations
membership.get_peer_channel(peer_id, ConnectionType::Control)

// Data operations
membership.get_peer_channel(peer_id, ConnectionType::Data)

// Snapshot transfers
membership.get_peer_channel(leader_id, ConnectionType::Bulk)

```

### gRPC Server Tuning

```rust,ignore
// Each plane runs its own server instance; the relevant fields are shown here.
tonic::transport::Server::builder()
    // Per-request deadline (control plane: heartbeats, votes)
    .timeout(Duration::from_millis(control_config.request_timeout_in_ms))
    // Cap on concurrent HTTP/2 streams per connection
    .max_concurrent_streams(control_config.max_concurrent_streams)
    // Larger frames let AppendEntries batches travel in fewer frames (data plane)
    .max_frame_size(Some(data_config.max_frame_size))
    // Per-connection flow-control window (data plane)
    .initial_connection_window_size(data_config.connection_window_size)

```

## Performance Results (Optimization Impact)

| **Metric**    | **Before**  | **After**   | **Delta**  |
| ------------- | ----------- | ----------- | ---------- |
| Throughput    | 368 ops/sec | 373 ops/sec | +1.3%      |
| p99 Latency   | 5543 µs     | 4703 µs     | **-15.2%** |
| p99.9 Latency | 14015 µs    | 11279 µs    | -19.5%     |

> **Key improvement**: 15% reduction in tail latency - critical for consensus stability  
> **Note**: These metrics show the impact of the connection pooling optimization alone; results can be improved further by tuning the `PersistenceStrategy` for your workload.
>
> For absolute performance benchmarks, see the [v0.2.3 Performance Report](https://github.com/deventlab/d-engine/tree/main/benches/reports/v0.2.3/bench_report_v0.2.3.md)

## Operational Recommendations

1. **Pre-warm connections** during node initialization
2. **Monitor connection types separately**:

   ```bash
   # Control plane
   netstat -an | grep ":9081" | grep ESTABLISHED | wc -l

   # Data plane
   netstat -an | grep ":9082" | grep ESTABLISHED | wc -l

   ```

3. **Size bulk windows** for snapshot sizes:

   ```rust,ignore
   // Rule of thumb (pseudocode): size the bulk window ~20% above the
   // largest snapshot you expect to stream.
   connection_window_size = max_snapshot_size * 1.2

   ```

4. **Compress snapshots**:

   ```rust,ignore
   .send_compressed(CompressionEncoding::Gzip)
   .accept_compressed(CompressionEncoding::Gzip)

   ```

5. **Monitor Flush Lag**: When using `MemFirst`, monitor the difference between `last_log_index` and `durable_index`. A growing gap indicates the disk is not keeping up with writes, increasing potential data loss.
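
   A sketch of such a check (the `last_log_index`/`durable_index` accessors and the alert threshold are assumptions, not d-engine's exact API):

   ```rust,ignore
   // Flush lag = entries acknowledged in memory but not yet durable on disk.
   let flush_lag = store.last_log_index().saturating_sub(store.durable_index());
   if flush_lag > 10_000 {
       warn!(flush_lag, "disk is falling behind; acknowledged entries at risk");
   }
   ```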

## Anti-Patterns to Avoid

```rust,ignore
// DON'T: use the same connection type for control and data
let channel = get_peer_channel(peer_id, ConnectionType::Data).await?;
client.request_vote(...)  // Control operation on a data channel

// DO: strict separation
let channel = get_peer_channel(peer_id, ConnectionType::Control).await?;
client.request_vote(...)

// DON'T (config): MemFirst with an Immediate flush eliminates MemFirst's
// performance benefit:
//   [raft.persistence]
//   strategy = "MemFirst"
//   flush_policy = "Immediate"   # Anti-pattern
//
// DO (config): use a Batch policy to amortize disk I/O cost:
//   [raft.persistence]
//   strategy = "MemFirst"
//   flush_policy = { Batch = { threshold = 1000, interval_ms = 100 } }

```

## Why Connection Isolation and Strategy Choice Matters

1. **Prevents head-of-line blocking**
   Large snapshots won't delay heartbeats
2. **Enables targeted tuning**
   Control: Low latency ↔ Data: High throughput ↔ Bulk: Bandwidth
3. **Improves fault containment**
   Connection issues affect only one operation type
4. **Decouples Performance from Durability**
   The `PersistenceStrategy` allows you to choose the right trade-off between fast writes (`MemFirst`) and guaranteed durability (`DiskFirst`) for your use case.

## Reference Deployment Configurations

Below are example configurations for different deployment scenarios.
Adjust values based on snapshot size, log append rate, and cluster size.

### 1. Single Node (Local Dev / Testing)

- CPU: 4 cores
- Memory: 8 GB
- Network: localhost

```toml
[raft.persistence]
strategy = "DiskFirst"  # Default in v0.2.3+
# For benchmark/development: strategy = "MemFirst"
flush_policy = { Batch = { threshold = 500, interval_ms = 50 } }

[network.control]
concurrency_limit = 10
max_concurrent_streams = 64
connection_window_size = 1_048_576   # 1MB

[network.data]
concurrency_limit = 20
max_concurrent_streams = 128
connection_window_size = 2_097_152   # 2MB

[network.bulk]
concurrency_limit = 2
connection_window_size = 8_388_608   # 8MB
request_timeout_in_ms = 10_000       # 10s

```

**Tip**: Single-node setups focus on low resource usage; bulk window size can be smaller since snapshots are local. For developer iteration speed, consider using `strategy = "MemFirst"`.

### 2. 3-Node Public Cloud Cluster (Medium Durability)

- Instance Type: 4 vCPU / 16 GB RAM (e.g., AWS m6i.xlarge, GCP n2-standard-4)
- Network: 10 Gbps
- Priority: Balanced throughput and durability

```toml
[raft.persistence]
strategy = "DiskFirst"  # Default in v0.2.3+
# For high-throughput scenarios: strategy = "MemFirst"
flush_policy = { Batch = { threshold = 2000, interval_ms = 200 } }

[network.control]
concurrency_limit = 20
max_concurrent_streams = 128
connection_window_size = 2_097_152   # 2MB

[network.data]
concurrency_limit = 30
max_concurrent_streams = 256
connection_window_size = 4_194_304   # 4MB

[network.bulk]
concurrency_limit = 4
connection_window_size = 33_554_432  # 32MB
request_timeout_in_ms = 30_000       # 30s for multi-GB snapshots

```

**Tip**: For public cloud, moderate concurrency and 32MB bulk windows ensure stable snapshot streaming without affecting heartbeats. The batch policy is tuned for high throughput with a reasonable data loss window.

### 3. 5-Node High-Durability Cluster (Production)

- Instance Type: 8 vCPU / 32 GB RAM (e.g., AWS m6i.2xlarge)
- Network: 25 Gbps
- Priority: Data integrity over write latency

```toml
[raft.persistence]
strategy = "DiskFirst" # Highest durability guarantee
# flush_policy is less critical for DiskFirst but can help with batching internal operations.

[network.control]
concurrency_limit = 50
max_concurrent_streams = 256
connection_window_size = 4_194_304   # 4MB

[network.data]
concurrency_limit = 80
max_concurrent_streams = 512
connection_window_size = 8_388_608   # 8MB

[network.bulk]
concurrency_limit = 8
connection_window_size = 67_108_864  # 64MB
request_timeout_in_ms = 60_000       # 60s for large snapshots

```

**Tip**: Use `DiskFirst` for systems where data integrity is non-negotiable (e.g., financial systems). Be aware that write throughput will be bound by disk I/O latency.

## Network Environment Tuning Recommendations

These parameters are primarily **network-dependent**, not CPU/memory dependent.

Adjust them based on latency, packet loss, and connection stability.

| **Environment**                  | **tcp_keepalive_in_secs** | **http2_keep_alive_interval_in_secs** | **http2_keep_alive_timeout_in_secs** | **max_frame_size** | **Notes**                                                                |
| -------------------------------- | ------------------------- | ------------------------------------- | ------------------------------------ | ------------------ | ------------------------------------------------------------------------ |
| **Local / In-Cluster (LAN)**     | 60                        | 10                                    | 5                                    | 16_777_215 (16MB)  | Low latency & stable; defaults are fine                                  |
| **Cross-Region / Stable WAN**    | 60                        | 15                                    | 8                                    | 16_777_215 (16MB)  | Slightly longer keep-alive to avoid false disconnects                    |
| **Public Cloud / Moderate Loss** | 60                        | 20                                    | 10                                   | 33_554_432 (32MB)  | Higher interval & timeout for lossy links; larger frame helps batch logs |
| **High Latency / Unstable WAN**  | 120                       | 30                                    | 15                                   | 33_554_432 (32MB)  | Longer timeouts prevent spurious drops                                   |

**Guidelines:**

1. Set the keep-alive timeout to roughly half the keep-alive interval (the rows above follow this ratio).
2. Increase `max_frame_size` only if batch logs or snapshots exceed 16MB.
3. High-latency WAN: favor fewer reconnects over aggressive failure detection.
4. These settings are independent of CPU and memory; focus on network RTT and stability.
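
In tonic, these knobs live on the `Endpoint` builder. A sketch using the "Public Cloud / Moderate Loss" row from the table (the address is illustrative):

```rust,ignore
use std::time::Duration;
use tonic::transport::Endpoint;

let endpoint = Endpoint::from_static("http://10.0.0.2:9082")
    .tcp_keepalive(Some(Duration::from_secs(60)))        // TCP-level probe
    .http2_keep_alive_interval(Duration::from_secs(20))  // HTTP/2 PING cadence
    .keep_alive_timeout(Duration::from_secs(10));        // PING ack deadline
```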

### RPC Timeout Guidance

`connect_timeout_in_ms` and `request_timeout_in_ms` depend on **network latency and I/O**, not CPU or memory.

| **Environment**                  | **connect_timeout_in_ms** | **request_timeout_in_ms** | **Notes**                                             |
| -------------------------------- | ------------------------- | ------------------------- | ----------------------------------------------------- |
| **Local / In-Cluster (LAN)**     | 50–100                    | 100–300                   | Very low RTT; fast retries                            |
| **Cross-Region / Stable WAN**    | 200–500                   | 300–1000                  | Higher RTT, moderate batch sizes                      |
| **Public Cloud / Moderate Loss** | 500–1000                  | 1000–5000                 | Compensate for packet loss and I/O latency            |
| **High Latency / Unstable WAN**  | 1000+                     | 5000+                     | Favor fewer reconnects; allow large batch replication |

**Tips:**

1. `connect_timeout_in_ms` covers TCP+TLS+gRPC handshake; increase for high-latency links.
2. `request_timeout_in_ms` should accommodate log batches and disk write delays on followers.
3. Timeouts mainly depend on **network RTT** and **disk I/O**, not hardware compute.
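
A sketch of both timeouts on a tonic `Endpoint`, using mid-range cross-region values from the table (the address is illustrative):

```rust,ignore
use std::time::Duration;
use tonic::transport::Endpoint;

let endpoint = Endpoint::from_static("http://peer.example.com:9082")
    .connect_timeout(Duration::from_millis(500)) // TCP + TLS + gRPC handshake budget
    .timeout(Duration::from_millis(1000));       // per-request deadline
```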

## Optimizing gRPC Compression for Performance

d-engine supports granular control of gRPC compression per service type, allowing you to fine-tune a deployment for its specific environment.

### Granular Compression Control

```toml
# Example configuration for AWS VPC environment
[raft.rpc_compression]
replication_response = false  # High-frequency, disable for CPU optimization
election_response = true      # Low-frequency, minimal CPU impact
snapshot_response = true      # Large data volume, benefits from compression
cluster_response = true       # Configuration data, benefits from compression
client_response = false       # Improves client read/write performance
```
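
A hypothetical sketch of applying these flags when building the generated service wrapper (`RaftServiceServer` and `cfg` are illustrative names; `send_compressed`/`accept_compressed` are tonic's standard codegen methods):

```rust,ignore
use tonic::codec::CompressionEncoding;

// Always accept compressed requests for client compatibility; compress
// responses only where the config says the CPU trade-off is worth it.
let mut svc = RaftServiceServer::new(raft_service)
    .accept_compressed(CompressionEncoding::Gzip);
if cfg.snapshot_response {
    svc = svc.send_compressed(CompressionEncoding::Gzip); // compress responses
}
```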

### Performance Impact

Our benchmarks show that disabling compression for high-frequency operations (replication and client requests) can yield significant performance improvements in low-latency environments:

| Scenario     | CPU Savings | Throughput Improvement |
| ------------ | ----------- | ---------------------- |
| Same-AZ VPC  | 15-20%      | 30-40%                 |
| Cross-AZ VPC | 5-10%       | 10-15%                 |
| Cross-Region | -5% to -10% | -20% to -30%           |

Note: In cross-region deployments, enabling compression for all traffic types is generally beneficial due to bandwidth constraints.

### Read Consistency and Compression

The ReadConsistencyPolicy (`LeaseRead`, `LinearizableRead`, `EventualConsistency`) works in conjunction with compression settings. For maximum performance:

1. Use `EventualConsistency` when possible for non-critical reads
2. Combine with `client_response = false` for lowest latency
3. Longer `lease_duration_ms` with `LeaseRead` reduces network round-trips

Example configuration for high-throughput read operations:

```toml
[raft.read_consistency]
default_policy = "LeaseRead"
lease_duration_ms = 500  # Longer lease duration = better performance
allow_client_override = true

[raft.rpc_compression]
client_response = false  # Optimize client read performance

```

This combination provides strong consistency with minimal network overhead and no compression CPU penalty.

---

### Linearizable Read Batching Configuration

LinearizableRead batches requests to amortize consensus overhead. The key principle: **`size_threshold` should trigger quickly under load; `time_threshold` is the safety net for low concurrency.**

#### Configuration Parameters

```toml
[raft.read_consistency.read_batching]
size_threshold = 100      # Trigger point for high concurrency
time_threshold_ms = 50    # Fallback timeout for low concurrency
```

#### Core Tuning Principle

**`size_threshold` = "Large enough to batch effectively, small enough to trigger frequently"**

- **Too small (10-50)**: Batches too often → wasted overhead
- **Sweet spot (100-200)**: Triggers within 1-2ms under load → optimal
- **Too large (300+)**: Rarely reached → all batches wait for timeout → performance collapse
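
Conceptually, the batcher is one loop that flushes on whichever trigger fires first. A sketch with tokio (`ReadRequest`, `flush`, and the channel wiring are illustrative, not d-engine's internals):

```rust,ignore
use std::time::Duration;
use tokio::sync::mpsc;

async fn batch_reads(mut rx: mpsc::Receiver<ReadRequest>, size_threshold: usize, time_threshold: Duration) {
    let mut batch = Vec::new();
    let mut ticker = tokio::time::interval(time_threshold);
    loop {
        tokio::select! {
            Some(req) = rx.recv() => {
                batch.push(req);
                if batch.len() >= size_threshold {
                    flush(std::mem::take(&mut batch)).await; // size trigger (hot path)
                }
            }
            _ = ticker.tick() => {
                if !batch.is_empty() {
                    flush(std::mem::take(&mut batch)).await; // time fallback (low load)
                }
            }
        }
    }
}
```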

#### Quick Start Guide

**Step 1: Set `time_threshold_ms` based on network latency**

| Environment         | RTT    | Recommended `time_threshold_ms` |
| ------------------- | ------ | ------------------------------- |
| Local/LAN           | <1ms   | 10-20                           |
| AWS Same-Region VPC | 1-2ms  | 40-60                           |
| Cross-Region/WAN    | 5-10ms | 80-120                          |

**Step 2: Keep `size_threshold` in safe range**

```toml
size_threshold = 100   # safe range: 100-150 for most workloads
```

**Step 3: Validate with benchmarks**

✅ **Good configuration**: Most batches flush via `size_threshold`, not timeout  
❌ **Bad configuration**: P99 latency approaches `time_threshold_ms` (all batches timing out)

#### Validated Configurations

**AWS EC2 c5.2xlarge, 3-node cluster, 1000 concurrent clients:**

| Config     | Throughput   | P99 Latency | Result                              |
| ---------- | ------------ | ----------- | ----------------------------------- |
| `100/50ms` | 141K ops/sec | 1.09ms      | Optimal                             |
| `100/10ms` | 138K ops/sec | 1.15ms      | Good for low-latency networks       |
| `120/50ms` | 1.9K ops/sec | 52ms        | ❌ Threshold too high, all timeouts |
| `10/5ms`   | 24K ops/sec  | 5.4ms       | ❌ Window too short                 |

#### Troubleshooting

**Throughput drops to <10K ops/sec**  
→ `size_threshold` too high, reduce to 100-150

**P99 latency ≈ `time_threshold_ms`**  
→ Batches timing out instead of filling, increase timeout or decrease threshold

---

### Implementation Recommendations

1. **Same-VPC AWS deployments**: Disable compression for `replication_response` and `client_response`, as in the example configuration above. This is optimal where network latency is negligible.

2. **Accept Compressed vs Send Compressed**: Always keep `accept_compressed` enabled for all services to maintain compatibility with clients. This has minimal performance impact when no compressed data is received.

3. **Granular Control**: The design allows you to optimize based on actual data patterns - snapshot transfers benefit from compression regardless of environment, while high-frequency operations like replication and client responses perform better without compression in low-latency networks.

Together, these options give you a clean, configurable way to match compression behavior to your actual traffic patterns and network environment.