# Performance Tuning
RpcNet achieves **172,000+ requests/second** with proper configuration. This chapter provides concrete tips and techniques to maximize performance in production deployments.
## Baseline Performance
Out-of-the-box performance with default settings:

| Metric | Value | Notes |
|--------|-------|-------|
| **Throughput** | 130K-150K RPS | Single director + 3 workers |
| **Latency (P50)** | 0.5-0.8ms | With efficient connection handling |
| **Latency (P99)** | 2-5ms | Under moderate load |
| **CPU (per node)** | 40-60% | At peak throughput |
| **Memory** | 50-100MB | Per worker node |

**Target after tuning**: 172K+ RPS, < 0.5ms P50 latency, < 35% CPU
## Quick Wins
### 1. Optimize Connection Management
**Impact**: Significant throughput increase, reduced latency
```rust
use rpcnet::cluster::ClusterClientConfig;
// Use built-in connection optimization
let config = ClusterClientConfig::default();
```
**Why it works**:
- Efficient connection reuse
- Reduces handshake overhead
- Minimizes connection setup time
### 2. Use Least Connections Load Balancing
**Impact**: 15-20% throughput increase under variable load
```rust
use rpcnet::cluster::{WorkerRegistry, LoadBalancingStrategy};
// Before (Round Robin): uneven load distribution
let registry = WorkerRegistry::new(cluster, LoadBalancingStrategy::RoundRobin);
// After (Least Connections): optimal distribution
let registry = WorkerRegistry::new(cluster, LoadBalancingStrategy::LeastConnections);
```
**Why it works**:
- Prevents overloading individual workers
- Adapts to actual load in real-time
- Handles heterogeneous workers better
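
To make the difference between the two strategies concrete, here is a minimal, std-only sketch of the selection rules. It is not RpcNet's implementation: `Worker`, `active`, `pick_round_robin`, and `pick_least_connections` are hypothetical names used only for illustration.

```rust
/// Hypothetical worker record: a name plus its in-flight request count.
#[derive(Debug, Clone, PartialEq)]
struct Worker {
    name: &'static str,
    active: usize,
}

/// Round robin: ignores load and simply cycles through the list.
fn pick_round_robin(workers: &[Worker], counter: usize) -> Option<&Worker> {
    if workers.is_empty() {
        None
    } else {
        Some(&workers[counter % workers.len()])
    }
}

/// Least connections: picks the worker with the fewest in-flight requests.
/// Ties go to the earliest worker in the slice.
fn pick_least_connections(workers: &[Worker]) -> Option<&Worker> {
    workers.iter().min_by_key(|w| w.active)
}
```

With one worker stuck on slow requests, round robin keeps sending it every Nth call, while least connections routes around it until its in-flight count drops.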
### 3. Tune Gossip Interval
**Impact**: 10-15% CPU reduction, minimal latency impact
```rust
use rpcnet::cluster::ClusterConfig;
use std::time::Duration;

// Before (default 1s): higher CPU
let config = ClusterConfig::default()
    .with_gossip_interval(Duration::from_secs(1));

// After (2s for stable networks): lower CPU
let config = ClusterConfig::default()
    .with_gossip_interval(Duration::from_secs(2));
```
**Why it works**:
- Gossip overhead scales with frequency
- Stable networks don't need aggressive gossip
- Failure detection still fast enough (4-8s)
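
The 4-8s figure is consistent with a detector that waits for a few missed gossip rounds before declaring a node failed. The sketch below assumes detection time is roughly the gossip interval multiplied by the number of missed rounds; `missed_rounds` is an assumption made for illustration, not a documented RpcNet setting.

```rust
use std::time::Duration;

/// Rough failure-detection delay, assuming a node is declared failed
/// after `missed_rounds` consecutive gossip intervals without contact.
fn detection_delay(gossip_interval: Duration, missed_rounds: u32) -> Duration {
    gossip_interval * missed_rounds
}
```

Under that assumption, a 2s interval with 2 to 4 missed rounds gives the 4-8s range, so doubling the interval also doubles the worst-case detection time.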
### 4. Increase Worker Pool Size
**Impact**: Linear throughput scaling
```rust
// Before: 3 workers → 150K RPS
// After: 5 workers → 250K+ RPS
// Each worker adds ~50K RPS capacity
```
**Guidelines**:
- Add workers until you hit a network or director bottleneck
- Monitor director CPU and scale the director if it exceeds 80%
- Ensure network bandwidth is sufficient
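
As a rough sizing aid, the linear-scaling rule can be turned into a one-line estimate. This is a sketch that assumes throughput really does scale linearly at a fixed per-worker rate until the director or network saturates; `workers_needed` is a hypothetical helper, not part of RpcNet.

```rust
/// Number of workers needed for `target_rps`, assuming each worker
/// contributes `per_worker_rps` and scaling stays linear.
fn workers_needed(target_rps: u64, per_worker_rps: u64) -> u64 {
    assert!(per_worker_rps > 0, "per-worker capacity must be non-zero");
    // Ceiling division: a partially used worker still has to exist.
    (target_rps + per_worker_rps - 1) / per_worker_rps
}
```

Using the ~50K RPS per worker figure above, a 250K RPS target needs 5 workers. Measure the per-worker rate on your own hardware before relying on the estimate.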
## Detailed Tuning
### Connection Management Optimization
RpcNet manages connections automatically, and the default configuration is already tuned for most workloads, so it is the recommended starting point:
```rust
use rpcnet::cluster::ClusterClientConfig;
// Default configuration is optimized for most use cases
let config = ClusterClientConfig::default();
```
### QUIC Tuning
#### Stream Limits
```rust
use rpcnet::ServerConfig;

let config = ServerConfig::builder()
    .with_max_concurrent_streams(100) // More streams = higher throughput
    .with_max_stream_bandwidth(10 * 1024 * 1024) // 10 MB/s per stream
    .build();
```
**Guidelines**:
- **max_concurrent_streams**: Set to expected concurrent requests + 20%
- **max_stream_bandwidth**: Set based on your largest message size
#### Congestion Control
```rust
// Aggressive (high-bandwidth networks)
.with_congestion_control(CongestionControl::Cubic)
// Conservative (variable networks)
.with_congestion_control(CongestionControl::NewReno)
// Recommended default
.with_congestion_control(CongestionControl::Bbr) // Best overall
```
### TLS Optimization
#### Session Resumption
```rust
// Enable TLS session tickets for 0-RTT
let config = ServerConfig::builder()
    .with_cert_and_key(cert, key)?
    .with_session_tickets_enabled(true) // ← Enables 0-RTT
    .build();
```
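**Impact**: On reconnect, the first request can be sent as 0-RTT data in the client's first flight instead of waiting for a full handshake (1 RTT for QUIC, compared with 2-3 RTT for TCP plus TLS)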
#### Cipher Suite Selection
```rust
// Prefer fast ciphers (AES-GCM with hardware acceleration)
.with_cipher_suites(&[
    CipherSuite::TLS13_AES_128_GCM_SHA256,       // Fast with AES-NI
    CipherSuite::TLS13_CHACHA20_POLY1305_SHA256, // Good on CPUs without AES acceleration (e.g. some ARM)
])
```
### Message Serialization
#### Use Efficient Formats
```rust
// Fastest: bincode (binary)
use bincode;
let bytes = bincode::serialize(&data)?;
// Fast: rmp-serde (MessagePack)
use rmp_serde;
let bytes = rmp_serde::to_vec(&data)?;
// Slower: serde_json (human-readable, but slower)
let bytes = serde_json::to_vec(&data)?;
```
**Benchmark** (10KB struct):

| Format | Serialize | Deserialize | Size |
|--------|-----------|-------------|------|
| **bincode** | 12 μs | 18 μs | 10240 bytes |
| **MessagePack** | 28 μs | 35 μs | 9800 bytes |
| **JSON** | 85 μs | 120 μs | 15300 bytes |
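
To see why the format matters at high request rates, multiply the per-message cost by the request rate. The sketch below assumes the two timing columns are the serialize and deserialize times for one 10KB message and that each is paid once per request somewhere in the cluster; `cores_busy` is a hypothetical helper.

```rust
use std::time::Duration;

/// CPU cores kept busy by a fixed per-request cost, assuming the work
/// is paid once per request and spreads evenly across cores.
fn cores_busy(rps: f64, cost_per_request: Duration) -> f64 {
    rps * cost_per_request.as_secs_f64()
}
```

Under those assumptions, 172K RPS of 10KB JSON messages (85 + 120 μs each) keeps roughly 35 cores busy on serialization alone, versus about 5 cores for bincode (12 + 18 μs).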
#### Minimize Allocations
```rust
// ❌ Bad: Multiple allocations
fn build_request(id: u64, data: Vec<u8>) -> Request {
    Request {
        id: id.to_string(), // Allocation
        timestamp: SystemTime::now(),
        payload: format!("data-{}", String::from_utf8_lossy(&data)), // Multiple allocations
    }
}

// ✅ Good: Reuse buffers
fn build_request(id: u64, data: &[u8], buffer: &mut Vec<u8>) -> Request {
    buffer.clear();
    buffer.extend_from_slice(b"data-");
    buffer.extend_from_slice(data);

    Request {
        id,
        timestamp: SystemTime::now(),
        payload: buffer.clone(), // Single allocation
    }
}
```
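
Buffer reuse works because `Vec::clear` drops the contents but keeps the allocated capacity. The following std-only sketch isolates that step; `fill_payload` is a hypothetical helper that mirrors the buffer handling in the example above.

```rust
/// Fills `buffer` with "data-" followed by `data`, reusing its allocation.
fn fill_payload(buffer: &mut Vec<u8>, data: &[u8]) {
    buffer.clear(); // Length becomes 0, capacity is unchanged
    buffer.extend_from_slice(b"data-");
    buffer.extend_from_slice(data);
}
```

As long as each payload fits in the existing capacity, repeated calls write into the same heap block, so the only allocation left per request is the final copy into the message.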
## Platform-Specific Optimizations
### Linux
#### UDP/QUIC Tuning
```bash
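# Increase maximum socket buffer sizes (these limits apply to the UDP sockets QUIC uses)
sudo sysctl -w net.core.rmem_max=536870912
sudo sysctl -w net.core.wmem_max=536870912
# TCP buffer limits (no effect on QUIC; relevant only for TCP traffic on the same host)
sudo sysctl -w net.ipv4.tcp_rmem='4096 87380 536870912'
sudo sysctl -w net.ipv4.tcp_wmem='4096 87380 536870912'
# Increase the backlog queue for packets that arrive faster than the kernel can process them
sudo sysctl -w net.core.netdev_max_backlog=5000
# Increase the connection tracking table (requires the nf_conntrack module to be loaded)
sudo sysctl -w net.netfilter.nf_conntrack_max=1000000
# Make permanent: add the key=value settings to /etc/sysctl.conf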
```
#### CPU Affinity
```rust
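// Pin the current thread to a specific CPU core.
// Does nothing if the core list is unavailable or `core_id` is out of range.
fn pin_to_core(core_id: usize) {
    let core = core_affinity::get_core_ids().and_then(|ids| ids.get(core_id).copied());
    if let Some(core) = core {
        core_affinity::set_for_current(core);
    }
}

// Usage in worker startup: this pins the blocking-pool thread that runs the closure
tokio::task::spawn_blocking(|| {
    pin_to_core(0); // Pin to CPU 0
    // Worker processing logic
});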
```
### macOS
#### Increase File Descriptors
```bash
# Check current limits
ulimit -n
# Increase (temporary)
ulimit -n 65536
# Make permanent: add to ~/.zshrc or ~/.bash_profile
echo "ulimit -n 65536" >> ~/.zshrc
```
### Profiling and Monitoring
#### CPU Profiling
```bash
# Install perf (Linux)
sudo apt install linux-tools-common linux-tools-generic
# Profile RpcNet application
sudo perf record -F 99 -a -g -- cargo run --release --bin worker
sudo perf report
# Identify hot paths and optimize
```
#### Memory Profiling
```bash
# Use valgrind for memory analysis
cargo build --release
valgrind --tool=massif --massif-out-file=massif.out ./target/release/worker
# Print a text summary (massif-visualizer can open the same file for a graphical view)
ms_print massif.out
```
#### Tokio Console
```toml
# Add to Cargo.toml
[dependencies]
console-subscriber = "0.2"
```
```rust
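// In main.rs
// Tokio must be built with RUSTFLAGS="--cfg tokio_unstable" and its "tracing"
// feature enabled, otherwise tokio-console has no task data to show.
console_subscriber::init();
// Run the application, then connect with tokio-console:
// cargo install tokio-console
// tokio-console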
```
## Benchmarking
### Throughput Test
```rust
use std::sync::Arc;
use std::time::Instant;

// Sequential benchmark: only one request is in flight at a time, so the
// result is bounded by round-trip latency rather than cluster capacity.
async fn benchmark_throughput(client: Arc<ClusterClient>, duration_secs: u64) {
    let start = Instant::now();
    let mut count: u64 = 0;

    while start.elapsed().as_secs() < duration_secs {
        match client.call_worker("compute", vec![], Some("role=worker")).await {
            Ok(_) => count += 1,
            Err(e) => eprintln!("Request failed: {}", e),
        }
    }

    let elapsed = start.elapsed().as_secs_f64();
    let rps = count as f64 / elapsed;
    println!("Throughput: {:.0} requests/second", rps);
    println!("Total requests: {}", count);
    println!("Duration: {:.2}s", elapsed);
}
```
### Latency Test
```rust
use hdrhistogram::Histogram;
use std::sync::Arc;
use std::time::Instant;

async fn benchmark_latency(client: Arc<ClusterClient>, num_requests: usize) {
    // 3 significant digits of precision
    let mut histogram = Histogram::<u64>::new(3).unwrap();

    for _ in 0..num_requests {
        let start = Instant::now();
        let _ = client.call_worker("compute", vec![], Some("role=worker")).await;
        let latency_us = start.elapsed().as_micros() as u64;
        histogram.record(latency_us).unwrap();
    }

    println!("Latency percentiles (μs):");
    println!("  P50:   {}", histogram.value_at_quantile(0.50));
    println!("  P90:   {}", histogram.value_at_quantile(0.90));
    println!("  P99:   {}", histogram.value_at_quantile(0.99));
    println!("  P99.9: {}", histogram.value_at_quantile(0.999));
    println!("  Max:   {}", histogram.max());
}
```
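
For readers unfamiliar with what a percentile report means, the following std-only sketch computes an exact nearest-rank percentile over raw samples. It is illustrative only: `hdrhistogram` uses bucketed values with bounded error rather than sorting every sample, and `percentile` is a hypothetical helper.

```rust
/// Nearest-rank percentile over latency samples (microseconds).
/// Returns None for an empty slice. `quantile` is expected in (0.0, 1.0].
fn percentile(samples: &[u64], quantile: f64) -> Option<u64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Nearest rank: the smallest sample with at least `quantile` of all
    // samples at or below it.
    let rank = (quantile * sorted.len() as f64).ceil() as usize;
    let index = rank.clamp(1, sorted.len()) - 1;
    Some(sorted[index])
}
```

P99 is therefore the latency that 99% of requests stayed at or under, which is why it reacts to rare slow requests that P50 hides.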
### Load Test Script
```rust
use std::sync::Arc;
use std::time::Instant;

// Concurrent load test
async fn load_test(
    client: Arc<ClusterClient>,
    num_concurrent: usize,
    requests_per_task: usize,
) {
    let start = Instant::now();

    // Spawn one task per simulated client; each sends its requests sequentially
    let tasks: Vec<_> = (0..num_concurrent)
        .map(|_| {
            let client = client.clone();
            tokio::spawn(async move {
                for _ in 0..requests_per_task {
                    let _ = client.call_worker("compute", vec![], Some("role=worker")).await;
                }
            })
        })
        .collect();

    for task in tasks {
        task.await.unwrap();
    }

    let elapsed = start.elapsed().as_secs_f64();
    let total_requests = num_concurrent * requests_per_task;
    let rps = total_requests as f64 / elapsed;

    println!("Load test results:");
    println!("  Concurrency: {}", num_concurrent);
    println!("  Total requests: {}", total_requests);
    println!("  Duration: {:.2}s", elapsed);
    println!("  Throughput: {:.0} RPS", rps);
}
```
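
When choosing `num_concurrent`, Little's law gives a useful bound: in a closed-loop test, throughput cannot exceed the number of in-flight requests divided by the mean latency. The helpers below are a hypothetical sketch of that arithmetic and make no assumptions about RpcNet itself.

```rust
use std::time::Duration;

/// Little's law: throughput = in-flight requests / mean latency.
/// Upper bound on RPS for a closed-loop test with `concurrency` tasks.
fn expected_rps(concurrency: usize, mean_latency: Duration) -> f64 {
    concurrency as f64 / mean_latency.as_secs_f64()
}

/// Concurrency needed to sustain `target_rps` at the given mean latency.
fn required_concurrency(target_rps: f64, mean_latency: Duration) -> usize {
    (target_rps * mean_latency.as_secs_f64()).ceil() as usize
}
```

This is why a single sequential loop reports far less than the cluster can handle: at 0.5ms per request, one task tops out near 2,000 RPS, so reaching six-figure throughput needs many requests in flight at once.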
## Performance Checklist
### Before Production
- [ ] Use default connection management (already optimized)
- [ ] Use Least Connections load balancing
- [ ] Tune gossip interval for your network
- [ ] Configure QUIC stream limits
- [ ] Enable TLS session resumption
- [ ] Profile with release build (`--release`)
- [ ] Test under expected peak load
- [ ] Monitor CPU, memory, network utilization
- [ ] Set up latency tracking (P50, P99, P99.9)
- [ ] Configure OS-level network tuning
### Monitoring in Production
```rust
// Essential metrics to track
metrics::gauge!("rpc.throughput_rps", current_rps);
metrics::gauge!("rpc.latency_p50_us", latency_p50);
metrics::gauge!("rpc.latency_p99_us", latency_p99);
metrics::gauge!("rpc.cpu_usage_pct", cpu_usage);
metrics::gauge!("rpc.memory_mb", memory_mb);
metrics::gauge!("pool.hit_rate", pool_hit_rate);
metrics::gauge!("cluster.healthy_workers", healthy_count);
```
## Troubleshooting Performance Issues
### High Latency
**Symptoms**: P99 latency > 10ms
**Debug**:
```rust
// Add timing to identify bottleneck
let start = Instant::now();
let select_time = Instant::now();
let worker = registry.select_worker(Some("role=worker")).await?;
println!("Worker selection: {:?}", select_time.elapsed());
let connect_time = Instant::now();
let conn = pool.get_or_connect(worker.addr).await?;
println!("Connection: {:?}", connect_time.elapsed());
let call_time = Instant::now();
let result = conn.call("compute", data).await?;
println!("RPC call: {:?}", call_time.elapsed());
println!("Total: {:?}", start.elapsed());
```
**Common causes**:
- Connection management issues (check network configuration)
- Slow workers (check worker CPU/memory)
- Network latency (move closer or add local workers)
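
If you time stages like this in more than one place, a small helper keeps the bookkeeping out of the request path. The `StageTimer` type below is a hypothetical, std-only sketch; because `mark` is an ordinary method, it can be called between `.await` points.

```rust
use std::time::{Duration, Instant};

/// Records how long each stage of a request took.
struct StageTimer {
    last: Instant,
    stages: Vec<(&'static str, Duration)>,
}

impl StageTimer {
    fn new() -> Self {
        StageTimer { last: Instant::now(), stages: Vec::new() }
    }

    /// Closes the current stage, storing the time since the previous mark.
    fn mark(&mut self, label: &'static str) {
        let now = Instant::now();
        self.stages.push((label, now.duration_since(self.last)));
        self.last = now;
    }

    /// The stage that took the longest, if any were recorded.
    fn slowest(&self) -> Option<(&'static str, Duration)> {
        self.stages.iter().copied().max_by_key(|&(_, elapsed)| elapsed)
    }
}
```

Call `mark("select")`, `mark("connect")`, and `mark("call")` after the corresponding steps, then log `slowest()` for requests that exceed your latency budget.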
### Low Throughput
**Symptoms**: < 100K RPS with multiple workers
**Debug**:
```rust
// Check bottlenecks
println!("Pool metrics: {:?}", pool.metrics());
println!("Worker count: {}", registry.worker_count().await);
println!("Active connections: {}", pool.active_connections());
```
**Common causes**:
- Too few workers (add more)
- Network connectivity issues (check network configuration)
- Director CPU saturated (scale director)
- Network bandwidth limit (upgrade network)
### High CPU Usage
**Symptoms**: > 80% CPU at low load
**Debug**:
```bash
# Profile with perf
sudo perf record -F 99 -a -g -- cargo run --release
sudo perf report
# Look for hot functions
```
**Common causes**:
- Too frequent gossip (increase interval)
- Excessive serialization (optimize message format)
- Inefficient connection handling (use latest RpcNet version)
- Debug build instead of release
## Real-World Results
### Case Study: Video Transcoding Cluster
**Setup**:
- 1 director
- 10 GPU workers
- 1000 concurrent clients
**Before tuning**: 45K RPS, 15ms P99 latency
**After tuning**: 180K RPS, 2ms P99 latency
**Changes**:
1. Used optimized connection management
2. Tuned gossip interval (1s → 2s)
3. Used Least Connections strategy
4. Optimized message serialization (JSON → bincode)
## Next Steps
- **[Production Guide](production.md)** - Deploy optimized clusters
- **[Load Balancing](../cluster/load-balancing.md)** - Strategy selection
## References
- [RFC 9000: QUIC](https://datatracker.ietf.org/doc/html/rfc9000) - Transport protocol specification
- [Linux Network Tuning](https://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php) - OS-level tuning
- [Tokio Performance](https://tokio.rs/tokio/topics/performance) - Async runtime tips