spargio 0.5.13

Work-stealing async runtime for Rust built on io_uring and msg_ring
# Performance Guide

This chapter is for tuning your application on Spargio, not for benchmarking Spargio itself.
The main rule: measure your workload shape, then tune based on evidence.

## Measure What Users Feel

Track these together:

- latency (`p50`, `p95`, `p99`) for your user-facing operations
- throughput under realistic concurrency and payload sizes
- error/backpressure behavior under sustained load
- scheduler counters from `RuntimeHandle::stats_snapshot()` (especially locality and steal ratios)

Do not optimize for a single number in isolation.
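To make tail percentiles concrete, here is a minimal, runtime-agnostic sketch of nearest-rank percentile computation over collected latency samples. The `percentile` helper is hypothetical and not part of spargio; in practice you would feed it per-request latencies from your workload driver.

```rust
// Nearest-rank percentile over a sorted sample (hypothetical helper).
fn percentile(sorted: &[u64], p: f64) -> u64 {
    let idx = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[idx.saturating_sub(1).min(sorted.len() - 1)]
}

fn main() {
    // Stand-in sample: 1..=1000 microseconds, already sorted.
    let mut latencies_us: Vec<u64> = (1..=1000).collect();
    latencies_us.sort_unstable();
    println!(
        "p50={} p95={} p99={}",
        percentile(&latencies_us, 50.0),
        percentile(&latencies_us, 95.0),
        percentile(&latencies_us, 99.0),
    );
    // → p50=500 p95=950 p99=990
}
```

Track all three together: a change that improves `p50` can still regress `p99`.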

## Use A Representative Workload

Before touching runtime knobs, make sure your benchmarks reflect the shape of production traffic:

- real request mix (read/write ratio, endpoint mix, fanout patterns)
- realistic payload distributions (not only tiny happy-path payloads)
- realistic concurrency (steady plus burst periods)
- realistic topology (same shard count and affinity strategy you expect in deployment)

If your workload is skewed (hot keys, hot connections), reproduce that skew in your benchmark runs as well.
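One way to bake skew into a workload driver is a weighted key picker. This is a hedged sketch with a hypothetical `pick_key` helper (not part of spargio): roughly 90% of rolls land on a small hot set, the rest spread across the cold set.

```rust
// Hypothetical skew helper: ~90% of rolls hit the hot key set.
fn pick_key<'a>(roll: u32, hot: &[&'a str], cold: &[&'a str]) -> &'a str {
    if roll % 10 < 9 {
        hot[roll as usize % hot.len()]
    } else {
        cold[roll as usize % cold.len()]
    }
}

fn main() {
    let hot = ["user:42", "user:7"];
    let cold = ["user:1001", "user:1002", "user:1003"];
    // Count how many of 1000 deterministic rolls hit the hot set.
    let hot_hits = (0..1000u32)
        .filter(|&r| hot.contains(&pick_key(r, &hot, &cold)))
        .count();
    println!("hot fraction: {:.2}", hot_hits as f64 / 1000.0);
    // → hot fraction: 0.90
}
```

A real driver would randomize rolls and draw keys from production-like distributions, but even a crude 90/10 split exposes locality and steal behavior that uniform traffic hides.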

## Work-Stealing Controls and Profiling

Tune steal controls only after placement and API-level choices are reasonable.

Primary knobs:

- `RuntimeBuilder::steal_budget(...)`: how much stealable work a shard drains per pass
- `RuntimeBuilder::steal_victim_stride(...)`: how victim scans rotate across shards
- `RuntimeBuilder::steal_victim_probe_count(...)`: how many victims are sampled per scan
- `RuntimeBuilder::steal_batch_size(...)`: maximum tasks moved per successful steal
- `RuntimeBuilder::steal_locality_margin(...)`: how strongly locality is favored over migration
- `RuntimeBuilder::steal_fail_cost(...)`: penalty after repeated low-value scans
- `RuntimeBuilder::steal_backoff_min(...)` / `steal_backoff_max(...)`: adaptive cooldown bounds
- `RuntimeBuilder::stealable_queue_capacity(...)`: enqueue-side backpressure threshold
- `RuntimeBuilder::stealable_queue_backend(...)`: shared-queue backend selection

```rust
#[spargio::main]
async fn main(_handle: spargio::RuntimeHandle) -> Result<(), spargio::RuntimeError> {
    let builder = spargio::Runtime::builder()
        .shards(4)
        .steal_budget(64)
        .steal_victim_stride(2)
        .steal_victim_probe_count(3)
        .steal_batch_size(8)
        .steal_locality_margin(2)
        .steal_fail_cost(3)
        .steal_backoff_min(2)
        .steal_backoff_max(64)
        .stealable_queue_capacity(16_384);

    spargio::run_with(builder, |handle| async move {
        let stats = handle.stats_snapshot();
        println!(
            "local_hit_ratio={:.2} stolen_per_scan={:.2} steal_success_rate={:.2}",
            stats.local_hit_ratio(),
            stats.stolen_per_scan(),
            stats.steal_success_rate(),
        );
    })
    .await
}
```

What this does:

- sets explicit steal-policy values at runtime startup
- runs your workload under those settings
- reads scheduler counters so you can judge whether changes improved locality/throughput tradeoffs

For day-2 monitoring and rollout counters, see [Operations Guide](13_operations_guide.md).

## Profile Before Guessing

Use system profilers to confirm where time and cache misses are going.

Example cachegrind flow:

```bash
cargo build --release
valgrind --tool=cachegrind --cachegrind-out-file=cachegrind.out \
  ./target/release/your_app
cg_annotate cachegrind.out | head -n 120
```

Example callgrind flow:

```bash
valgrind --tool=callgrind --callgrind-out-file=callgrind.out \
  ./target/release/your_app
callgrind_annotate callgrind.out | head -n 80
```

Example Linux perf flow:

```bash
perf stat -d ./target/release/your_app
perf record -g ./target/release/your_app
perf report
```

Use these tools with the same workload driver you use for latency/throughput testing.

## Suggested Tuning Loop

1. capture a baseline on representative load
2. change one thing (placement, knob, buffering, batching, etc.)
3. re-run the same load and compare latency, throughput, and scheduler counters
4. profile with cachegrind/perf if the result is unclear
5. keep only changes that improve your target metrics without hurting critical tails
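Step 5 can be made mechanical. The sketch below shows a hypothetical acceptance rule (the `RunMetrics` struct and `keep_change` function are illustrative, not spargio API): keep a change only if throughput improves and the `p99` tail does not regress beyond a tolerance factor.

```rust
// Metrics captured from one benchmark run (illustrative).
struct RunMetrics {
    p99_us: u64, // tail latency in microseconds
    rps: f64,    // sustained requests per second
}

// Keep a tuning change only if throughput improves and the p99 tail
// stays within `tail_tolerance` of baseline (1.05 = at most 5% worse).
fn keep_change(baseline: &RunMetrics, candidate: &RunMetrics, tail_tolerance: f64) -> bool {
    let throughput_improved = candidate.rps > baseline.rps;
    let tail_acceptable =
        (candidate.p99_us as f64) <= baseline.p99_us as f64 * tail_tolerance;
    throughput_improved && tail_acceptable
}

fn main() {
    let baseline = RunMetrics { p99_us: 2_000, rps: 45_000.0 };
    let candidate = RunMetrics { p99_us: 2_080, rps: 48_500.0 };
    // ~+7.8% throughput, +4% p99: acceptable under a 5% tail tolerance.
    println!("keep: {}", keep_change(&baseline, &candidate, 1.05));
    // → keep: true
}
```

Encoding the rule keeps run-to-run comparisons honest: the same threshold decides every change, instead of ad-hoc judgment after each run.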

## Common Mistakes

- tuning steal knobs before verifying placement strategy
- comparing runs with different load shapes or machine settings
- optimizing mean latency while worsening `p95`/`p99`
- making multiple tuning changes at once, then not knowing which one helped

## Further Reading

- The Rust Performance Book: <https://nnethercote.github.io/perf-book/>