# Bench Harness

Script-driven end-to-end workloads for benchmarking `bytes-handoff` under a
content-routed stream shape.

## Why Not `benches/`?

The Criterion benches in `../benches/` isolate individual operations: read
buffering, splitting, write fan-in, and backpressure. This harness is different:

- It drives complete streams through client, proxy, and sink tasks.
- It combines fragmented ingress, routed prefixes, tunnel mode switching,
  preserved tail bytes, and `WriteHandoff` output.
- It emits machine-readable run artifacts for repeated comparison.
- It measures sustained throughput and CPU cost for a workload shape, not one
  operation in isolation.

## Quick Start

```bash
./bench/run-stream-harness.sh --scenario fragmented --runs 5
./bench/run-stream-harness.sh --scenario coalesced --runs 5
./bench/run-stream-harness.sh --transport tcp --scenario fragmented --runs 5
./bench/run-stream-harness.sh --transport tcp --implementation manual_vec --scenario fragmented --runs 5
./bench/run-stream-harness.sh --scenario fragmented --completion fire_and_forget --runs 5
./bench/run-stream-harness.sh --scenario all --worker-threads 8 --connections 128 --runs 5
```

Each run writes `handoff-run-*.txt` files and a `summary.csv` under
`bench/results/stream_<transport>_<implementation>_<scenario>_<completion>_<timestamp>/`.
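
To eyeball a finished run, the CSV can be dumped directly. A minimal sketch,
assuming only that `summary.csv` is comma-separated:

```bash
# Pretty-print the most recent run's summary as aligned columns.
latest=$(ls -td bench/results/stream_* | head -1)
column -s, -t "$latest/summary.csv"
```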

For a decision matrix across implementation, arrival pattern, frame size, and
concurrency:

```bash
./bench/run-stream-matrix.sh
```

The matrix is configured with environment variables so the run shape is explicit:

```bash
IMPLEMENTATIONS="handoff manual_vec raw_copy" \
SCENARIOS="fragmented coalesced" \
CONNECTIONS="1 4 16 64 256" \
FRAME_SIZES="64 256 1024 4096 16384 65536 1048576" \
RUNS=3 \
./bench/run-stream-matrix.sh
```

For CPU-cost comparisons on Linux, prefer split-process TCP runs with the proxy
service pinned to cores disjoint from the driver and sink. Example:

```bash
./bench/run-stream-harness.sh \
  --transport tcp \
  --implementation handoff \
  --scenario all \
  --completion fire_and_forget \
  --worker-threads 16 \
  --connections 128 \
  --runs 2 \
  --route-frames 64 \
  --frame-len 63 \
  --tunnel-bytes 1048576 \
  --read-reserve 16384 \
  --duration-seconds 10 \
  --service-cores 0-7,16-23 \
  --driver-cores 8-15,24-31
```

Run the same command with `--implementation manual_vec` and
`--implementation raw_copy` for the comparison table. The service-side logs are
`service-run-*.txt`; the driver-observed logs are `handoff-run-*.txt`.
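
One way to script that sweep, using only the flags shown above:

```bash
# Repeat the pinned-core run once per implementation for the comparison table.
for impl in handoff manual_vec raw_copy; do
  ./bench/run-stream-harness.sh \
    --transport tcp \
    --implementation "$impl" \
    --scenario all \
    --completion fire_and_forget \
    --worker-threads 16 \
    --connections 128 \
    --runs 2 \
    --service-cores 0-7,16-23 \
    --driver-cores 8-15,24-31
done
```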

## Workload

Each simulated connection (a shell sketch of the stream shape follows the list):

1. Writes many fixed-size route frames in configurable fragments.
2. Sends a `TUNNEL\n` marker.
3. Streams an opaque tunnel payload.
4. Proxies routed prefixes and tunnel bytes through `HandoffBuffer` and
   `WriteHandoff`.
5. Drains a sink and asserts the exact byte count.
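
For orientation, the byte stream one connection emits can be sketched in plain
shell. This is a hedged illustration, not the harness's framing code: the exact
frame layout lives in the harness source, and the sizes below simply echo the
flags from the pinned-core example above.

```bash
# Hedged wire-shape sketch; real framing is defined by the harness, not here.
route_frames=64       # --route-frames
frame_len=63          # --frame-len
tunnel_bytes=1048576  # --tunnel-bytes
for _ in $(seq "$route_frames"); do
  head -c "$frame_len" /dev/zero | tr '\0' 'A'   # one fixed-size route frame
done
printf 'TUNNEL\n'                                # mode-switch marker
head -c "$tunnel_bytes" /dev/urandom             # opaque tunnel payload
```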

The axes combine a transport, an implementation, and a scenario:

Transports (`--transport`):

- `duplex`: in-memory transport for low-overhead regression checks.
- `tcp`: localhost TCP transport with client, proxy, and sink sockets.

Implementations (`--implementation`):

- `handoff`: `HandoffBuffer` plus `WriteHandoff`; this is the crate path.
- `manual_vec`: `Vec<u8>` parser with direct writes; this is the realistic
  baseline you might write by hand.
- `raw_copy`: unparsed async copy; this is a lower bound, not a semantic peer.

Scenarios (`--scenario`):

- `fragmented`: small client writes, default 64-byte fragments.
- `coalesced`: larger client writes, default fragment size equal to
  `read_reserve`.

Useful fields in each run output:

- `mib_per_sec`
- `streams_per_sec`
- `cpu_avg_cores`
- `cpu_ns_per_byte`
- `latency_p50_micros`
- `latency_p95_micros`
- `latency_p99_micros`
- `latency_p999_micros`
- `latency_max_micros`
- `voluntary_context_switches`
- `involuntary_context_switches`
- `max_rss_bytes`
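
To skim those numbers across many runs, a grep along these lines works; the
pattern assumes the field names appear verbatim in the text logs, so check one
`handoff-run-*.txt` for the real layout first:

```bash
# Surface throughput, per-byte CPU cost, and tail latency for every saved run.
grep -E 'mib_per_sec|cpu_ns_per_byte|latency_p99_micros' \
  bench/results/stream_*/handoff-run-*.txt
```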

Use `--duration-seconds 60` for sustained runs that repeat the connection wave
inside one process. That is the shape to use when watching RSS and allocator
behavior over time.
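
One rough way to watch RSS from the outside during such a run (Linux; assumes
the binary name `bench_stream_harness` from the Direct Binary section below):

```bash
./bench/run-stream-harness.sh --scenario fragmented --duration-seconds 60 --runs 1 &
# Sample resident set size once a second until the run exits; rss is in KiB.
while pids=$(pgrep -d, -f bench_stream_harness); do
  ps -o pid=,rss=,comm= -p "$pids"
  sleep 1
done
wait
```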

When reading results:

- `mib_per_sec` in the driver log is end-to-end throughput through the complete
  client/proxy/sink path.
- `mib_per_sec` in the service log is the proxy service's measured throughput.
- `cpu_avg_cores` in the service log is the proxy CPU cost; this is the number
  to compare when asking whether `bytes-handoff` costs more CPU than a manual
  implementation.
- `raw_copy` is a lower bound. It intentionally skips the route parsing and
  owned-prefix handoff work that `handoff` performs.
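
When turning those logs into a comparison, the quantity that matters is the
relative per-byte CPU cost. A tiny helper with placeholder inputs (the two
values below are illustrative, not measured; substitute real `cpu_ns_per_byte`
readings from the service logs):

```bash
# Illustrative placeholders only; replace with measured cpu_ns_per_byte values.
handoff_ns=12.0
manual_ns=10.0
awk -v h="$handoff_ns" -v m="$manual_ns" \
  'BEGIN { printf "handoff CPU overhead vs manual_vec: %+.1f%%\n", (h / m - 1) * 100 }'
```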

## Criterion Baseline Discipline

Use the baseline wrapper instead of ad hoc one-second Criterion runs:

```bash
./bench/run-criterion-baseline.sh
```

Defaults are `--sample-size 100 --measurement-time 5 --warm-up-time 3`. On
Linux, set `TASKSET_CORES=2-5` to pin the process and `PERF_STAT=1` to collect
cycles, instructions, cache misses, context switches, and CPU migrations.
Frequency governor and turbo policy still need to be controlled on the host.
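
Putting those pieces together on a Linux host (the governor and turbo commands
assume `cpupower` and an `intel_pstate` machine; adjust for yours):

```bash
# Host knobs the wrapper does not manage: governor and turbo policy.
sudo cpupower frequency-set -g performance
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# Pinned, perf-instrumented baseline run using the documented variables.
TASKSET_CORES=2-5 PERF_STAT=1 ./bench/run-criterion-baseline.sh
```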

## Direct Binary

```bash
cargo run --release --features bench-tools --bin bench_stream_harness -- \
  --transport tcp \
  --implementation handoff \
  --scenario fragmented \
  --worker-threads 8 \
  --connections 64
```