# Performance
> Baseline numbers measured on a developer workstation (x86_64
> Linux, recent stable Rust, `--release`). Your mileage will
> vary; the *relationships* between configurations matter more
> than the absolute numbers.

flowscope ships a criterion-driven bench harness under `benches/`.
Each group exercises one layer of the pipeline so future perf
work has a baseline to regress-test against. Numbers here are a
point-in-time snapshot from the 0.3.0 release; re-run locally
with:
```sh
cargo bench --all-features
cargo bench --all-features --bench tracker # just one group
```
Criterion writes HTML reports to `target/criterion/`; open
`target/criterion/index.html` to compare runs.
## Methodology
- One bench file per layer: `extractor`, `tracker`,
`reassembler`, `session_driver`, `dedup`.
- Each `bench_function` measures one hot-path call; Criterion
  iterates for ~5 seconds and applies statistical analysis.
- Times are reported in nanoseconds per call. Multiply by your
  packets-per-second target to estimate CPU load.
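The "multiply by your pps target" rule above, worked through once (the
340 ns and 200k pps figures are illustrative, not a recommendation):

```rust
fn main() {
    let ns_per_call = 340.0_f64; // e.g. the tracker/n_flows/100 baseline
    let pps = 200_000.0_f64;     // your packets-per-second target
    // ns spent per second, divided by the 1e9 ns in a second of CPU time.
    let core_fraction = ns_per_call * pps / 1e9;
    println!("{:.1}% of one core", core_fraction * 100.0); // 6.8%
}
```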
## Baseline (0.3.0)
### Extractor
| Benchmark | Time |
| --- | --- |
| `extractor/five_tuple_ipv4_tcp` | ~115 ns |
| `extractor/five_tuple_ipv4_udp` | ~110 ns |

`FiveTuple` parsing is the floor under every other layer — every
packet pays it once. At ~115 ns per parse that works out to
roughly 9M parses/sec/core on modern x86.
### Tracker (Plan 41 hot-cache validation)
| Benchmark | Time | vs monoflow |
| --- | --- | --- |
| `tracker/monoflow` | ~315 ns | 1.00x |
| `tracker/n_flows/10` | ~330 ns | 1.05x |
| `tracker/n_flows/100` | ~340 ns | 1.08x |
| `tracker/n_flows/1000` | ~375 ns | 1.19x |
| `tracker/n_flows/10000` | ~450 ns | 1.43x |

The hot-cache fast path (Plan 41) is observable: the 10k-flow
round-robin case, where every packet misses the cache, is ~1.43×
slower than monoflow. Real-world traffic is bursty per flow, so
the per-burst stickiness recovers most of the win on
heterogeneous workloads.
The original Plan 41 "~2× on monoflow" estimate was optimistic;
the actual win is ~1.4× when comparing the cache-hit path
against a 10k-flow cache-miss baseline. Plenty good for the
~50 LOC cost.
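For intuition, a minimal sketch of the hot-cache idea described above —
remember the last `(key, slot)` pair and skip the map lookup when
consecutive packets belong to the same flow. The names (`Tracker`,
`lookup`) are illustrative, not flowscope's actual API:

```rust
use std::collections::HashMap;

struct Tracker {
    flows: HashMap<u64, usize>, // flow key -> slot index
    hot: Option<(u64, usize)>,  // last key seen and its slot
}

impl Tracker {
    fn lookup(&mut self, key: u64) -> usize {
        // Fast path: monoflow / per-burst traffic hits this every time.
        if let Some((k, idx)) = self.hot {
            if k == key {
                return idx;
            }
        }
        // Slow path: full map lookup (insert on miss), then refresh the cache.
        let next = self.flows.len();
        let idx = *self.flows.entry(key).or_insert(next);
        self.hot = Some((key, idx));
        idx
    }
}

fn main() {
    let mut t = Tracker { flows: HashMap::new(), hot: None };
    assert_eq!(t.lookup(7), 0); // miss: map insert, cache refreshed
    assert_eq!(t.lookup(7), 0); // hit: served from the hot slot
    assert_eq!(t.lookup(9), 1); // different flow: back to the slow path
}
```

The round-robin benchmark defeats exactly this cache: with 10k flows
visited in rotation, the `hot` slot never matches, so every call pays
the full map lookup.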
### Reassembler
| Benchmark | Time |
| --- | --- |
| `reassembler/in_order_1500_uncapped` | ~87 ns |
| `reassembler/in_order_1500_capped_1m` | ~87 ns |
| `reassembler/sliding_window_overflow` | ~50–100 ns (varies) |
| `reassembler/ooo_drops` | ~25 ns |
| `reassembler/drop_flow_poisoned` | ~5 ns |

The cap check (Plan 42 §1) costs nothing measurable on the
under-cap hot path, which confirms the design goal. The
poisoned-reassembler path is essentially free (a flag check plus
an early return) because the segment is dropped without any
buffer manipulation.
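A minimal sketch of the cap-check-plus-poison pattern just described
(names and fields are illustrative, not flowscope's API): the poisoned
path is a single flag check, and the cap only costs one comparison on
the under-cap hot path:

```rust
struct Reassembler {
    buf: Vec<u8>,
    cap: Option<usize>, // optional cap on buffered bytes
    poisoned: bool,     // set once the flow is abandoned
}

impl Reassembler {
    /// Returns true if the segment was buffered, false if dropped.
    fn push(&mut self, seg: &[u8]) -> bool {
        if self.poisoned {
            return false; // essentially free: flag check + early return
        }
        if let Some(cap) = self.cap {
            if self.buf.len() + seg.len() > cap {
                self.poisoned = true; // over cap: poison and release memory
                self.buf = Vec::new();
                return false;
            }
        }
        self.buf.extend_from_slice(seg); // under-cap hot path
        true
    }
}

fn main() {
    let mut r = Reassembler { buf: Vec::new(), cap: Some(8), poisoned: false };
    assert!(r.push(&[0; 4]));  // under cap: buffered
    assert!(!r.push(&[0; 8])); // would exceed cap: flow poisoned
    assert!(!r.push(&[0; 1])); // poisoned: flag check + early return
}
```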
### Dedup (Plan 49)
| Benchmark | Time | Notes |
| --- | --- | --- |
| `dedup/unique_64` | ~860 ns | small-frame hash + lookup |
| `dedup/unique_1500` | ~1.2 µs | typical-MTU hash + lookup |
| `dedup/duplicate_1500` | ~1.2 µs | match-and-drop path |

Most of the cost is the `ahash` of the frame bytes. For loopback
captures running at ~1 Gbps with a 1500-byte MTU (~80k pps),
that's 80k × 1.2 µs ≈ 96 ms of CPU per second — roughly 10% of
one core. Acceptable for the bug class it prevents.
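The hash-then-lookup shape can be sketched as follows. This is a
hypothetical illustration, not flowscope's code: it uses the std
`DefaultHasher` where the real crate uses `ahash`, and it keeps an
unbounded `HashSet` where a real dedup window would be bounded:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

struct Dedup {
    seen: HashSet<u64>, // real implementations bound this window
}

impl Dedup {
    /// Returns true if the frame is unique (keep it), false if duplicate.
    fn check(&mut self, frame: &[u8]) -> bool {
        // Dominant cost in the benchmarks: hashing all frame bytes.
        let mut h = DefaultHasher::new();
        frame.hash(&mut h);
        self.seen.insert(h.finish())
    }
}

fn main() {
    let mut d = Dedup { seen: HashSet::new() };
    let frame = vec![0xAA_u8; 1500];
    assert!(d.check(&frame));  // first sighting: unique, keep
    assert!(!d.check(&frame)); // same bytes again: duplicate, drop
}
```

Hashing every byte is why `unique_1500` costs more than `unique_64`: the
set lookup itself is nearly constant, but the hash scales with frame size.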
### Session driver
| Benchmark | Time |
| --- | --- |
| `session_driver/passthrough` | ~500–800 ns |

End-to-end cost of `FlowSessionDriver::track` with a no-op
`SessionParser`. Dominated by tracker + reassembler dispatch +
the per-side drain loop.
## Reading the numbers
- **Don't optimise without measuring.** Re-run locally on your
target hardware before assuming flowscope is the bottleneck.
- **The bench is the regression detector.** If a future change
  shows up as ≥10% slower in `cargo bench`, investigate before
  shipping. Criterion's HTML reports diff against the previous
  run automatically.
- **Per-call vs per-packet.** All numbers above are *per call*
to the named function. Many real packets trigger multiple
calls (extract → tracker.track → reassembler.segment, etc.).
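A rough per-packet estimate for an in-order TCP packet that traverses
extract → track → reassemble, summing the best-case baselines from the
tables above (illustrative arithmetic only — real per-packet cost
depends on flow count, ordering, and drain behavior):

```rust
fn main() {
    // Best-case per-call baselines from the 0.3.0 tables above.
    let extract_ns = 115.0_f64;  // extractor/five_tuple_ipv4_tcp
    let track_ns = 315.0;        // tracker/monoflow
    let reassemble_ns = 87.0;    // reassembler/in_order_1500_uncapped
    let total = extract_ns + track_ns + reassemble_ns; // ~517 ns/packet
    println!("~{total:.0} ns/packet -> ~{:.1}M pkts/sec/core", 1e9 / total / 1e6);
}
```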
## Running benchmarks
```sh
# All groups:
cargo bench --all-features
# One group:
cargo bench --all-features --bench tracker
# Compare against a saved baseline (criterion auto-saves):
cargo bench --all-features --bench tracker -- --save-baseline before
# ... make changes ...
cargo bench --all-features --bench tracker -- --baseline before
# Stress run with extra iterations:
cargo bench --all-features --bench tracker -- --warm-up-time 5 --measurement-time 30
```
## Future perf work
Areas a future plan could investigate, ordered by potential
impact:
1. **Zero-copy reassembly via `BytesMut` pool** in netring's
async path. ~80k allocs/sec/Gbps eliminated. Estimated 1.5–
2.5× throughput on TCP-heavy workloads. (netring-side work,
not flowscope.)
2. **Faster hashing for `Dedup`.** xxhash3 (no-std,
~zero-cost) is faster than ahash on large frames. Cost: one
new dep. Win: maybe ~30% on `dedup/unique_1500`.
3. **HashMap shard / dashmap for `FlowTracker`.** Only relevant
if profiling shows the LruCache as a contention point under
multi-thread access (currently not the model — flowscope is
sync, parallelism happens outside).
4. **SIMD header parsing.** etherparse is already fast; SIMD
wins are real but marginal at our packet sizes. Skip unless
real evidence surfaces.