# Benchmarking Lessence
Lessence uses [criterion.rs](https://github.com/bheisler/criterion.rs)
for performance benchmarks. The bench suite is designed as a **local
perf gate**, not a CI gate — `examples/` (the bench corpus) is
gitignored, so CI has no corpus to run against.
When you change hot-path or flush-path code, record a baseline
before the change and compare after. Any regression above **3%** on
text-mode end-to-end throughput is a blocking ship bug. Stretch target
is **0%**.
## Prerequisites
- The `examples/` directory must contain the six Tier 1 corpus files
(see below). They are gitignored and distributed separately.
- For lowest-noise measurements, run on a quiet machine: close
browsers, pause background builds, plug in power (disable battery
throttling), and avoid thermal throttling if possible.
## Bench suite overview
| `folder_e2e` | `thread_count=Some(1)`, `min_collapse=3` | **Canonical gate.** Full `process_line → finish()` cost, single-threaded text mode. This is the bench whose baseline matters most. |
| `folder_streaming_only` | `thread_count=Some(1)`, `min_collapse=usize::MAX` | Subtraction-method `T_stream`: disables collapsed formatting so per-group flush is minimised. Used with `folder_e2e` to see where regressions land (streaming vs flush). |
| `folder_parallel_e2e` | `thread_count=None` | Parallel-mode sanity check. Same corpus, rayon auto-detect. Catches regressions in the parallel join path. |
| `normalize_line` | per-line `Normalizer::normalize_line` | Pattern-detection micro-bench. Isolates regressions in the normalizer before they dilute folder-level benches. |
All four share the same Tier 1 corpus for apples-to-apples comparison.
## Tier 1 corpus
Six files under `examples/`. Each exercises a different pattern mix:
| `kubelet.log` | 20 MB | K8s: UUIDs, paths, kubernetes tokens, high volume |
| `argocd_controller_production.log` | 4 MB | Structured app logs, high token density |
| `harbor_postgres_primary.log` | 7.5 MB | Postgres: duration/timestamp/number-heavy (count-only dominant) |
| `openssh_brute_force.log` | 4 MB | One-template pathological case — stress-tests single-group flush |
| `apache_error_production.log` | 1.9 MB | Apache: brackets, quoted strings |
| `nginx_sample.log` | 6.5 KB | Tiny — fixed-overhead floor |
`epyc_7days_journalctl.log` is **not** in the Tier 1 gate. Its pattern
density (1000+ distinct templates in the first 5k lines) trips the
`should_flush_buffer()` cap at `src/folder.rs:439`, which contaminates
the subtraction method. Journalctl coverage is deferred to the Phase 5
manual sweep (`docs/rollup-calibration.md`).
## Recording a baseline
On a quiet machine:
```bash
cargo bench --bench folder_e2e -- --save-baseline main-pre-feature
cargo bench --bench folder_streaming_only -- --save-baseline main-pre-feature
cargo bench --bench folder_parallel_e2e -- --save-baseline main-pre-feature
cargo bench --bench normalize_line -- --save-baseline main-pre-feature
```
Baselines are stored under `target/criterion/<bench>/<id>/base/` —
gitignored via `target/`.
**For ground-truth baselines**, run the full suite three times across
different machine states (cold cache, warm cache, after a short idle)
and keep the median. Criterion's sample_size=50 and 30s measurement
window handle per-run noise, but machine-level wobble (thermal,
background load) is outside criterion's control.
## Comparing against a baseline
After a change:
```bash
cargo bench --bench folder_e2e -- --baseline main-pre-feature
```
Criterion reports each result as one of:
- `No change in performance detected.` — under the 2% noise threshold.
- `Change within noise threshold.` — statistically plausible noise.
- `Performance has improved.` — green, carry on.
- `Performance has regressed.` — **inspect the `%` change carefully**.
### Gate rule
| ≤ +3% | Pass. Ship. |
| > +3% and ≤ +5% *only* on `openssh_brute_force.log` | Flagged worst case. Justify or autoresearch. |
| > +3% on any other Tier 1 file | **Fail.** Invoke autoresearch; do not merge. |
See `.claude/plans/harmonic-churning-karp.md` for the full gate
criteria per implementation phase.
## Configuration knobs
Each bench configures criterion with these settings (defined inline
in each bench file):
```rust
group.sample_size(50); // default is 100
group.measurement_time(Duration::from_secs(30));
group.warm_up_time(Duration::from_secs(5));
group.noise_threshold(0.02); // 2% = "no change"
group.significance_level(0.01); // 1% false-positive rate
```
These values were picked to push the noise floor below the 3% gate
threshold. If you need to iterate faster during development, override
at the CLI level — but note that criterion's group-level settings
**override** CLI flags like `--sample-size`, so you'll need to edit
the bench source for fast-iteration loops.
## Throughput output
Each bench reports throughput in MiB/s via `group.throughput(Bytes(...))`.
This is a more informative number than raw latency for a regression
gate — "the bench is 3% slower" is equivalent to "throughput dropped
3%", and MiB/s is the number you write in a PR body.
## Single-threaded mode is the control
`thread_count == Some(1)` takes an early branch in
`PatternFolder::process_line()` at `src/folder.rs:171`. It avoids
rayon entirely — no thread pool, no batching. This makes it the
deterministic bench control: no cross-thread scheduling noise, no
batch-size artifacts.
The parallel bench (`folder_parallel_e2e`) is a sanity check for the
rayon path, not a throughput claim. Compare it against its own
baseline, not against single-threaded numbers.
## What to look for on a regression
1. **Which bench regressed?** — `folder_e2e` is the gate;
`folder_streaming_only` and `folder_parallel_e2e` tell you *where*.
2. **Which file regressed?** — a regression on one file and not
others usually points to a pattern-detector or token-type-specific
issue.
3. **Does `normalize_line` regress too?** — if yes, the issue is in
the pattern-detection path, not the folder.
4. **Does `folder_streaming_only` regress?** — if yes, the regression
is in the per-line hot path, which is unexpected for flush-time
features. If no, the regression is in flush/finish, which is where
you'd expect rollup-related changes to show up.
5. **Profile.** Run `cargo bench --bench folder_e2e -- --profile-time
30` to get per-function samples. Don't guess about where the cost
lives.
## Running a subset
Criterion accepts a substring filter as a positional argument:
```bash
cargo bench --bench folder_e2e -- nginx_sample # only the tiny file
cargo bench --bench folder_e2e -- kubelet # only the kubelet input
cargo bench --bench folder_streaming_only -- postgres # substring match
```
## Cleaning up baselines
Criterion baselines accumulate under `target/criterion/`. To delete
a specific named baseline:
```bash
find target/criterion -type d -name "main-pre-feature" -exec rm -rf {} +
```
Or nuke all criterion state:
```bash
rm -rf target/criterion/
```