treetop-core 0.0.17

# Performance Benchmarks

This project includes two benchmark systems for the `PolicyEngine::evaluate()` hot path:

- **Criterion** for wall-clock latency/throughput trends
- **iai-callgrind** for instruction-level deterministic regression detection

Both are configured with a scenario matrix that stresses key dimensions:

- policy-set size (small/medium/large)
- allow vs deny paths
- group cardinality
- label-registry complexity
- namespace depth
- observability enabled/disabled
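
As a rough illustration of how such a matrix can be expressed, the sketch below enumerates a subset of the dimensions as a cartesian product. Every name in it (`PolicySetSize`, `Outcome`, `Scenario`, `scenario_matrix`, `id`) and every dimension value is hypothetical and is not taken from `benches/evaluate_common.rs`; the real fixture builder may be structured differently. Observability is left out because it is toggled at build time through `--features observability`, not per scenario.

```rust
/// Hypothetical policy-set size buckets; the real benches may differ.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PolicySetSize {
    Small,
    Medium,
    Large,
}

/// Whether the benchmarked request is expected to be allowed or denied.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Outcome {
    Allow,
    Deny,
}

/// One point in the scenario matrix (illustrative subset of dimensions).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Scenario {
    size: PolicySetSize,
    outcome: Outcome,
    /// Number of groups the principal belongs to.
    group_count: usize,
    /// Number of nested namespace segments.
    namespace_depth: usize,
}

impl Scenario {
    /// Stable identifier built only from the scenario's own fields,
    /// so the same scenario keeps the same name across runs.
    fn id(&self) -> String {
        format!(
            "{:?}/{:?}/groups={}/ns_depth={}",
            self.size, self.outcome, self.group_count, self.namespace_depth
        )
        .to_lowercase()
    }
}

/// Builds the cartesian product of the chosen dimension values.
fn scenario_matrix() -> Vec<Scenario> {
    let sizes = [PolicySetSize::Small, PolicySetSize::Medium, PolicySetSize::Large];
    let outcomes = [Outcome::Allow, Outcome::Deny];
    let group_counts = [0, 16]; // assumed values, for illustration only
    let namespace_depths = [1, 4]; // assumed values, for illustration only

    let mut out = Vec::new();
    for size in sizes {
        for outcome in outcomes {
            for group_count in group_counts {
                for namespace_depth in namespace_depths {
                    out.push(Scenario { size, outcome, group_count, namespace_depth });
                }
            }
        }
    }
    out
}
```

Deriving the identifier from the fields alone is one way to keep benchmark names stable when the matrix is reordered, which helps when comparing results over time.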

## Bench Files

- `benches/evaluate_common.rs` - shared scenario matrix + fixture builder
- `benches/evaluate_criterion_baseline.rs` - Criterion baseline scenarios
- `benches/evaluate_criterion_groups.rs` - Criterion group-heavy scenarios
- `benches/evaluate_criterion_labels.rs` - Criterion label-heavy scenarios
- `benches/evaluate_criterion_namespaced.rs` - Criterion namespaced scenarios
- `benches/evaluate_iai_baseline.rs` - iai-callgrind baseline scenarios
- `benches/evaluate_iai_groups.rs` - iai-callgrind group-heavy scenarios
- `benches/evaluate_iai_labels.rs` - iai-callgrind label-heavy scenarios
- `benches/evaluate_iai_namespaced.rs` - iai-callgrind namespaced scenarios

## Run Locally

### All benchmarks

```bash
cargo bench
```

With observability enabled:

```bash
cargo bench --features observability
```

> Note: `iai-callgrind` requires Linux + Valgrind. On macOS, prefer running Criterion benches and use CI for `iai-callgrind`.

### Criterion (default features)

```bash
cargo bench --bench evaluate_criterion_baseline -- --noplot
```

Replace `evaluate_criterion_baseline` with `evaluate_criterion_groups`, `evaluate_criterion_labels`, or `evaluate_criterion_namespaced` to run those slices.

### Criterion (observability enabled)

```bash
cargo bench --bench evaluate_criterion_baseline --features observability -- --noplot
```

### iai-callgrind (default features)

Requires:

- `valgrind`
- `iai-callgrind-runner` (install with `cargo install --locked iai-callgrind-runner`)

> Note: `iai-callgrind-runner` is Linux-only because it depends on Valgrind/Callgrind.

```bash
cargo bench --bench evaluate_iai_baseline
```

Replace `evaluate_iai_baseline` with `evaluate_iai_groups`, `evaluate_iai_labels`, or `evaluate_iai_namespaced` to run those slices.

### iai-callgrind (observability enabled)

```bash
cargo bench --bench evaluate_iai_baseline --features observability
```

### Recommended local workflow by platform

- **macOS:** Run Criterion locally (`cargo bench --bench evaluate_criterion_baseline ...`) and use CI for `iai-callgrind`.
- **Linux:** Run both Criterion and `iai-callgrind` locally.

## Criterion Regression Compare

Use the helper script to compare two Criterion result directories:

```bash
python3 scripts/perf/compare_criterion.py <base_target_dir> <head_target_dir> <max_regression_pct>
```

Example:

```bash
python3 scripts/perf/compare_criterion.py /tmp/criterion-base-no-obs /tmp/criterion-head-no-obs 8
```

The script exits non-zero if any scenario regresses by more than `<max_regression_pct>` percent.
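
The gate amounts to a relative-change check per scenario. The sketch below shows that arithmetic in Rust, assuming the comparison is `(head - base) / base * 100` on mean times. The function names and the tuple layout are hypothetical, and `compare_criterion.py` may compute or report the value differently.

```rust
/// Relative change of `head_ns` versus `base_ns`, in percent.
/// Positive values mean head is slower. Returns `None` when the inputs are
/// not finite or the base is not positive, since the ratio would be meaningless.
fn regression_pct(base_ns: f64, head_ns: f64) -> Option<f64> {
    if !base_ns.is_finite() || !head_ns.is_finite() || base_ns <= 0.0 {
        return None;
    }
    Some((head_ns - base_ns) / base_ns * 100.0)
}

/// Returns the names of scenarios whose regression exceeds `max_pct`.
/// Each entry is `(scenario name, base mean in ns, head mean in ns)`.
fn regressed<'a>(results: &[(&'a str, f64, f64)], max_pct: f64) -> Vec<&'a str> {
    results
        .iter()
        .filter_map(|&(name, base, head)| {
            regression_pct(base, head)
                .filter(|pct| *pct > max_pct)
                .map(|_| name)
        })
        .collect()
}
```

With a threshold of `8`, a scenario that moves from 100 ns to 109 ns (9% slower) would be flagged under this definition, while 100 ns to 107 ns (7% slower) would pass. Improvements produce negative percentages and never trip the gate.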

## CI Layout

Workflow: `.github/workflows/perf.yml`

- **criterion-regression** (gating):
  - checks out base commit in a worktree
  - runs `evaluate_criterion_*` benches on base and head
  - compares means using `scripts/perf/compare_criterion.py`
  - fails if any scenario exceeds `PERF_MAX_REGRESSION_PCT` (default `8`)
- **iai-callgrind-regression**:
  - runs the base commit and saves the result as a baseline (`--save-baseline base`)
  - runs the head commit against that baseline (`--baseline base`)
  - runs across an observability on/off matrix
  - posts a PR comment with a short summary and output tail for each matrix variant

## Recommended Repo Workflow

- Protect `main` and require pull requests for changes.
- Require Perf workflow checks to pass before merge.
- Use PR-to-`main` as the primary performance regression gate.
- Keep direct pushes to `main` disabled, except for maintainers handling emergencies.

## Tuning Guidance

- Start with a looser threshold (for example 8-10%) and tighten after a few weeks of data.
- Prefer adding new scenarios only when they map to real production-like request shapes.
- Keep matrix entries stable to preserve trend comparability over time.
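
One possible way to tighten the threshold once data is available is to look at noise-only deltas, meaning percent differences observed between runs of the same commit, and set the gate just above the worst of them. The sketch below is an assumption about how that could be done, not a description of this project's process; `suggest_threshold_pct` and its parameters are hypothetical.

```rust
/// Suggests a gating threshold (in percent) from noise-only deltas,
/// i.e. percent differences between benchmark runs of the same commit.
/// Takes the largest absolute delta and adds `margin_pct` on top.
/// Returns `None` if no samples were provided.
fn suggest_threshold_pct(noise_deltas_pct: &[f64], margin_pct: f64) -> Option<f64> {
    noise_deltas_pct
        .iter()
        .map(|d| d.abs())
        // Track the running maximum; stays `None` for an empty slice.
        .fold(None, |acc: Option<f64>, d| Some(acc.map_or(d, |a| a.max(d))))
        .map(|worst| worst + margin_pct)
}
```

For example, if same-commit runs differed by at most 3% and a 2-point margin is chosen, this suggests a 5% gate. A percentile instead of the maximum may be a better fit if occasional outliers dominate.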