Module bench


Pillar 3 / Stream E — ai-memory bench workload runner.

Measures hot-path operations against the budgets published in PERFORMANCE.md and returns p50/p95/p99 latencies plus a pass/fail verdict per operation. The CI guard (Stream F) enforces the same 10% p95 tolerance documented in PERFORMANCE.md.
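As a rough illustration of the p50/p95/p99 summary described above, the sketch below computes nearest-rank percentiles over a latency sample after discarding warmup iterations. The nearest-rank method, the `Percentiles` type, and the function names are assumptions made for this example; the bench may compute its percentiles differently.

```rust
use std::time::Duration;

/// Nearest-rank percentile over an already-sorted, non-empty sample.
/// `pct` is expected to lie in 0.0..=100.0.
fn percentile(sorted: &[Duration], pct: f64) -> Duration {
    assert!(!sorted.is_empty(), "need at least one sample");
    // Nearest-rank: the smallest value with at least pct% of samples at or below it.
    let rank = (pct * sorted.len() as f64 / 100.0).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Hypothetical summary type; the crate's real `OperationResult` may differ.
#[derive(Debug, PartialEq)]
struct Percentiles {
    p50: Duration,
    p95: Duration,
    p99: Duration,
}

/// Drop the first `warmup` samples, then summarise the remainder.
/// Returns `None` when nothing is left to measure.
fn summarise(samples: &[Duration], warmup: usize) -> Option<Percentiles> {
    let mut kept: Vec<Duration> = samples.iter().skip(warmup).copied().collect();
    if kept.is_empty() {
        return None;
    }
    kept.sort();
    Some(Percentiles {
        p50: percentile(&kept, 50.0),
        p95: percentile(&kept, 95.0),
        p99: percentile(&kept, 99.0),
    })
}
```

With a small sample the p99 rank collapses onto the maximum, which is one reason a workload needs enough iterations for the tail percentiles to carry signal.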

Coverage in this build:

  • Embedding-free CRUD: memory_store (no embedding), memory_search (FTS5), memory_recall (hot, depth=1).
  • Knowledge-graph traversal:
    • memory_kg_query (depth=1) and memory_kg_timeline against a fan-out fixture (50 sources × 4 outbound links each, every link valid_from-stamped).
    • memory_kg_query (depth=3, depth=5) against a chain fixture (50 chains × 5 hops each = 300 memories + 250 links). depth=3 hits the “depth ≤ 3” 100 ms budget bucket; depth=5 hits the “depth ≤ 5” 250 ms tail-case bucket.

Both fixtures live in the same in-process disposable SQLite — no external service required.
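The fixture sizes quoted above follow from simple counting: a chain of 5 hops visits 6 memories, so 50 chains give 300 memories and 250 links, and the fan-out fixture carries 50 × 4 = 200 links. The helper names below are hypothetical and exist only to make that arithmetic checkable.

```rust
/// Size of a chain fixture made of `chains` independent chains with `hops`
/// links each. A chain with `hops` links visits `hops + 1` memories.
/// Returns `(memories, links)`.
fn chain_fixture_size(chains: usize, hops: usize) -> (usize, usize) {
    let memories = chains * (hops + 1);
    let links = chains * hops;
    (memories, links)
}

/// Link count of a fan-out fixture where every source has `fan_out`
/// outbound links.
fn fan_out_link_count(sources: usize, fan_out: usize) -> usize {
    sources * fan_out
}
```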

Embedding-bound paths (memory_store with embedding, memory_recall cold/full hybrid) still require an embedder process and are tracked as follow-up Stream E work — they don’t belong on the hot path of a cargo test invocation.

Structs

BaselineRecord
Subset of OperationResult retained when loading a previous run for --baseline comparison. Only the fields the regression check actually consumes are required, so any superset of those fields (the full bench --json output included) deserializes cleanly.
BenchConfig
OperationResult
Regression
Per-operation regression row produced by compare_against_baseline.
ScaleBudgets
#1579 B8 — one row of the per-scale p95 budget table published in PERFORMANCE.md §“Corpus-scale budgets”. Only the three corpus-sensitive operations carry scale-specific budgets; the KG operations run against fixed-size fixtures (50×4 fan-out, 50×5 chains) whose cost is independent of the seeded corpus scale, so they keep their canonical budgets at every scale.
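The relationship between a baseline record, a fresh result, and a regression row can be sketched as follows, using the behaviour documented for compare_against_baseline: rows follow the order of the current run, and operations missing from the baseline are skipped. The struct and field names here are stand-ins invented for the example, not the crate's actual definitions.

```rust
/// Hypothetical stand-in for `OperationResult` / `BaselineRecord`;
/// the real structs carry more fields.
#[derive(Debug, Clone)]
struct Measured {
    operation: String,
    p95_ms: f64,
}

/// Hypothetical regression row; the field names are assumptions.
#[derive(Debug, PartialEq)]
struct RegressionRow {
    operation: String,
    baseline_p95_ms: f64,
    current_p95_ms: f64,
    growth_pct: f64,
    regressed: bool,
}

/// Compare `current` against `baseline`, keeping the order of `current`.
/// Assumes every baseline p95 is greater than zero.
fn compare(current: &[Measured], baseline: &[Measured], threshold_pct: f64) -> Vec<RegressionRow> {
    current
        .iter()
        .filter_map(|cur| {
            // Operations absent from the baseline are skipped, not reported.
            let base = baseline.iter().find(|b| b.operation == cur.operation)?;
            let growth_pct = (cur.p95_ms - base.p95_ms) / base.p95_ms * 100.0;
            Some(RegressionRow {
                operation: cur.operation.clone(),
                baseline_p95_ms: base.p95_ms,
                current_p95_ms: cur.p95_ms,
                growth_pct,
                regressed: growth_pct > threshold_pct,
            })
        })
        .collect()
}
```

Iterating over the current run, rather than the baseline, is what lets a newly added bench row pass through without an entry in an older baseline file.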

Enums

Operation
Hot-path operations covered by this iteration of the bench tool.
Status

Constants

BENCH_NAMESPACE
Default seeded namespace for the bench workload.
CI_SCALE_GATE_ROWS
#1579 B8 — canonical corpus scale (rows) for the scale-gate run (ai-memory bench --scale 10000). The P1 perf-audit proved the default workload (~500 rows after per-op seeding) cannot see corpus-scale budget blowouts (recall p95 361 ms vs the 50 ms budget at 100k rows was invisible to the built-in bench).
DEFAULT_ITERATIONS
Default workload size — keep small enough for cargo test, large enough that p99 has signal.
DEFAULT_REGRESSION_THRESHOLD_PCT
Default tolerance applied when comparing a fresh run against a --baseline JSON file: a measured p95 may grow by this percentage before the run is flagged as a regression. Independent of P95_TOLERANCE (which guards against the absolute budget). The baseline guard catches drift that stays inside the absolute budget but trends in the wrong direction across releases.
DEFAULT_WARMUP
Default warmup iterations discarded from the percentile sample.
MACOS_BUDGET_MULT
MAX_ITERATIONS
Hard ceiling on --iterations — bounds bench wall-clock on a mistyped flag.
MAX_REGRESSION_THRESHOLD_PCT
Hard ceiling on --regression-threshold (percent) — values above this are clamped; a 1000% allowance already means “no gate”.
MAX_SCALE
#1579 B8 — hard ceiling on --scale rows. Bounds seeding wall-clock
MAX_WARMUP
Hard ceiling on --warmup iterations.
P95_TOLERANCE
CI guard tolerance — measured p95 may exceed budget by this factor before the run is marked Fail. Mirrors PERFORMANCE.md.
SCALE_BUDGETS
#1579 B8 — the per-scale budget table (SSOT; PERFORMANCE.md §“Corpus-scale budgets” narrates these numbers and the operation_scale_targets_match_performance_md test pins them).
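The two gates described by P95_TOLERANCE and DEFAULT_REGRESSION_THRESHOLD_PCT are independent, which the sketch below tries to make concrete: one check compares a measured p95 against the absolute budget, the other compares it against a baseline p95. All names and the percentage-based parameters are assumptions for illustration; the actual constant values and the `Status` variants are not taken from the crate.

```rust
/// Hypothetical verdict type; the crate's `Status` enum may have other variants.
#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    Fail,
}

/// Absolute gate: fail once the measured p95 exceeds the budget plus tolerance.
/// `tolerance_pct = 10.0` corresponds to the 10% tolerance the module docs mention.
fn budget_verdict(p95_ms: f64, budget_ms: f64, tolerance_pct: f64) -> Verdict {
    let ceiling = budget_ms * (1.0 + tolerance_pct / 100.0);
    if p95_ms <= ceiling {
        Verdict::Pass
    } else {
        Verdict::Fail
    }
}

/// Relative gate: flag a regression when p95 grew by more than `threshold_pct`
/// relative to the baseline. Assumes `baseline_p95_ms > 0.0`.
fn is_regression(baseline_p95_ms: f64, current_p95_ms: f64, threshold_pct: f64) -> bool {
    let growth_pct = (current_p95_ms - baseline_p95_ms) / baseline_p95_ms * 100.0;
    growth_pct > threshold_pct
}

/// Clamp a user-supplied threshold to a hard ceiling; values above it are capped.
fn clamp_threshold(requested_pct: f64, max_pct: f64) -> f64 {
    requested_pct.min(max_pct)
}
```

For instance, with a 50 ms budget and a 40 ms baseline, a 48 ms run passes the absolute gate yet has grown 20% over the baseline, so a 15% relative threshold would still flag it.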

Functions

append_history
Append a benchmark result to a JSONL history file. Creates the file and parent directories if missing. Each line is a self-describing JSON object with captured_at, iterations, warmup, and a results array.
compare_against_baseline
Compare a fresh run against a baseline. Operations missing from the baseline are skipped silently (e.g. a new bench row added since the baseline was captured). The returned Vec preserves the order of current and only includes ops present in both.
load_baseline
Load a previously emitted bench --json payload from disk.
render_regression_table
Render a regression table to a string, mirroring the layout of render_table.
render_table
Render a results table to a string in the same shape used in the PERFORMANCE.md “Operator Self-Verification” example.
run
Run the bench workload and return per-operation results.
scale_budgets_for
#1579 B8 — resolve the budget row for a requested scale: the first table row whose scale >= requested, else the largest pinned row (best-effort; pin a new table row before gating larger scales).
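The resolution rule documented for scale_budgets_for (first row whose scale >= requested, else the largest pinned row) can be sketched as below. The `BudgetRow` type, its field names, and the table values are placeholders invented for the example; they are not the budgets published in PERFORMANCE.md, and the sketch assumes the table is sorted by ascending scale.

```rust
/// Hypothetical budget row; the real `ScaleBudgets` fields may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BudgetRow {
    scale: usize,
    p95_budget_ms: f64,
}

/// Placeholder table, sorted ascending by `scale`. The numbers are made up.
const TABLE: [BudgetRow; 3] = [
    BudgetRow { scale: 1_000, p95_budget_ms: 1.0 },
    BudgetRow { scale: 10_000, p95_budget_ms: 2.0 },
    BudgetRow { scale: 100_000, p95_budget_ms: 3.0 },
];

/// Pick the first row whose scale covers the request; if the request exceeds
/// every pinned row, fall back to the largest one (best-effort).
/// Returns `None` only for an empty table.
fn budgets_for(table: &[BudgetRow], requested: usize) -> Option<BudgetRow> {
    table
        .iter()
        .find(|row| row.scale >= requested)
        .or_else(|| table.last())
        .copied()
}
```

Under this rule a request between two pinned scales is held to the budget of the next larger row, and a request beyond the table reuses the largest row's budget, which is why the docs recommend pinning a new row before gating larger scales.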