Expand description
Pillar 3 / Stream E — ai-memory bench workload runner.
Measures hot-path operations against the budgets published in
PERFORMANCE.md and returns p50/p95/p99 latencies plus a pass/fail
verdict per operation. The CI guard (Stream F) enforces the same
10% p95 tolerance documented in PERFORMANCE.md.
Coverage in this build:
- Embedding-free CRUD:
memory_store(no embedding),memory_search(FTS5),memory_recall(hot, depth=1). - Knowledge-graph traversal:
memory_kg_query(depth=1) andmemory_kg_timelineagainst a fan-out fixture (50 sources × 4 outbound links each, every linkvalid_from-stamped).memory_kg_query(depth=3, depth=5) against a chain fixture (50 chains × 5 hops each = 300 memories + 250 links). depth=3 hits the “depth ≤ 3” 100 ms budget bucket; depth=5 hits the “depth ≤ 5” 250 ms tail-case bucket.
Both fixtures live in the same in-process disposable SQLite — no
external service required.
Embedding-bound paths (memory_store with embedding,
memory_recall cold/full hybrid) still require an embedder process
and are tracked as follow-up Stream E work — they don’t belong on
the hot path of a cargo test invocation.
Structs§
- Baseline
Record - Subset of
OperationResultretained when loading a previous run for--baselinecomparison. Only the fields the regression check actually consumes are required, so any superset of those fields (the fullbench --jsonoutput included) deserializes cleanly. - Bench
Config - Operation
Result - Regression
- Per-operation regression row produced by
compare_against_baseline. - Scale
Budgets - #1579 B8 — one row of the per-scale p95 budget table published in
PERFORMANCE.md§“Corpus-scale budgets”. Only the three corpus-sensitive operations carry scale-specific budgets; the KG operations run against fixed-size fixtures (50×4 fan-out, 50×5 chains) whose cost is independent of the seeded corpus scale, so they keep their canonical budgets at every scale.
Enums§
Constants§
- BENCH_
NAMESPACE - Default seeded namespace for the bench workload.
- CI_
SCALE_ GATE_ ROWS - #1579 B8 — canonical corpus scale (rows) for the scale-gate run
(
ai-memory bench --scale 10000). The P1 perf-audit proved the default workload (~500 rows after per-op seeding) cannot see corpus-scale budget blowouts (recall p95 361 ms vs the 50 ms budget at 100k rows was invisible to the built-in bench). - DEFAULT_
ITERATIONS - Default workload size — keep small enough for
cargo test, large enough that p99 has signal. - DEFAULT_
REGRESSION_ THRESHOLD_ PCT - Default tolerance applied when comparing a fresh run against a
--baselineJSON file: a measured p95 may grow by this percentage before the run is flagged as a regression. Independent ofP95_TOLERANCE(which guards against the absolute budget). The baseline guard catches drift that stays inside the absolute budget but trends in the wrong direction across releases. - DEFAULT_
WARMUP - Default warmup iterations discarded from the percentile sample.
- MACOS_
BUDGET_ MULT - MAX_
ITERATIONS - Hard ceiling on
--iterations— bounds bench wall-clock on a mistyped flag. - MAX_
REGRESSION_ THRESHOLD_ PCT - Hard ceiling on
--regression-threshold(percent) — values above this are clamped; a 1000% allowance already means “no gate”. - MAX_
SCALE - #1579 B8 — hard ceiling on
--scalerows. Bounds seeding wall-clock - MAX_
WARMUP - Hard ceiling on
--warmupiterations. - P95_
TOLERANCE - CI guard tolerance — measured p95 may exceed budget by this factor
before the run is marked
Fail. MirrorsPERFORMANCE.md. - SCALE_
BUDGETS - #1579 B8 — the per-scale budget table (SSOT;
PERFORMANCE.md§“Corpus-scale budgets” narrates these numbers and theoperation_scale_targets_match_performance_mdtest pins them).
Functions§
- append_
history - Append a benchmark result to a JSONL history file.
Creates the file and parent directories if missing.
Each line is a self-describing JSON object with
captured_at,iterations,warmup, andresultsarray. - compare_
against_ baseline - Compare a fresh run against a baseline. Operations missing from the
baseline are skipped silently (e.g. a new bench row added since the
baseline was captured). The returned
Vecpreserves the order ofcurrentand only includes ops present in both. - load_
baseline - Load a previously emitted
bench --jsonpayload from disk. - render_
regression_ table - Render a regression table to a string, mirroring the layout of
render_table. - render_
table - Render a results table to a string in the same shape used in the
PERFORMANCE.md“Operator Self-Verification” example. - run
- Run the bench workload and return per-operation results.
- scale_
budgets_ for - #1579 B8 — resolve the budget row for a requested scale: the first
table row whose
scale >= requested, else the largest pinned row (best-effort; pin a new table row before gating larger scales).