//! End-to-end validation of the `OutputFormat::LlmExtract` pipeline.
//!
//! Every test in this file boots a ktstr VM, runs `schbench` inside
//! it with the stock [`SCHBENCH`] fixture (which emits its latency
//! tables and summary lines to stderr by default — see
//! `tests/common/fixtures.rs` for the `output = LlmExtract` contract),
//! ships the captured stderr text from the guest to the host across
//! the SHM ring as a `RawPayloadOutput`, and routes it through the
//! local-model extraction pipeline (`extract_via_llm`) on the HOST
//! after VM exit. The host then applies a fixed set of universal
//! invariants against the extracted metrics; any violation folds
//! into the test's `AssertResult` as an `AssertDetail`.
//!
//! Lives in its own integration-test binary (not `ktstr_test_macro.rs`)
//! because exercising the LLM backend means loading the ~2.55 GiB
//! `DEFAULT_MODEL` from the model cache and running a multi-minute
//! inference call on host CPU. Isolating it keeps the cheap scheduler
//! tests free of that cost when filtering via nextest.
//!
//! **Host-only LLM extraction.** The model (~2.55 GiB GGUF) does
//! NOT load inside the guest VM: the test VM's RAM
//! budget cannot fit it, and the cache lives on the host. The
//! guest-side `evaluate()` skips every model code path for
//! `OutputFormat::LlmExtract` payloads and ships the raw
//! stdout/stderr across the SHM ring; the host's
//! `eval.rs::host_side_llm_extract` then runs `extract_via_llm`
//! after VM exit. As a consequence, `ctx.payload(&SCHBENCH).run()`
//! returns a `PayloadMetrics` with `metrics: vec![]` inside the
//! guest test body, because extraction is deferred. The body
//! therefore cannot inspect individual metrics here, and the
//! framework owns the sanity checks below.
//!
//! **Universal structural-sanity checks enforced host-side**:
//! 1. Every metric name is unique (duplicate dotted paths imply
//!    the LLM walker emitted the same key twice: a walker
//!    aggregation bug or malformed JSON path that would
//!    misattribute downstream stats).
//! 2. Every value is finite (no NaN / ±inf leaking into
//!    `PayloadMetrics`).
//! 3. Every metric carries `MetricSource::LlmExtract` (drift here
//!    points at a bypass: the value reached the LlmExtract slot
//!    without traversing the LLM walker).
//!
//! Workload-specific assertions (minimum metric count, sign,
//! magnitude bounds, semantic ranges) are intentionally NOT
//! enforced at the framework level — those vary per payload
//! (schbench's > 5 latency rows vs a hypothetical single-throughput
//! benchmark, or schbench's non-negative microseconds vs a
//! delta-emitting payload that legitimately reports negative deltas)
//! and require a per-payload validation API that ktstr does not yet
//! expose. See `eval.rs::validate_llm_extraction` for the host-side
//! enforcement.
//!
//! **Stability disclaimer**: passing this test does NOT mean
//! `LlmExtract` output is run-to-run stable for regression
//! comparisons. The invariants above pin only structural sanity,
//! not the extracted values or names. For stable schemas suitable
//! for run-to-run comparison and regression classification, use
//! [`SCHBENCH_JSON`](common::fixtures::SCHBENCH_JSON) with
//! `OutputFormat::Json` — the dotted-path schema lives in
//! schbench's `write_json_stats` and is fixed by the schbench
//! source, independent of the model.
//!
//! Model availability: tests here lazy-load the LLM model on the
//! first `extract_via_llm` invocation (see `load_inference` in
//! `src/test_support/model.rs`). With `KTSTR_MODEL_OFFLINE=1` and a
//! cold cache, the load fails and `host_side_llm_extract` appends an
//! `LlmExtract model load failed` detail. Inference itself is
//! multi-minute on host CPU regardless of cache state, so the
//! `test(model_loaded_)` nextest override at `.config/nextest.toml`
//! extends the slow-timeout to cover EVERY run; a cold cache
//! additionally pays the GGUF download cost on top of that.
use Result;
use SCHBENCH;
use AssertResult;
use ktstr_test;
use Ctx;
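
// Illustrative sketch only, NOT ktstr's implementation: the three universal
// invariants listed in the module doc (unique names, finite values,
// `LlmExtract` source), restated over hypothetical stand-in types.
// `SketchSource`, `SketchMetric`, and `sketch_invariant_violations` are
// names assumed for this example. The real enforcement is the host-side
// `eval.rs::validate_llm_extraction`, whose types and signature are not
// shown in this file and may differ from what is sketched here.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SketchSource {
    LlmExtract,
    Json,
}

#[allow(dead_code)]
#[derive(Debug, Clone)]
struct SketchMetric {
    name: String,
    value: f64,
    source: SketchSource,
}

// Returns one human-readable message per violated invariant. An empty
// vector means the metric set passed all three structural checks. This
// loosely mirrors the idea of folding violations into details, under the
// assumption that a plain `Vec<String>` is enough for illustration.
#[allow(dead_code)]
fn sketch_invariant_violations(metrics: &[SketchMetric]) -> Vec<String> {
    let mut seen = std::collections::HashSet::new();
    let mut violations = Vec::new();
    for m in metrics {
        // 1. Names must be unique: `insert` returns false on a repeat.
        if !seen.insert(m.name.as_str()) {
            violations.push(format!("duplicate metric name: {}", m.name));
        }
        // 2. Values must be finite (rejects NaN and +/-inf).
        if !m.value.is_finite() {
            violations.push(format!("non-finite value for {}: {}", m.name, m.value));
        }
        // 3. Every metric must be tagged as produced by the LLM walker.
        if m.source != SketchSource::LlmExtract {
            violations.push(format!("unexpected source for {}: {:?}", m.name, m.source));
        }
    }
    violations
}
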
/// Run schbench under the [`SCHBENCH`] fixture and validate the
/// [`OutputFormat::LlmExtract`](ktstr::test_support::OutputFormat::LlmExtract)
/// pipeline.
///
/// `llcs = 1, cores = 2, threads = 1, memory_mb = 2048`: schbench
/// wants at least one messenger + one worker thread, so two logical
/// CPUs is the minimum topology that gives it room to measure wake
/// latency. `memory_mb = 2048` (2048 MiB) matches the other in-VM
/// benchmark tests and leaves headroom for schbench's 2 MiB
/// per-thread shm allocations plus kernel overhead.
///
/// Test body shape: returns the `AssertResult` from
/// `ctx.payload(&SCHBENCH).run()` directly. The metric-set sanity
/// checks live host-side in `eval.rs::host_side_llm_extract` —
/// extraction is deferred until after VM exit because the model
/// does not fit in guest RAM. See the module doc for the universal
/// invariants the host applies.