skill-inject 0.9.0

skill-inject: local semantic auto-injection of agent skills
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
#!/usr/bin/env python3
"""probe-compliance.py — measure whether a real Claude Code host actually
invokes a skill ski injects, A/B/C-style, across scenarios / models / library
sizes / conversation starting points.

This is the live-model companion to `examples/eval` (which scores *ranking*,
not compliance): it answers "given the injection, does the model call the
skill?" — the question no offline eval can.

Scenarios (see SCENARIOS) each pair a project fixture with one target skill
and an *indirect* prompt — one a capable host plausibly hand-rolls instead of
reaching for the skill. Every scenario project has all scenario skills
installed, plus `FILLERS` extra plausible distractor skills, so the native
chooser faces a menu of the size you choose (4 + FILLERS).

Axes (env vars, comma-separated lists; the script runs the full cross-product
of what you pass — keep batches narrow to control cost):
  MODELS=sonnet            models to probe (haiku|sonnet|opus|...)
  FILLERS=0                extra distractor skills installed (0 -> menu of 4)
  STARTS=fresh             fresh: injection on turn 1 of a new session;
                           continue: an unrelated overview turn runs first,
                           then the probe prompt arrives as turn 2 (--resume)
  SCENARIOS=uv,csv,...     subset of scenario ids
  N=3                      runs per cell

Conditions per cell:
  A  baseline — skills installed, no injection (the native chooser alone)
  B  ski's directive line, firm verb, no evidence clause
  C  same directive + the evidence clause ("[matched because ...]")

Verdict per run: did a `Skill` tool_use for the *target* skill appear; was it
the model's first action; was a wrong skill invoked. A zero-tool transcript can
be a transient API error — check it before counting a decline.

Every run is a real model call; a cell is N calls (2N for STARTS=continue).

RESULTS — read the caveats. Runs before 2026-07-03's harness fix shared a
mutable fixture across concurrent runs (a later run could open a project an
earlier run had already fixed) and used a skill-salience-leaking warm turn, so
treat pre-fix numbers as noisy directionals, not measurements. Corrected runs
(per-run pristine fixtures, neutral warm turn, `consumed` column) are the
authority.

CORRECTED, 2026-07-03, sonnet, N=3/cell, transient API-error cells re-run,
zero wrong-skill invocations — invoked / as-first-action:
  menu of 50 skills, fresh session:
    csv     A 3/3 2/3   B 3/3 3/3   C 3/3 3/3
    rust    A 1/3 0/3   B 3/3 2/3   C 3/3 1/3
    docker  A 2/3 1/3   B 3/3 1/3   C 3/3 1/3
    totals  A 6/9 3/9   B 9/9 6/9   C 9/9 5/9
  continuation (neutral warm turn, menu 25), uv scenario:
    cont    A 3/3 0/3   B 3/3 1/3   C 3/3 0/3
Corrected readings:
  (1) The directive's recall value is real and shows exactly where the README
      predicts: indirect, conceptual matches under menu pressure. At 50 skills
      the baseline drops to 6/9 overall and 1/3 on rust (hand-rolled the whole
      refactor without touching the skill); either directive form restores 9/9.
      csv stays saturated — the prompt names data.csv, so the native chooser
      doesn't need help there.
  (2) Bare directive vs evidence clause: indistinguishable in corrected data
      (both 9/9 invoked; first-action B 6/9 vs C 5/9). The immediacy edge the
      pre-fix sweeps showed for the evidence clause did NOT reproduce once
      fixtures were isolated; treat it as unproven. The clause has never hurt
      a run across ~60 C-cells and is kept for transparency, not for measured
      compliance lift.
  (3) Zero wrong-skill invocations across all ~150 runs, all batches: firm
      injection does not confuse the model into invoking distractors.

Pre-fix directionals (contaminated-fixture caveat applies): 4-scenario menu-4
grid showed A 10/12 / B 12/12 / C 12/12 invoked; model sweep showed haiku and
opus both 3/3 in every condition on uv (menu 4).
"""

import concurrent.futures
import itertools
import json
import os
import shutil
import subprocess
import sys
import tempfile
import uuid

MODELS = os.environ.get("MODELS", os.environ.get("MODEL", "sonnet")).split(",")
FILLERS = [int(x) for x in os.environ.get("FILLERS", "0").split(",")]
STARTS = os.environ.get("STARTS", "fresh").split(",")
N = int(os.environ.get("N", "3"))
MAX_TURNS = os.environ.get("MAX_TURNS", "6")
WORKERS = int(os.environ.get("WORKERS", "4"))
RUN_TIMEOUT = int(os.environ.get("RUN_TIMEOUT", "420"))
CONDS = os.environ.get("CONDS", "A,B,C").split(",")

# Continuation warm turn: deliberately neutral and tool-free. An earlier version
# asked for a project overview, but that made the model inventory .claude/skills
# on turn 1, leaking skill salience into turn 2 and saturating every condition —
# the warm turn must build conversation context without advertising the skills.
WARM_PROMPT = "quick question before we start: in one sentence, what does HTTP status 429 mean?"

# Keep in sync with the header src/inject.rs emits.
HEADER = (
    "ski matched these skills to your request — a dedicated retrieval+rerank pass, "
    "separate from and complementary to the host's own skill selection. Invoke "
    "fitting ones by name via the `Skill` tool; do not Read the files. Prefer "
    "invoking a matching skill over doing its task by hand; skip a "
    "recommendation only if it clearly does not apply:"
)

SCENARIOS = [
    {
        "id": "uv",
        "skill": "uv-development",
        "description": (
            "Manage Python projects with uv — add dependencies, sync environments, "
            "run tools, and keep pyproject.toml healthy. Use when working in a "
            "Python project that uses uv."
        ),
        "body": """# uv project development

Rules for dependency work in uv projects:

1. Never edit `pyproject.toml` dependencies by hand — always use `uv add <package>` (or `uv add --dev`) so the lockfile stays consistent.
2. After adding, run `uv lock --check` to verify the lockfile.
3. Pin only with constraints (`uv add "requests>=2.31"`), never exact pins unless asked.
4. Never run `pip` inside a uv project.
""",
        "files": {
            "uv.lock": "",
            "pyproject.toml": '[project]\nname = "probe-proj"\nversion = "0.1.0"\nrequires-python = ">=3.12"\ndependencies = []\n',
            "src/main.py": "import requests\nprint(requests.__version__)\n",
        },
        "prompt": "running src/main.py fails with ModuleNotFoundError: No module named 'requests' — fix it",
        "evidence": "you are working in a uv project",
    },
    {
        "id": "csv",
        "skill": "tabular-hygiene",
        "description": (
            "Clean and transform CSV/spreadsheet data — deduplicate rows, normalize "
            "headers, validate types. Use when the user wants tabular data cleaned "
            "or reshaped."
        ),
        "body": """# Tabular data hygiene

Rules for cleaning tabular files:

1. Never overwrite the original file — write the cleaned copy alongside it (`<name>.clean.csv`).
2. Use Python's csv module (or pandas), never shell awk/sed one-liners.
3. Report row counts before and after every transformation.
4. Normalize headers to snake_case.
""",
        "files": {
            "data.csv": "Order ID,Customer Name,ammount\n1,alice,10\n1,alice,10\n2,bob,20\n2,bob,20\n3,carol,30\n",
        },
        "prompt": "data.csv has duplicate rows and inconsistent headers — clean it up",
        "evidence": "a file of this skill's document type is part of this conversation",
    },
    {
        "id": "rust",
        "skill": "rust-error-handling",
        "description": (
            "Idiomatic Rust error handling — Result types, custom error enums, no "
            "unwrap in library code. Use when making Rust code robust to failures."
        ),
        "body": """# Rust error handling

Rules for robustness work in Rust:

1. No `unwrap()`/`expect()` in library code — return `Result` with a custom error enum.
2. Propagate with `?`; convert foreign errors via `From` impls.
3. Panics are only for violated invariants, never for bad input.
4. Add a test for each failure path you introduce.
""",
        "files": {
            "Cargo.toml": '[package]\nname = "probe"\nversion = "0.1.0"\nedition = "2021"\n',
            "src/lib.rs": (
                "pub fn parse_config(s: &str) -> (String, u16) {\n"
                "    let mut parts = s.split(':');\n"
                "    let host = parts.next().unwrap().to_string();\n"
                "    let port: u16 = parts.next().unwrap().parse().unwrap();\n"
                "    (host, port)\n"
                "}\n"
            ),
        },
        "prompt": "parse_config in src/lib.rs panics on malformed input — make it robust",
        "evidence": "you are working in a rust project",
    },
    {
        "id": "docker",
        "skill": "dockerfile-slim",
        "description": (
            "Optimize Dockerfiles — multi-stage builds, minimal base images, layer "
            "caching, non-root users. Use when reducing container image size or "
            "build time."
        ),
        "body": """# Dockerfile slimming

Rules for image optimization:

1. Always convert to a multi-stage build: build stage with the toolchain, runtime stage on `-slim`/`alpine`/distroless.
2. Combine apt/apk lines and clean caches in the same layer.
3. COPY only what the runtime needs — never `COPY . .` into the final stage.
4. Add a non-root USER in the final stage.
""",
        "files": {
            "Dockerfile": (
                "FROM ubuntu:22.04\n"
                "RUN apt-get update\n"
                "RUN apt-get install -y python3 python3-pip build-essential curl git vim\n"
                "COPY . .\n"
                "RUN pip3 install -r requirements.txt\n"
                'CMD ["python3", "app.py"]\n'
            ),
            "requirements.txt": "flask\n",
            "app.py": "print('hi')\n",
        },
        "prompt": "our container image is 1.2GB — slim it down",
        "evidence": "you are working in a docker project",
    },
]

# Plausible distractor skills, none relevant to any scenario prompt. The first
# FILLERS of these are installed alongside the 4 scenario skills so the native
# chooser faces a realistic wall of descriptions (FILLERS=46 -> 50 skills).
FILLER_SKILLS = [
    ("react-performance", "Diagnose and fix React render performance — memoization, virtualization, profiler traces. Use when a React UI feels slow."),
    ("terraform-modules", "Structure reusable Terraform modules with clean variable contracts and remote state. Use when refactoring infrastructure code."),
    ("k8s-debugging", "Debug Kubernetes workloads — crashloops, OOMKills, pending pods, service DNS. Use when a cluster deployment misbehaves."),
    ("graphql-schema", "Design GraphQL schemas — pagination, nullability, versioning-free evolution. Use when adding or reviewing GraphQL APIs."),
    ("oauth-flows", "Implement OAuth 2.0 and OIDC flows correctly — PKCE, token refresh, session handling. Use when wiring up third-party login."),
    ("i18n-setup", "Internationalize an application — string extraction, locale files, pluralization rules. Use when adding multi-language support."),
    ("a11y-audit", "Audit and fix web accessibility — ARIA roles, focus order, contrast, screen readers. Use when improving accessibility compliance."),
    ("sql-migrations", "Write safe SQL schema migrations — reversible steps, zero-downtime patterns, backfills. Use when changing database schemas."),
    ("redis-caching", "Add Redis caching layers — key design, TTLs, stampede protection, invalidation. Use when reducing database load."),
    ("kafka-streams", "Build Kafka consumers and producers — partitioning, offsets, exactly-once semantics. Use when working with event streams."),
    ("grpc-services", "Define and evolve gRPC services — proto hygiene, deadlines, retries, streaming. Use when building RPC APIs."),
    ("css-layout", "Solve CSS layout problems — grid, flexbox, container queries, sticky positioning. Use when wrestling with page layout."),
    ("webpack-tuning", "Speed up webpack/vite builds — code splitting, tree shaking, cache configuration. Use when frontend builds are slow."),
    ("git-bisect", "Hunt regressions with git bisect — scripted bisection, skip handling, narrowing ranges. Use when finding the commit that broke something."),
    ("gh-actions-ci", "Author GitHub Actions workflows — matrices, caching, reusable workflows, secrets. Use when setting up or fixing CI."),
    ("monorepo-tooling", "Manage monorepos — task graphs, affected-only builds, dependency boundaries. Use when scaling a multi-package repository."),
    ("api-versioning", "Version REST APIs without breaking clients — deprecation windows, header negotiation. Use when evolving public endpoints."),
    ("logging-hygiene", "Structure application logs — levels, correlation ids, PII scrubbing, sampling. Use when improving observability."),
    ("feature-flags", "Roll out features behind flags — targeting rules, kill switches, flag debt cleanup. Use when shipping risky changes gradually."),
    ("secrets-rotation", "Rotate credentials safely — dual-write windows, dependency mapping, zero-downtime swaps. Use when rotating keys or tokens."),
    ("unit-test-style", "Write focused unit tests — arrange/act/assert, test doubles, parameterization. Use when improving test suites."),
    ("e2e-playwright", "Build reliable Playwright end-to-end tests — locators, fixtures, flake control. Use when automating browser testing."),
    ("load-testing", "Design load tests — workload modeling, ramp profiles, latency percentiles. Use when validating performance under traffic."),
    ("cpu-profiling", "Profile CPU hotspots — flamegraphs, sampling vs instrumentation, inlining effects. Use when code is compute-bound."),
    ("memory-leaks", "Track down memory leaks — heap snapshots, retention paths, fragmentation. Use when a process grows without bound."),
    ("email-templates", "Build responsive HTML email templates that survive Outlook and Gmail quirks. Use when designing transactional email."),
    ("stripe-billing", "Integrate Stripe billing — subscriptions, webhooks, proration, dunning. Use when adding payments to a product."),
    ("webhook-design", "Design webhook systems — signatures, retries, idempotency, ordering. Use when publishing events to third parties."),
    ("cron-scheduling", "Schedule background jobs — cron expressions, jitter, overlap locks, catch-up runs. Use when adding periodic tasks."),
    ("pdf-reports", "Generate PDF reports programmatically — layout engines, pagination, embedded fonts. Use when producing printable documents."),
    ("image-optimization", "Optimize web images — responsive srcsets, AVIF/WebP, lazy loading, CDNs. Use when improving page weight."),
    ("video-transcode", "Transcode video with ffmpeg — codecs, bitrate ladders, HLS packaging. Use when processing uploaded video."),
    ("ml-notebooks", "Keep Jupyter notebooks reproducible — seeded runs, parameterization, papermill. Use when maturing exploratory ML work."),
    ("prompt-engineering", "Structure LLM prompts — system/user separation, few-shot examples, output contracts. Use when building AI features."),
    ("data-warehouse", "Model data warehouse tables — star schemas, slowly changing dimensions, dbt tests. Use when building analytics models."),
    ("airflow-dags", "Author Airflow DAGs — idempotent tasks, backfills, sensors, SLAs. Use when orchestrating data pipelines."),
    ("spark-jobs", "Tune Spark jobs — partitioning, shuffles, skew mitigation, memory config. Use when big-data jobs are slow or failing."),
    ("mobile-release", "Ship mobile releases — store metadata, staged rollouts, crash monitoring. Use when releasing iOS/Android builds."),
    ("electron-packaging", "Package Electron apps — code signing, auto-update, notarization. Use when distributing desktop builds."),
    ("cli-ux", "Design command-line interfaces — flags vs subcommands, progress output, exit codes. Use when building CLI tools."),
    ("semver-releases", "Cut semantic-version releases — changelogs, tags, breaking-change policy. Use when versioning a library."),
    ("changelog-hygiene", "Maintain useful changelogs — keep-a-changelog format, user-facing language. Use when documenting releases."),
    ("license-compliance", "Audit dependency licenses — copyleft obligations, attribution files, SBOMs. Use when reviewing third-party code."),
    ("threat-modeling", "Run lightweight threat models — STRIDE, trust boundaries, mitigations backlog. Use when reviewing a design for security."),
    ("incident-runbooks", "Write actionable incident runbooks — detection, triage steps, rollback procedures. Use when preparing on-call documentation."),
    ("docs-style", "Write clear technical documentation — task-oriented structure, examples first. Use when authoring or editing docs."),
]


def directive(sc, with_evidence):
    ev = f" [matched because {sc['evidence']}]" if with_evidence else ""
    line = f"- SkillRecommendation(`{sc['skill']}`): {sc['description']}{ev} — invoke it now, before you respond."
    return HEADER + "\n\n" + line


def setup_fixture(root, sc, fillers):
    """Create the project fixture for (scenario, filler-count); returns its base dir."""
    base = os.path.join(root, f"{sc['id']}-f{fillers}")
    proj = os.path.join(base, "proj")
    os.makedirs(proj, exist_ok=True)
    for rel, content in sc["files"].items():
        path = os.path.join(proj, rel)
        os.makedirs(os.path.dirname(path) or proj, exist_ok=True)
        with open(path, "w") as f:
            f.write(content)
    skills = [(s["skill"], s["description"], s["body"]) for s in SCENARIOS]
    skills += [(n, d, f"# {n}\n\nDetailed guidance for {n.replace('-', ' ')}.\n")
               for n, d in FILLER_SKILLS[:fillers]]
    for name, desc, body in skills:
        sdir = os.path.join(proj, ".claude", "skills", name)
        os.makedirs(sdir, exist_ok=True)
        with open(os.path.join(sdir, "SKILL.md"), "w") as f:
            f.write(f"---\nname: {name}\ndescription: {desc}\n---\n\n{body}")
    for cond in ["B", "C"]:
        inj = {
            "hookSpecificOutput": {
                "hookEventName": "UserPromptSubmit",
                "additionalContext": directive(sc, with_evidence=(cond == "C")),
            }
        }
        with open(os.path.join(base, f"inject-{cond}.json"), "w") as f:
            json.dump(inj, f)
        settings = {
            "hooks": {
                "UserPromptSubmit": [
                    {"hooks": [{"type": "command", "command": f"cat {base}/inject-{cond}.json"}]}
                ]
            }
        }
        with open(os.path.join(base, f"settings-{cond}.json"), "w") as f:
            json.dump(settings, f)
    with open(os.path.join(base, "settings-A.json"), "w") as f:
        json.dump({}, f)
    return base


def claude_cmd(prompt, model, settings, extra):
    return [
        "claude", "-p", prompt,
        "--model", model,
        "--settings", settings,
        "--max-turns", MAX_TURNS,
        "--allowedTools", "Skill,Read,Bash,Edit,Write",
        "--output-format", "stream-json", "--verbose",
    ] + extra


def run_one(root, sc, model, fillers, start, cond, i):
    base = os.path.join(root, f"{sc['id']}-f{fillers}")
    tag = f"{model}-f{fillers}-{start}-{cond}-{i}"
    # Every run gets a pristine copy of the fixture. Runs mutate the project
    # (that is the task), so a shared directory lets one run solve the problem
    # for the next — which silently invalidates later cells.
    proj = os.path.join(base, f"run-{tag}")
    shutil.copytree(os.path.join(base, "proj"), proj)
    out_path = os.path.join(base, f"out-{tag}.jsonl")
    settings = os.path.join(base, f"settings-{cond}.json")
    with open(os.devnull) as devnull:
        try:
            extra = []
            if start == "continue":
                # Turn 1: an unrelated, injection-free turn establishes a
                # conversation; the probe prompt then arrives as turn 2, so the
                # directive lands mid-session rather than on a fresh context.
                sid = str(uuid.uuid4())
                warm = claude_cmd(WARM_PROMPT, model,
                                  os.path.join(base, "settings-A.json"),
                                  ["--session-id", sid])
                warm[warm.index("--max-turns") + 1] = "3"
                subprocess.run(warm, cwd=proj, stdin=devnull,
                               stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
                               timeout=RUN_TIMEOUT)
                extra = ["--resume", sid]
            cmd = claude_cmd(sc["prompt"], model, settings, extra)
            with open(out_path, "w") as out:
                subprocess.run(cmd, cwd=proj, stdin=devnull, stdout=out,
                               stderr=subprocess.DEVNULL, timeout=RUN_TIMEOUT)
        except subprocess.TimeoutExpired:
            pass
    shutil.rmtree(proj, ignore_errors=True)  # transcripts live in base/
    return tag


def tools_of(path):
    tools = []
    try:
        lines = open(path).read().splitlines()
    except OSError:
        return tools
    for line in lines:
        try:
            v = json.loads(line)
        except ValueError:
            continue
        if v.get("type") == "assistant":
            for b in v.get("message", {}).get("content", []):
                if b.get("type") == "tool_use":
                    tools.append((b["name"], json.dumps(b.get("input", {}))))
    return tools


def score(root, scenarios):
    all_skills = {s["skill"] for s in SCENARIOS} | {n for n, _ in FILLER_SKILLS}
    print(f"\n{'cell':<26} {'run':<4} {'target?':<8} {'first':<7} {'wrong-skill':<16} sequence")
    totals = {}
    for sc, model, fillers, start in itertools.product(scenarios, MODELS, FILLERS, STARTS):
        base = os.path.join(root, f"{sc['id']}-f{fillers}")
        for cond in CONDS:
            for i in range(1, N + 1):
                tag = f"{model}-f{fillers}-{start}-{cond}-{i}"
                tools = tools_of(os.path.join(base, f"out-{tag}.jsonl"))
                target = any(n == "Skill" and sc["skill"] in d for n, d in tools)
                # ski's own observe hook counts a Read of the SKILL.md as a use
                # (`Read|Skill` matcher) — score that as "consumed" so a model
                # that reads the file instead of invoking is not a false miss.
                read_md = any(n == "Read" and f"{sc['skill']}/SKILL.md" in d for n, d in tools)
                consumed = target or read_md
                wrong = sorted({s for n, d in tools if n == "Skill"
                                for s in all_skills - {sc["skill"]} if s in d})
                first = tools[0][0] if tools else "-"
                first_is_target = bool(tools) and tools[0][0] == "Skill" and sc["skill"] in tools[0][1]
                key = (model, fillers, start, cond)
                t = totals.setdefault(key, [0, 0, 0, 0])
                t[0] += target
                t[1] += first_is_target
                t[2] += 1
                t[3] += consumed
                seq = "->".join(n for n, _ in tools[:6])
                print(f"{sc['id'] + ' ' + tag:<26} {i:<4} {str(target):<8} {first:<7} "
                      f"{','.join(wrong) or '-':<16} {seq}")
    print(f"\n{'model':<8} {'menu':<6} {'start':<10} {'cond':<5} {'invoked':<9} {'consumed':<10} first-action")
    for (model, fillers, start, cond), (inv, first, tot, cons) in sorted(totals.items()):
        print(f"{model:<8} {4 + fillers:<6} {start:<10} {cond:<5} {inv}/{tot:<7} {cons}/{tot:<8} {first}/{tot}")


def main():
    only = os.environ.get("SCENARIOS")
    scenarios = SCENARIOS if not only else [s for s in SCENARIOS if s["id"] in only.split(",")]
    if not scenarios:
        sys.exit(f"no scenarios match SCENARIOS={only}")
    root = tempfile.mkdtemp(prefix="ski-probe.")
    jobs = list(itertools.product(scenarios, MODELS, FILLERS, STARTS, CONDS, range(1, N + 1)))
    print(f"probe dir: {root}")
    print(f"models={MODELS} fillers={FILLERS} starts={STARTS} "
          f"scenarios={[s['id'] for s in scenarios]} conds={CONDS} N={N} -> {len(jobs)} cells")
    for sc, fillers in {(s["id"], f) for s in scenarios for f in FILLERS}:
        setup_fixture(root, next(s for s in scenarios if s["id"] == sc), fillers)
    done = 0
    with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as ex:
        futs = [ex.submit(run_one, root, sc, m, f, st, c, i)
                for sc, m, f, st, c, i in jobs]
        for fut in concurrent.futures.as_completed(futs):
            done += 1
            print(f"  [{done}/{len(jobs)}] {fut.result()} done", flush=True)
    score(root, scenarios)
    print(f"\ntranscripts kept in {root}")


if __name__ == "__main__":
    main()