skill-inject 0.9.0

skill-inject: local semantic auto-injection of agent skills
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
# skill-inject — Plan

> Working name: **skill-inject** (binary `ski`). Name is provisional — see Open Questions.

Local-first, model-agnostic **automatic skill injection** for Claude Code and
opencode. A hook embeds the user prompt locally, ranks it against skill
descriptions, and — when a skill is relevant — either **tells the model to load it**
(`directive` mode, v1 default) or **injects the `SKILL.md` body directly**
(`body` mode), with **no cloud call for the decision**. We also track which skills
the **model loaded on its own** so we never re-inject a choice it already made. The
model still chooses which *files* a skill points to; we only guarantee the skill
itself is considered when relevant.

---

## 1. Goal / Non-goals

### Goal
- Make relevant skills appear in context **automatically**, driven by a **local**
  semantic match (no API call to Claude/Haiku for the decision).
- **Respect the model's own choices:** track skills the model already loaded
  (native chooser / `use_skill`) and never re-inject them (§4.8).
- Work for both Claude Code (strong native chooser, used as fallback/refinement)
  and opencode with **local models** (weak chooser — primary motivation).
- One ranking engine, two thin adapters. Single offline binary.
- rtk-style setup: `ski init -g` / `ski init -g opencode`.

### Non-goals
- Not a vector DB / RAG-over-codebase tool. N(skills) is tiny (10s–100s); cosine
  over an in-memory array beats running qdrant **or chroma** — both add a
  Python/server dependency that breaks the single-binary, offline design. If we ever
  outgrow in-memory (1000s+ skills), step to an **embedded Rust ANN**
  (`usearch` / `hnsw_rs`): in-process, no server — still not a DB. (See [[2-background]].)
- Not replacing progressive disclosure — augmenting it for weak choosers.
- Not authoring skills, not editing skill files.

---

## 2. Background — what exists, and the gap {#2-background}

| Project | Decision | Injection | Gap vs. us |
|---|---|---|---|
| Claude Code skills (native) | model reads always-present `name`+`description`, decides | model self-loads `SKILL.md` | reliable ~20% w/o nudge; needs the model to decide |
| `joshuadavidthomas/opencode-agent-skills` | local semantic *similarity* | injects `<available-skills>` + **prompt encouraging** agent to call `use_skill` | still routes through a **tool call + model agency** — not automatic content injection |
| `jefflester/claude-skills-supercharged` | **Haiku** intent analysis + keyword fallback | `UserPromptSubmit` injects skill, once/conversation, 1h intent cache | decision uses a **cloud model**; we want it fully local |

**Our differentiator:** local embedding decision **+ direct content injection** (no
`use_skill`, no model agency in the loop). Borrow the good parts: supercharged's
inject-once tracking and intent cache; opencode-agent-skills' SDK injection
plumbing.

---

## 3. Architecture

```
                       ┌─────────────────────────────┐
   user prompt ───────▶│  ADAPTER (per host)         │
                       │   • Claude: UserPromptSubmit │
                       │     hook  → binary (stdin)   │
                       │   • opencode: TS plugin →     │
                       │     binary (spawn)           │
                       └──────────────┬──────────────┘
                                      │ JSON {prompt, session_id, cwd, host}
                       ┌─────────────────────────────┐
                       │  ski core (Rust, 1 binary)   │
                       │                              │
                       │  1. load INDEX (skill vecs)  │◀── INDEX cache (persistent,
                       │  2. embed(prompt)  fastembed │      invalidated by mtime/hash)
                       │  3. hybrid score:            │
                       │       cosine + keyword boost │
                       │  4. threshold + top-K        │
                       │  5. session dedup            │◀── SESSION state (per session_id)
                       │  6. read SKILL.md bodies     │
                       │  7. emit injection JSON      │──▶ DECISION cache (TTL, optional)
                       └──────────────┬──────────────┘
                                      │ JSON {inject: "<text>", skills:[...]}
                       ADAPTER injects as context (additionalContext / synthetic msg)
```

**Why Rust single binary:** matches rtk; `fastembed-rs` (ONNX) gives all-MiniLM
embeddings with no Python/venv and no running service. The same binary is the hook,
the indexer, and the installer. opencode's TS plugin just spawns it.

---

## 4. Components

### 4.1 Core binary `ski`
Subcommands:
- `ski init [-g] [opencode]` — install/configure host(s), download model, build index. (§9)
- `ski index [--rebuild]` — discover skills, embed descriptions, write index. Incremental by file hash. **Auto-runs on every SessionStart** (incremental, hash-gated → cheap) since skills drift over time.
- `ski hook --host <claude|opencode>` — read hook event JSON on **stdin**, write injection JSON on **stdout**. The hot path.
- `ski observe --host <claude|opencode>` — record skills the **model** loaded itself (§4.8) into session state, so we don't re-inject them.
- `ski session-start --host <claude|opencode>` — incremental reindex + re-arm session state (clear `loaded` on compaction).
- `ski why <prompt>` — debug: print ranked skills + scores (no injection). Tuning aid.
- `ski stats` — optional later: injection counts / hit rate (rtk-`gain` vibe).

### 4.2 Skill discovery
Scan roots, parse `SKILL.md` YAML frontmatter (`name`, `description`, optional
`keywords`/`aliases`). Roots:
- Claude: `~/.claude/skills/`, plugin skills `~/.claude/plugins/**/skills/`, project `.claude/skills/`.
- opencode: equivalent skill dirs + project.
- Honor a config `extra_roots`.

### 4.3 Embedding
`fastembed-rs` (ONNX via `ort`; optional CUDA exec-provider on GPU boxes).
**Default: `bge-small-en-v1.5`** (384-dim, stronger recall — sensible default given
GPU / ample RAM, e.g. the Strix Halo box). **Lite alt: `all-MiniLM-L6-v2-q`**
(`model = "..."` or `ski init --lite`) for low-RAM / CPU-only machines. Both 384-dim;
the index is **model-tagged**, so switching model forces a full reindex.

bge is **asymmetric**: prefix the *query* (prompt) with bge's retrieval instruction
(`"Represent this sentence for searching relevant passages: "`) but embed skill
descriptions **without** prefix. (MiniLM is symmetric — no prefix either side.)

Model downloaded on `init` → `~/.local/share/ski/models/`, then fully offline.
Optional `--features embed-model` `include_bytes!` build for a zero-download binary.

### 4.4 Index store
`~/.local/share/ski/index.json` (+ optional `vectors.bin` of packed f32):
```jsonc
{
  "model": "all-MiniLM-L6-v2-q",
  "dim": 384,
  "skills": [
    {
      "id": "git-tools/git-attribution",
      "name": "git-attribution",
      "description": "Kernel-style AI commit attribution ...",
      "path": "/home/b/.claude/plugins/.../SKILL.md",
      "host": "claude",
      "keywords": ["commit", "attribution", "assisted-by"],
      "hash": "sha256:...",            // SKILL.md content hash → cache invalidation
      "embedding": [0.013, -0.21, ...] // 384 f32 of the *description*
    }
  ]
}
```
Reindex only entries whose `hash` changed. (Index = the **big cache**; computed once.)

### 4.5 Decision pipeline (the hot path)
1. Parse stdin event → `{prompt, session_id, cwd, host}`.
2. Load index; if any skill-file mtime newer than index, incremental reindex.
3. `q = embed(prompt)`.
4. **Hybrid score** per skill:
   `score = cosine(q, skill.embedding) + Σ keyword_boost(prompt, skill.keywords)`
   Exact keyword/alias substring → fixed boost (or `force` flag → always inject).
5. Keep `score ≥ min_similarity`, sort desc, cap `max_skills`, respect `char_budget`.
6. **Session dedup:** drop skills already in this session's state — whether **we**
   injected them or the **model loaded them itself** (§4.8). Cleared on compaction.
7. Read surviving `SKILL.md` bodies → build injection text per `inject_mode` & `directive_strength`.
8. Append injected ids to session state; write decision cache.
9. Emit injection JSON.

### 4.6 Session state
`~/.local/state/ski/sessions/<session_id>.json`:
```json
{ "loaded": { "git-tools/git-attribution": "ski", "python-dev/uv-setup": "model" },
  "updated": "2026-06-14T12:00:00Z" }
```
Dedup skips **any** id in `loaded`, regardless of source. `ski` = we injected it;
`model` = the model loaded it on its own (§4.8) — we must not re-inject a decision
the model already made. **Re-arm on compaction** (Claude `SessionStart` matcher
`compact`; opencode `session.compacted`): clear `loaded` so skills can be
re-injected into the fresh summary.

### 4.8 Detect model-loaded skills
The model may pull a skill itself (Claude's native chooser; opencode `use_skill`).
Record those so we don't double-inject:
- **Claude:** `PostToolUse` on `Read`/`Skill` → if the path matches `**/SKILL.md`,
  derive the skill id and mark `loaded[id] = "model"`.
- **opencode:** observe the skill-load / tool-exec event → same.

`ski observe` does this. Especially important in `directive` mode: once the model
acts on our directive and reads the skill, we stop re-prompting it.

### 4.7 Caches (two layers — keep them distinct)
- **Index cache** (skill-description embeddings): persistent, invalidated by file
  hash. *Mandatory, big win.*
- **Decision cache** (`hash(prompt + index_version)` → skill ids, TTL ~1h):
  *optional.* Locally the embed is already ~ms, so this mainly skips re-IO on
  identical re-submits. Mirrors supercharged's intent cache but cheaper to justify.
- **Prompt cache (Anthropic):** we do **not** set `cache-control`. `additionalContext`
  appends *after* the cached system/skills prefix, so injection is cache-safe by
  construction. Re-injecting the same body every turn would bloat context → that is
  exactly what §4.6 session dedup prevents. (opencode local models: no prompt cache;
  same anti-bloat reasoning.)

---

## 5. Config

`~/.config/ski/config.toml` (project `.ski.toml` overrides):
```toml
model            = "bge-small-en-v1.5"  # default; "all-MiniLM-L6-v2-q" = lite (low-RAM/CPU). custom ONNX = later
min_similarity   = 0.35       # cosine threshold — TUNE with `ski why`
max_skills       = 2
char_budget      = 6000       # cap total injected chars
inject_mode      = "directive" # v1 default: tell model to load (keep agency). "body" = inject SKILL.md
directive_strength = "auto"   # auto | soft | hard  (see §6)
decision_cache_ttl = "1h"
extra_roots      = []
deny  = []                    # skill ids never auto-injected
force = []                    # skill ids always injected when keyword hit
```

---

## 6. Per-host / per-model intensity

| Host | Default `inject_mode` | `directive_strength` | Rationale |
|---|---|---|---|
| Claude Code | `directive` | `soft` ("A relevant skill exists below — load it if applicable.") | strong native chooser; a nudge is enough |
| opencode + local model | `directive` | `hard` ("You MUST load and follow the skill below for this task.") | weak chooser → explicit |

`directive_strength = "auto"` resolves from `--host` + a `local_model` config flag.
**v1 ships `directive` everywhere** (keep model agency, observe follow-through);
flip to `body` — globally or per-skill — once eval shows local models ignore the
directive. `body` is the eventual auto-inject target, not the v1 default.

---

## 7. Code sketches

### 7.1 Core: rank (Rust)
```rust
use fastembed::{TextEmbedding, InitOptions, EmbeddingModel};

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let (mut dot, mut na, mut nb) = (0.0, 0.0, 0.0);
    for i in 0..a.len() { dot += a[i]*b[i]; na += a[i]*a[i]; nb += b[i]*b[i]; }
    dot / (na.sqrt() * nb.sqrt() + 1e-8)
}

fn rank(prompt: &str, index: &Index, cfg: &Config) -> Vec<Hit> {
    let model = TextEmbedding::try_new(
        InitOptions::new(cfg.embedding_model())   // bge-small-en-v1.5 (default) | MiniLM-q (lite)
            .with_cache_dir(cfg.model_dir())
    ).unwrap();
    // bge is asymmetric: prefix the query, NOT the skill descriptions (§4.3)
    let q = &model.embed(vec![format!("{}{}", cfg.query_prefix(), prompt)], None).unwrap()[0];

    let mut hits: Vec<Hit> = index.skills.iter().map(|s| {
        let kw = s.keywords.iter()
            .filter(|k| prompt.to_lowercase().contains(&k.to_lowercase()))
            .count() as f32 * cfg.keyword_boost;
        Hit { id: s.id.clone(), score: cosine(q, &s.embedding) + kw }
    }).collect();

    hits.retain(|h| h.score >= cfg.min_similarity || cfg.force.contains(&h.id));
    hits.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap());
    hits.truncate(cfg.max_skills);
    hits
}
```

### 7.2 Core: hook IO (`ski hook`)
```jsonc
// stdin (normalized by adapter)
{ "host": "claude", "session_id": "abc", "cwd": "/repo", "prompt": "fix the commit attribution" }

// stdout
{ "skills": ["git-tools/git-attribution"],
  "inject": "<skill name=\"git-attribution\">\n...SKILL.md body...\n</skill>" }
```

### 7.3 Claude adapter — `plugins/skill-inject/hooks/hooks.json`
```json
{
  "hooks": {
    "UserPromptSubmit": [
      { "hooks": [ { "type": "command",
        "command": "ski hook --host claude" } ] }
    ],
    "PostToolUse": [
      { "matcher": "Read|Skill",
        "hooks": [ { "type": "command",
          "command": "ski observe --host claude" } ] }
    ],
    "SessionStart": [
      { "matcher": "startup|resume|compact",
        "hooks": [ { "type": "command",
          "command": "ski session-start --host claude" } ] }
    ]
  }
}
```
`ski session-start` = incremental reindex + re-arm; `ski observe` records skills the
model loaded itself (§4.8). The `UserPromptSubmit` wrapper emits Claude's contract:
```json
{ "hookSpecificOutput": {
    "hookEventName": "UserPromptSubmit",
    "additionalContext": "<skill ...>...</skill>" } }
```
> Packaging: ship a tiny `scripts/ski-bootstrap.sh` (rtk-style) that resolves the
> `ski` binary (PATH → `~/.local/bin``cargo install` hint) so the plugin works
> even before a global `init`.

### 7.4 opencode adapter — TS plugin
```ts
import type { Plugin } from "@opencode-ai/plugin"

export const SkillInject: Plugin = async ({ $, client, directory }) => {
  return {
    // confirm exact event name against opencode 1.17.x (see Open Questions)
    "chat.message": async ({ message }, _out) => {
      if (message.role !== "user") return
      const payload = JSON.stringify({
        host: "opencode", session_id: message.sessionID,
        cwd: directory, prompt: message.text,
      })
      const res = await $`ski hook --host opencode`.stdin(payload).text()
      const { inject } = JSON.parse(res)
      if (inject) {
        // inject as synthetic context message (same plumbing as opencode-agent-skills)
        await client.session.message.create({
          sessionID: message.sessionID,
          parts: [{ type: "text", text: inject, synthetic: true }],
        })
      }
    },
  }
}
```
> Reference `joshuadavidthomas/opencode-agent-skills` for the exact injection call;
> opencode's "inject AI-visible message" support is maturing (issue #17412).
> Also wire `session.start → ski session-start` (reindex / re-arm) and the
> skill-load event `→ ski observe` (§4.8) to mirror the Claude adapter.

---

## 8. Guardrails
- `min_similarity` threshold + `max_skills` cap + `char_budget` → no context flooding.
- **Relative margin gate** (`score_margin`): drop any skill more than `score_margin`
  below the single best-scoring skill, measured against the global top **before**
  session dedup. Suppresses the weak tail (noisy embedders inject only near-peers)
  and makes re-submitting a handled prompt fall silent instead of scraping lower
  matches once the strong ones are deduped.
- `deny` list; `force` list for must-haves on keyword hit.
- Hybrid (embedding **+** keyword) so exact tool/command names aren't missed by
  embeddings alone.
- Session dedup → never re-inject a skill already loaded (by us **or the model**, §4.8)
  within a session; the only re-arm is compaction.
- Fail-open: any error in `ski hook` → emit empty injection, never block the prompt.

---

## 9. Init (rtk-style)
`ski init -g`:
1. Ensure binary on PATH (or copy to `~/.local/bin`).
2. Download embedding model → `~/.local/share/ski/models/`.
3. Run `ski index` (and re-run on every SessionStart thereafter — §4.1).
4. **Claude:** merge `UserPromptSubmit` + `PostToolUse` (observe model loads) +
   `SessionStart` (reindex & re-arm) hooks into `~/.claude/settings.json`
   (idempotent, like `install-plugins.sh`).

`ski init -g opencode`:
1–3 same.
4. Add plugin entry to `~/.config/opencode/opencode.json` `plugin[]` + the
   `@opencode-ai/plugin` dep; drop the TS adapter file.

Idempotent; `--auto-patch` for non-interactive. In **this repo**, also shipped as a
marketplace plugin (`plugins/skill-inject/`) whose `hooks.json` points at `ski`.

---

## 10. Milestones
1. **DONE — Core rank + golden tests.** Rust `ski` crate: skill discovery +
   `SKILL.md` frontmatter parse + embedding index + hybrid rank (cosine + keyword),
   exposed as `ski index` / `ski why`. `Embedder` trait with an offline
   bag-of-words backend (default; zero network/model — so build/test/CI run
   anywhere) and a `fastembed` backend (bge-small / MiniLM) behind a cargo feature.
   Golden tests assert `prompt → skill` on the real repo skills (`"…credit Claude in
   this commit"``git-attribution`, `"bootstrap a new python project with uv"`   `uv-setup`, etc.); verified top-1 on all 56 installed skills. fmt + clippy
   (`-D warnings`) clean; pre-commit hooks wired at repo root. `ski why` is the
   tuning harness. (`fastembed` feature compiles but isn't built in the offline lane
   — verify + tune `min_similarity` against bge before milestone 2 ships injection.)
2.**DONE — Hook path.** `ski hook --host <claude|opencode>`: stdin event →
   load-or-build index → embed → hybrid rank → `select` (deny + threshold +
   relative margin gate + force + session dedup + `max_skills` cap) →
   `inject::build` (directive/body, host-aware
   strength, `char_budget`) → host contract on stdout (Claude `additionalContext` /
   opencode `{skills,inject}`). Per-session dedup ledger at
   `$XDG_STATE_HOME/ski/sessions/<id>.json` (`loaded: {id: "ski"|"model"}`); a skill
   is never re-injected in a session. Fails open everywhere (empty injection, exit 0).
   Decision cache deferred to milestone 6 (low local payoff). fmt + clippy + tests
   green. *(min_similarity + score_margin still tuned for bow — re-tune for bge.)*
3. **Claude adapter** — hooks.json + bootstrap + `additionalContext`. End-to-end on Claude Code.
4. **opencode adapter** — TS plugin spawning binary; confirm injection event/API.
5. **`ski init`** — global install for both hosts; package as marketplace plugin here.
6. **Polish**`ski stats`, per-model intensity, compaction re-arm.

---

## 11. Decisions & open questions

**Resolved**
- **Name = `ski`** (product `skill-inject`).
- **Embedding model = `bge-small-en-v1.5`** default — stronger recall for GPU/RAM-rich boxes; `all-MiniLM-L6-v2-q` lite fallback for constrained machines (§4.3). Other local models / custom ONNX = later.
- **`inject_mode` default = `directive`** — keep model agency for v1; revisit `body` after measuring local-model follow-through (§6).
- **Model bundling = download-on-init** — cache to `~/.local/share/ski/models/`; offline `include_bytes!` build stays a feature flag.
- **Reindex on every SessionStart** — skills drift; incremental + hash-gated so it's cheap (§4.1).
- **Track model-loaded skills** — yes; never re-inject the model's own choices (§4.8).

**Open**
1. **Prompt context for embedding** — prompt-only (lean v1) vs prompt + last assistant turn (better recall, noisier).
2. **opencode hook event** — exact event name + injection API in opencode 1.17.x (verify vs `opencode-agent-skills`).
3. **Decision cache** — defer to milestone 6 (low local payoff).