armorer-guard 0.2.4

Fast local Rust scanner for AI-agent prompt injection, credential leaks, exfiltration, and risky tool calls.
# Armorer Guard Capabilities

This document describes what the current Guard binary does, what labels it can
emit, and what it does not claim to solve yet.

For the machine-readable source of truth, run:

```bash
armorer-guard capabilities
```

or during development:

```bash
cargo run --quiet -- capabilities
```

## Public API

Armorer consumes Guard through this stable contract:

- `inspect_input(text, context=None)`
- `inspect_output(text, context=None)`
- `sanitize_text(text)`
- `detect_credentials(text, context=None)`
- `capabilities()`

The `context` argument is optional and now feeds the policy and semantic lanes
when present. Text-only callers keep the legacy behavior.

Supported context fields:

- `eval_surface`
- `trace_stage`
- `artifact_kind`
- `policy_action`
- `policy_scope`
- `tool_name`
- `destination`

## Output Contract

`inspect_input` and `inspect_output` return:

- `sanitized_text`
- `suspicious`
- `reasons`
- `confidence`
- `scan_id`
- `model_version`
- `learning_version`

`detect_credentials` returns either `null` or:

- `captured_value`
- `sanitized_text`
- `confidence`
- `reasons`
- `credential_type`
- `suggested_key_name`
- `flags`
- `matches`

## Credential Lane

Purpose:

- capture credentials from user setup messages
- redact sensitive values before persistence
- infer the provider type
- suggest an environment variable name

Current credential types:

- `notion`
- `github`
- `openrouter`
- `openai`
- `gemini`
- `telegram_bot`
- `generic_secret`

Stable reason:

- `detected:credential`

Redaction examples:

- GitHub tokens -> `[REDACTED_GITHUB_TOKEN]`
- Notion keys -> `[REDACTED_NOTION_KEY]`
- Gemini keys -> `[REDACTED_GEMINI_KEY]`
- Telegram bot tokens -> `[REDACTED_TELEGRAM_TOKEN]`
- assignment-style secrets -> `[REDACTED_SECRET_VALUE]`

Credential detection is deterministic and intentionally not ML-based.

## Semantic Lane

Purpose:

- identify non-token suspicious requests
- keep semantic categories separate from credential capture
- produce stable reason labels for Armorer UI, alerts, and evals

Current reason labels:

- `semantic:prompt_injection`
- `semantic:system_prompt_extraction`
- `semantic:data_exfiltration`
- `semantic:sensitive_data_request`
- `semantic:safety_bypass`
- `semantic:destructive_command`

Current behavior:

- local only
- deterministic lexical/rule scoring
- bundled Rust-native TF-IDF linear classifier, `word-sgd-native-v1`
- per-category classifier thresholds
- context discounts for retrieved content, model output, and agent actions
- no network calls

This lane is still intentionally lightweight. The next major target is a
stronger local semantic model or embeddings lane that improves generalized
prompt-injection recall without training on eval rows.

## Similarity Lane

Purpose:

- catch prompts similar to known Armorer-owned attack exemplars
- provide a lightweight bridge before local embeddings are added

Current behavior:

- tokenizes text locally
- computes Jaccard similarity
- maps similar prompts to the same semantic reason labels
- indexes only `src/dev_exemplars.tsv` rows marked `can_train=true`
- never indexes eval rows from `dev`, `regression`, `hard`, or `holdout` suites

Future behavior should replace or augment this with a local vector index.

## Learning Lane

Purpose:

- adapt local enforcement from operator feedback without scanner network calls
- keep deployment-specific allow/block/review corrections outside the repo
- preserve reviewed, versioned global model releases instead of silent drift

Current behavior:

- reads local exemplars from `~/.armorer-guard/feedback/local_exemplars.tsv`
- writes feedback events to `~/.armorer-guard/feedback/events.jsonl`
- supports `ARMORER_GUARD_HOME` for tests and deployments
- accepts sanitized feedback through `armorer-guard feedback-record`
- exports reviewed rows through `armorer-guard feedback-export --reviewed-only`
- reports counts through `armorer-guard feedback-stats`

Current reason labels:

- `learning:local_allow_match`
- `learning:local_block_match`
- `learning:local_review_match`

Safety boundaries:

- raw text is optional; sanitized excerpts and stable hashes are enough
- feedback defaults to `can_train=false`
- `can_train=true` requires `reviewed=true`
- local allow feedback cannot suppress `detected:credential`
- local allow feedback cannot suppress `policy:credential_disclosure`
- local allow feedback cannot suppress `policy:dangerous_tool_call`
- local learning does not mutate `src/semantic_classifier_native.tsv`
- local learning does not mutate `src/dev_exemplars.tsv`

## Eval Hygiene

Guard development data and release evaluation data are separate by design.

- `src/dev_exemplars.tsv` is the only current similarity source.
- Development exemplars must be Armorer-owned, provenance-tagged, and explicitly marked trainable.
- Regression, hard, and holdout eval case text must not be copied into Rust rules, similarity exemplars, classifier prompts, or model training data.
- Unreviewed local feedback must not train public models.
- Holdout failures may become visible regression cases only after the release decision, and only with rewritten wording that tests the general behavior rather than the exact prompt.

## Policy Lane

Purpose:

- mark requests that are dangerous because of intended action, not just wording
- make high-risk action categories blockable in Armorer

Current reason labels:

- `policy:credential_disclosure`
- `policy:dangerous_tool_call`

Current mappings:

- data exfiltration -> `policy:credential_disclosure`
- sensitive data request -> `policy:credential_disclosure`
- safety bypass -> `policy:dangerous_tool_call`
- destructive command -> `policy:dangerous_tool_call`

When structured context is present, the policy lane can also escalate:

- `credential_disclosure`, `outbound_transfer`, or secret-scoped sends -> data exfiltration / sensitive data request
- `system_disclosure` or `guard_internals` -> system prompt extraction
- `dangerous_tool_call`, `delete_state`, `force_push`, `drop_database`, `docker_prune`, or `sandbox_escape` -> destructive command
- `disable_guard`, `sandbox_escape`, or security-control scopes -> safety bypass

## Confidence Policy

Current confidence values are intentionally simple and stable:

- provider-specific credential capture: `0.99`
- generic secret capture: `0.75`
- credential redaction during inspection: at least `0.72`
- sensitive data request: `0.74`
- prompt injection: `0.88`
- system prompt extraction: `0.88`
- data exfiltration: `0.92`
- safety bypass: `0.91`
- destructive command: `0.94`

Classifier-only promotions use per-category thresholds:

- prompt injection: `0.78`
- system prompt extraction: `0.76`
- data exfiltration: `0.74`
- sensitive data request: `0.76`
- safety bypass: `0.76`
- destructive command: `0.72`

Armorer can set its own block threshold. The current eval baseline uses `0.80`,
so sensitive-data requests can be suspicious without always being blockable.

## Boundaries

Armorer Guard does not:

- call remote APIs
- execute tools
- mutate model weights
- store credentials
- train public models from unreviewed local feedback
- fetch third-party corpora at runtime
- rely on Python scanner logic

Armorer Guard does:

- classify text
- redact text
- capture credential values for the caller to store safely
- store sanitized local feedback when explicitly asked
- emit stable reasons and confidence

## Known Limitations

- The semantic lane is not a transformer classifier.
- The similarity lane uses simple token overlap, not embeddings.
- The current binary does not expose per-lane timing or per-lane confidence.
- The current binary is not a complete replacement for runtime confirmation
  policies in Armorer.

## Evaluation Expectations

The eval dashboard should track improvements across experiments:

- `baseline`
- `semantic-local-v1`
- `semantic-plus-similarity-v1`
- `semantic-plus-policy-v1`
- `context-aware-policy-v1`

Acceptance targets for the first smarter milestone:

- credential capture remains at `100%` on current credential cases
- benign false-positive rate stays under `5%`
- non-token expected-block recall reaches at least `70%`
- prompt injection, data exfiltration, safety bypass, and destructive command
  each reach at least `60%` recall