Expand description
Prompt-injection defense substrate (defense Layers 0/1).
Three concerns live here:
- Content provenance / taint — a per-result
TaintRecordtags output that crossed a trust boundary (an external MCP server, or aFetch-kind tool reaching the open internet). The agent loop records these on the session ledger so the dispatch gate can apply the “lethal trifecta” rule (untrusted content in context + a tool that can leak it outward => require confirmation). - Spotlighting —
spotlight_wrapframes untrusted observations in delimiters (and, inSecurityMode::Strict, datamarks every line) plus a provenance banner, so the model treats the span as data rather than instructions. (Microsoft “spotlighting”, arXiv 2403.14720.) - Classification —
is_exfil_capable/is_destructive/is_secret_pathread the existing tool taxonomy so the gate knows which tools can carry tainted context outward or read secrets. - Injection detection (Layer 2) — an
InjectionClassifierscores untrusted content; the built-inHeuristicClassifieris always available and dependency-free, and a downloadable neural model (harn-guard) can override it viaregister_injection_classifierwithout the default binary ever linking a model runtime. A flagged score is recorded on theTaintRecordand tightens the trifecta gate.
The active SecurityPolicy is a thread-local stack mirroring
crate::redact; embedders override it per run via the security_policy
builtin (Harn std/security::configure). The default is spotlight-on, so
untrusted content is always framed even when nothing is configured. The
trifecta gate only fires where an interactive approval policy is installed,
so non-interactive embedders (headless evals) are unaffected by it.
Re-exports§
pub use exfil_precision::args_target_endpoints;pub use exfil_precision::destination_is_untrusted_originated;pub use exfil_precision::extract_endpoints;pub use exfil_precision::precise_exfil_gate_fires;pub use file_provenance::command_string;pub use file_provenance::path_arguments;pub use file_provenance::FileProvenanceLedger;pub use provenance::classify_directive_trust;pub use provenance::DirectiveProvenance;
Modules§
- battery
- ASR (attack-success-rate) battery for the prompt-injection substrate.
- behavioral
- Behavioral ASR (attack-success-rate) tier for the prompt-injection substrate.
- exfil_
precision - Destination-provenance precision for the lethal-trifecta exfil gate.
- file_
provenance - Untrusted-origin file provenance (taint-on-write, distrust-on-read).
- provenance
- Origin-authenticated cross-agent directives (Phase 3 — capability enforcement).
- stance_
judge - Semantic stance tier for the behavioral ASR probe.
Structs§
- Detector
Verdict - A prompt-injection detector’s verdict on a span of content (Layer 2).
- Heuristic
Classifier - Built-in, dependency-free injection heuristic. Precision-first: it favors
strong, rarely-benign markers (instruction-override phrasing, concealment
directives, hidden/bidi unicode) so a flagged verdict is a meaningful signal
even though recall is limited. The downloadable
harn-guardneural model supersedes it for better recall. - Security
Policy - Resolved, runtime-readable security policy. Derived from
SecurityConfig; the default is spotlight-on. - Taint
Record - One entry in a session’s taint ledger: untrusted content from
originentered the model’s context.
Enums§
- Trust
Level - Trust level attached to a unit of content entering the transcript.
Constants§
- RESERVED_
SPECIAL_ TOKENS - Reserved chat-template / role special tokens that must never survive framing
of untrusted content as live tokens: rendered into the chat template they can
re-open a turn or inject a system message (ChatBug / ChatInject / MetaBreak).
neutralize_special_tokensrewrites each one inside every untrusted span; thebatteryspecial-token corpus is drawn from the same set.
Traits§
- Injection
Classifier - A prompt-injection classifier over a span of (untrusted) text, returning a
malicious-probability in
[0, 1].
Functions§
- active_
classifier - The active classifier: the registered neural backend when present, else the built-in heuristic. Always returns something — detection never silently becomes a no-op once enabled.
- args_
reference_ secret - Whether any string anywhere in a tool’s arguments references a secret / credential path. Used to gate secret reads while context is tainted.
- classify_
injection - Score
textwith the active classifier and build aDetectorVerdict, marking it flagged when the score meetsthreshold_percent. - classify_
result_ trust - Classify a dispatched tool result’s content trust from its executor
provenance and tool kind. Returns
Nonefor first-party/trusted content (no taint recorded). Explicitly-trusted MCP servers are skipped. - clear_
policy_ stack - Drop all installed policies. Used by tests and by
reset_thread_state. - content_
labels - Cheap, deterministic content signals attached to a
TaintRecord. These double as a weak first-pass injection heuristic. - current_
policy - The currently installed policy, falling back to
SecurityPolicy::default(spotlight-on) when the stack is empty. Always an owned clone. - destyle_
untrusted - Disrupt forged assistant/reasoning STYLE inside an untrusted span without
changing meaning: line-leading role labels (
User:/Assistant:/System:) and<think>reasoning tags can no longer read as a real turn or a real chain-of-thought. This is the paper’s strongest single fix — destyling the forged reasoning collapses CoT-forgery ASR (~61%→10%, arXiv:2603.12277) — kept as conservative defense-in-depth under the sentinel frame so benign content is untouched. Idempotent. - ensure_
neural_ classifier - Ensure a neural classifier is registered for
selector, loading it via the installed loader on first use. Idempotent and cheap once resolved: returns immediately when a classifier is already registered, when no loader is installed (the default binary), or whenselectoris empty. Returns whether a neural backend is now active. A loader that returnsNone(model not installed, failed to load) leaves the heuristic in place. - is_
agent_ channel - Whether a tool returns another agent’s output over a delegation / A2A
channel, declared by pipeline annotations carrying an
agent_channelcapability. Such a result is a cross-trust-boundary ingress: the peer agent is not part of this agent’s trusted context and may have been poisoned by content it ingested, so its output is untrusted DATA, never authority. - is_
destructive - Whether a tool irreversibly removes or relocates content.
- is_
exfil_ capable - Whether a tool can carry tainted context outward (network egress, fetch, or desktop control). Desktop control is an egress surface in two ways the GUI-agent security literature flags: a returned screenshot exfiltrates whatever is on screen to the model, and synthetic keyboard/mouse input can drive any application (paste into a URL bar, an upload dialog, a chat box) to send data outward. So the trifecta gate treats it like network egress: once untrusted content is in context, a desktop-control action is a potential exfiltration channel and is gated accordingly.
- is_
secret_ path - Whether a path looks like a credential / secret store, used to gate secret reads while context is tainted. Conservative, well-known locations only.
- mutates_
workspace - Whether a tool mutates workspace files (write/patch/edit). The detection-expanded trifecta axis gates these when in-context untrusted content has been flagged as a likely injection.
- neutralize_
special_ tokens - Neutralize every reserved special token inside an untrusted span. String-level
containment: the reserved sequence no longer appears as a literal substring, so
it cannot hijack turn segmentation once the surrounding transcript is rendered
to a chat template. Idempotent (the neutralized form contains no reserved
token) and surgical — only the exact reserved sequences are rewritten, so
content that merely resembles a token (a lone
<,|, or[) is untouched. - pin_
and_ detect_ change - Pin
tool_name’s schemahashforserverand report whether it changed from a previously pinned value (a rug-pull signal). The first sighting establishes the trust-on-first-use baseline and returnsfalse. - pop_
policy - Pop the most recently pushed policy. Safe to call on an empty stack.
- push_
policy - Push a policy onto the thread-local stack. Pair with
pop_policy. - register_
injection_ classifier - Install a process-global injection classifier (e.g. the
harn-guardneural backend). Only the first registration wins; returnsfalseif one was already installed. Dependency-free by design: the default binary never calls this, so it never links a model runtime. - register_
security_ builtins - Register the
security_policy(config: dict) -> dictbuiltin. Embedders (the host, orstd/security::configure) call it to push a resolved policy from their[security]config / feature flag. - reset_
thread_ state - Drop all per-thread security state (policy stack + MCP schema pins). Called
by
reset_thread_local_stateso test runs sharing a thread cannot leak overrides or pins into each other. - set_
injection_ classifier_ loader - Install the lazy neural-classifier loader. First install wins; returns
falseif one was already installed. - spotlight_
wrap - Frame an untrusted observation so the model treats it as data, not instructions.
- tool_
schema_ hash - Hash a tool’s identity-bearing fields (name + description + input schema). The digest is what the rug-pull defense pins and compares.
Type Aliases§
- Injection
Classifier Loader - A lazy loader that materializes a neural classifier from a model selector
(a
harn guardcatalog name or model directory). Installed by a host built with the guard inference backend;harn-vmcalls it the first time alocal-mlpolicy actually scores untrusted content, so the (heavy) model is loaded on demand, never at startup.