Module security

Expand description

Prompt-injection defense substrate (defense Layers 0/1).

Three concerns live here:

Content provenance / taint — a per-result TaintRecord tags output that crossed a trust boundary (an external MCP server, or a Fetch-kind tool reaching the open internet). The agent loop records these on the session ledger so the dispatch gate can apply the “lethal trifecta” rule (untrusted content in context + a tool that can leak it outward => require confirmation).
Spotlighting — spotlight_wrap frames untrusted observations in delimiters (and, in SecurityMode::Strict, datamarks every line) plus a provenance banner, so the model treats the span as data rather than instructions. (Microsoft “spotlighting”, arXiv 2403.14720.)
Classification — is_exfil_capable / is_destructive / is_secret_path read the existing tool taxonomy so the gate knows which tools can carry tainted context outward or read secrets.
Injection detection (Layer 2) — an InjectionClassifier scores untrusted content; the built-in HeuristicClassifier is always available and dependency-free, and a downloadable neural model (harn-guard) can override it via register_injection_classifier without the default binary ever linking a model runtime. A flagged score is recorded on the TaintRecord and tightens the trifecta gate.

The active SecurityPolicy is a thread-local stack mirroring crate::redact; embedders override it per run via the security_policy builtin (Harn std/security::configure). The default is spotlight-on, so untrusted content is always framed even when nothing is configured. The trifecta gate only fires where an interactive approval policy is installed, so non-interactive embedders (headless evals) are unaffected by it.

Re-exports§

pub use exfil_precision::args_target_endpoints;
pub use exfil_precision::destination_is_untrusted_originated;
pub use exfil_precision::extract_endpoints;
pub use exfil_precision::precise_exfil_gate_fires;
pub use file_provenance::command_string;
pub use file_provenance::path_arguments;
pub use file_provenance::FileProvenanceLedger;
pub use provenance::classify_directive_trust;
pub use provenance::DirectiveProvenance;

Modules§

battery: ASR (attack-success-rate) battery for the prompt-injection substrate.
behavioral: Behavioral ASR (attack-success-rate) tier for the prompt-injection substrate.
exfil_precision: Destination-provenance precision for the lethal-trifecta exfil gate.
file_provenance: Untrusted-origin file provenance (taint-on-write, distrust-on-read).
provenance: Origin-authenticated cross-agent directives (Phase 3 — capability enforcement).
stance_judge: Semantic stance tier for the behavioral ASR probe.

Structs§

DetectorVerdict: A prompt-injection detector’s verdict on a span of content (Layer 2).
HeuristicClassifier: Built-in, dependency-free injection heuristic. Precision-first: it favors strong, rarely-benign markers (instruction-override phrasing, concealment directives, hidden/bidi unicode) so a flagged verdict is a meaningful signal even though recall is limited. The downloadable harn-guard neural model supersedes it for better recall.
SecurityPolicy: Resolved, runtime-readable security policy. Derived from SecurityConfig; the default is spotlight-on.
TaintRecord: One entry in a session’s taint ledger: untrusted content from origin entered the model’s context.

Enums§

TrustLevel: Trust level attached to a unit of content entering the transcript.

Constants§

RESERVED_SPECIAL_TOKENS: Reserved chat-template / role special tokens that must never survive framing of untrusted content as live tokens: rendered into the chat template they can re-open a turn or inject a system message (ChatBug / ChatInject / MetaBreak). neutralize_special_tokens rewrites each one inside every untrusted span; the battery special-token corpus is drawn from the same set.

Traits§

InjectionClassifier: A prompt-injection classifier over a span of (untrusted) text, returning a malicious-probability in [0, 1].

Functions§

active_classifier: The active classifier: the registered neural backend when present, else the built-in heuristic. Always returns something — detection never silently becomes a no-op once enabled.
args_reference_secret: Whether any string anywhere in a tool’s arguments references a secret / credential path. Used to gate secret reads while context is tainted.
classify_injection: Score text with the active classifier and build a DetectorVerdict, marking it flagged when the score meets threshold_percent.
classify_result_trust: Classify a dispatched tool result’s content trust from its executor provenance and tool kind. Returns None for first-party/trusted content (no taint recorded). Explicitly-trusted MCP servers are skipped.
clear_policy_stack: Drop all installed policies. Used by tests and by reset_thread_state.
content_labels: Cheap, deterministic content signals attached to a TaintRecord. These double as a weak first-pass injection heuristic.
current_policy: The currently installed policy, falling back to SecurityPolicy::default (spotlight-on) when the stack is empty. Always an owned clone.
destyle_untrusted: Disrupt forged assistant/reasoning STYLE inside an untrusted span without changing meaning: line-leading role labels (User: / Assistant: / System:) and <think> reasoning tags can no longer read as a real turn or a real chain-of-thought. This is the paper’s strongest single fix — destyling the forged reasoning collapses CoT-forgery ASR (~61%→10%, arXiv:2603.12277) — kept as conservative defense-in-depth under the sentinel frame so benign content is untouched. Idempotent.
ensure_neural_classifier: Ensure a neural classifier is registered for selector, loading it via the installed loader on first use. Idempotent and cheap once resolved: returns immediately when a classifier is already registered, when no loader is installed (the default binary), or when selector is empty. Returns whether a neural backend is now active. A loader that returns None (model not installed, failed to load) leaves the heuristic in place.
is_agent_channel: Whether a tool returns another agent’s output over a delegation / A2A channel, declared by pipeline annotations carrying an agent_channel capability. Such a result is a cross-trust-boundary ingress: the peer agent is not part of this agent’s trusted context and may have been poisoned by content it ingested, so its output is untrusted DATA, never authority.
is_destructive: Whether a tool irreversibly removes or relocates content.
is_exfil_capable: Whether a tool can carry tainted context outward (network egress, fetch, or desktop control). Desktop control is an egress surface in two ways the GUI-agent security literature flags: a returned screenshot exfiltrates whatever is on screen to the model, and synthetic keyboard/mouse input can drive any application (paste into a URL bar, an upload dialog, a chat box) to send data outward. So the trifecta gate treats it like network egress: once untrusted content is in context, a desktop-control action is a potential exfiltration channel and is gated accordingly.
is_secret_path: Whether a path looks like a credential / secret store, used to gate secret reads while context is tainted. Conservative, well-known locations only.
mutates_workspace: Whether a tool mutates workspace files (write/patch/edit). The detection-expanded trifecta axis gates these when in-context untrusted content has been flagged as a likely injection.
neutralize_special_tokens: Neutralize every reserved special token inside an untrusted span. String-level containment: the reserved sequence no longer appears as a literal substring, so it cannot hijack turn segmentation once the surrounding transcript is rendered to a chat template. Idempotent (the neutralized form contains no reserved token) and surgical — only the exact reserved sequences are rewritten, so content that merely resembles a token (a lone <, |, or [) is untouched.
pin_and_detect_change: Pin tool_name’s schema hash for server and report whether it changed from a previously pinned value (a rug-pull signal). The first sighting establishes the trust-on-first-use baseline and returns false.
pop_policy: Pop the most recently pushed policy. Safe to call on an empty stack.
push_policy: Push a policy onto the thread-local stack. Pair with pop_policy.
register_injection_classifier: Install a process-global injection classifier (e.g. the harn-guard neural backend). Only the first registration wins; returns false if one was already installed. Dependency-free by design: the default binary never calls this, so it never links a model runtime.
register_security_builtins: Register the security_policy(config: dict) -> dict builtin. Embedders (the host, or std/security::configure) call it to push a resolved policy from their [security] config / feature flag.
reset_thread_state: Drop all per-thread security state (policy stack + MCP schema pins). Called by reset_thread_local_state so test runs sharing a thread cannot leak overrides or pins into each other.
set_injection_classifier_loader: Install the lazy neural-classifier loader. First install wins; returns false if one was already installed.
spotlight_wrap: Frame an untrusted observation so the model treats it as data, not instructions.
tool_schema_hash: Hash a tool’s identity-bearing fields (name + description + input schema). The digest is what the rug-pull defense pins and compares.

Type Aliases§

InjectionClassifierLoader: A lazy loader that materializes a neural classifier from a model selector (a harn guard catalog name or model directory). Installed by a host built with the guard inference backend; harn-vm calls it the first time a local-ml policy actually scores untrusted content, so the (heavy) model is loaded on demand, never at startup.

Module security

Module security Copy item path

Re-exports§

Modules§

Structs§

Enums§

Constants§

Traits§

Functions§

Type Aliases§

Module security