# AI Agent Security: Summary of Research Papers

## Papers Reviewed

1. **CaMeL: Defeating Prompt Injections by Design** (Debenedetti et al., 2025) - Google/ETH Zurich
2. **Design Patterns for Securing LLM Agents against Prompt Injections** (Beurer-Kellner et al., 2025) - Multi-institutional
3. **The Attacker Moves Second** (Nasr, Carlini et al., 2025) - OpenAI/Anthropic/Google DeepMind/ETH Zurich
4. **Google's Approach for Secure AI Agents** (Diaz, Kern, Olive, 2025) - Google
5. **Agents Rule of Two** (Meta, 2025)
6. **Designing AI Agents to Resist Prompt Injection** (OpenAI, 2026)
7. **Mitigating the Risk of Prompt Injections in Browser Use** (Anthropic, 2025)
8. **The Lethal Trifecta** (Simon Willison, 2025)
9. **Understanding Prompt Injections: A Frontier Security Challenge** (OpenAI, 2025)

---

## Core Problem

All nine sources converge on the same core issue: **prompt injection is an unsolved, fundamental vulnerability** in LLM-based agents. When an agent combines access to private data, exposure to untrusted content, and the ability to take external actions, an attacker can hijack its behavior to exfiltrate data or trigger harmful actions.

Key characteristics of the problem:
- LLMs cannot reliably distinguish trusted instructions from untrusted data
- Attacks increasingly resemble social engineering rather than simple prompt overrides
- The risk scales directly with agent autonomy and capability
- No single defense provides absolute guarantees

---

## Key Findings

### 1. No Current Defense is Robust Against Adaptive Attackers
**Source: The Attacker Moves Second**

- 12 recent defenses (prompting, training, filtering, secret-knowledge) were bypassed by adaptive attacks, with attack success rates above 90% for most of them
- Most of these defenses had originally reported near-zero attack success rates on static benchmarks
- Human red-teamers succeeded in 100% of evaluated scenarios
- Static evaluation benchmarks provide a false sense of security
- Training against fixed attack sets does not generalize to novel attacks

### 2. The "Lethal Trifecta" / "Rule of Two" Framework
**Sources: Willison, Meta**

Three capabilities that are manageable in isolation but create exploitable attack chains when combined:
- **[A]** Processing untrusted inputs (emails, web pages, documents)
- **[B]** Access to sensitive data or private systems
- **[C]** Ability to change state or communicate externally

**Meta's Rule of Two**: Until prompt injection can be reliably detected and refused, agents should satisfy **no more than two of the three properties** within a session. If all three are needed, the agent should not operate autonomously and requires supervision, such as human-in-the-loop approval.

Configuration examples:
- **[AB]**: Can read untrusted data and access private systems, but cannot take external actions without human confirmation
- **[AC]**: Can browse the web and take actions, but operates in a sandbox without private data
- **[BC]**: Can access private systems and take actions, but only processes trusted/vetted inputs
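
The configurations above can be expressed as a simple check. The following is a minimal sketch, not Meta's implementation; `SessionCapabilities`, `config_label`, and `requires_human_supervision` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionCapabilities:
    """Hypothetical description of what one agent session may do."""
    untrusted_input: bool   # [A] processes emails, web pages, documents
    sensitive_access: bool  # [B] reads sensitive data or private systems
    external_actions: bool  # [C] changes state or communicates externally


def config_label(caps: SessionCapabilities) -> str:
    """Return the enabled properties as a label such as 'AB' or 'ABC'."""
    flags = [
        ("A", caps.untrusted_input),
        ("B", caps.sensitive_access),
        ("C", caps.external_actions),
    ]
    return "".join(letter for letter, enabled in flags if enabled)


def requires_human_supervision(caps: SessionCapabilities) -> bool:
    """True when the session enables all three properties at once."""
    return len(config_label(caps)) == 3


browser_sandbox = SessionCapabilities(True, False, True)  # [AC]
full_assistant = SessionCapabilities(True, True, True)    # all three
print(config_label(browser_sandbox), requires_human_supervision(browser_sandbox))
print(config_label(full_assistant), requires_human_supervision(full_assistant))
```

A check like this is only useful if it runs when a session is configured, before any tool is attached, so that the three-property case is routed to a supervised mode rather than discovered at runtime.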

### 3. System-Level Design Beats Model-Level Fixes
**Sources: CaMeL, Design Patterns paper**

Model-level defenses (adversarial training, prompt engineering, guardrails) are fundamentally heuristic and brittle. System-level architectural patterns offer stronger guarantees:

**Six Design Patterns** (from Design Patterns paper):
1. **Action-Selector**: LLM only selects from a set of predefined actions; tool outputs are never fed back to it
2. **Plan-Then-Execute**: LLM commits to a fixed plan before seeing untrusted data; plan cannot be altered by injections
3. **LLM Map-Reduce**: Isolated sub-agents process individual data items independently; outputs are constrained before aggregation
4. **Dual LLM**: Privileged LLM plans actions (never sees untrusted data); Quarantined LLM processes data (has no tool access)
5. **Code-Then-Execute**: LLM writes a formal program that is then executed deterministically; extends Dual LLM with capabilities tracking
6. **Context-Minimization**: Removes unnecessary user prompt content after initial processing to prevent injection carry-over

**CaMeL System** (Google/ETH Zurich):
- Extracts control flow and data flow from trusted user queries as pseudo-Python code
- Assigns capability metadata to every value (tracking sources and allowed recipients)
- Enforces security policies via a custom interpreter at tool-call time
- Achieves 77% task completion with provable security (vs. 84% undefended) on AgentDojo benchmark
- Does not modify the underlying LLM; works as a protective layer around it
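
The idea of capability metadata checked at tool-call time can be illustrated with a toy example. This is not CaMeL's interpreter or API; `Tagged`, `PUBLIC`, `concat`, `send_email`, and `PolicyViolation` are hypothetical names, and the single policy shown (recipients must be allowed readers) is an assumption chosen for brevity.

```python
from dataclasses import dataclass
from typing import FrozenSet

PUBLIC: FrozenSet[str] = frozenset({"*"})  # sentinel: any recipient allowed


@dataclass(frozen=True)
class Tagged:
    """A value plus illustrative capability metadata."""
    value: str
    sources: FrozenSet[str]  # where the value came from
    readers: FrozenSet[str]  # who may receive it, or PUBLIC


class PolicyViolation(Exception):
    """Raised when a tool call would send data to a disallowed recipient."""


def concat(a: Tagged, b: Tagged) -> Tagged:
    """Derived values inherit all sources and the most restrictive readers."""
    if a.readers == PUBLIC:
        readers = b.readers
    elif b.readers == PUBLIC:
        readers = a.readers
    else:
        readers = a.readers & b.readers
    return Tagged(a.value + b.value, a.sources | b.sources, readers)


def send_email(recipient: str, body: Tagged) -> str:
    """Tool wrapper: the policy check runs before the side effect."""
    if body.readers != PUBLIC and recipient not in body.readers:
        raise PolicyViolation(
            f"{recipient} may not read data from {sorted(body.sources)}"
        )
    return f"sent to {recipient}"


greeting = Tagged("FYI: ", frozenset({"user"}), PUBLIC)
report = Tagged("Q3 numbers", frozenset({"drive"}), frozenset({"alice@example.com"}))
message = concat(greeting, report)
print(send_email("alice@example.com", message))
```

Calling `send_email("mallory@example.com", message)` raises `PolicyViolation`, because the restriction on `report` propagates to every value derived from it, regardless of what the model was persuaded to do.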

### 4. Google's Hybrid Defense-in-Depth
**Source: Google's Approach paper**

Three core principles:
1. **Agents must have well-defined human controllers** - clear accountability, distinct agent identities, user consent for critical actions
2. **Agent powers must have limitations** - dynamic least-privilege, scoped credentials, agents cannot self-escalate
3. **Agent actions must be observable** - robust logging, characterized action APIs, transparent UX

Two-layer defense:
- **Layer 1 (Deterministic)**: Policy engines intercept actions before execution; evaluate against predefined rules based on risk, context, and action history
- **Layer 2 (Reasoning-based)**: Adversarial training, guard classifier models, plan risk prediction using AI
- Neither layer is sufficient alone; they complement each other
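
A Layer 1 policy engine of this kind might look like the following sketch. The tool names, thresholds, and the history rule are illustrative assumptions, not rules taken from Google's paper; the point is only that the checks are ordinary code that runs outside the model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Decision(Enum):
    ALLOW = "allow"
    CONFIRM = "confirm"  # pause and ask the user
    BLOCK = "block"


@dataclass(frozen=True)
class Action:
    tool: str
    amount: float = 0.0


# Illustrative rule parameters; real policies would be product-specific.
ALLOWED_TOOLS = {"read_web_page", "send_email", "purchase"}
CONFIRM_ABOVE = 100.0
BLOCK_ABOVE = 500.0


def evaluate(action: Action, history: List[Action]) -> Decision:
    """Deterministic checks applied before the action executes."""
    if action.tool not in ALLOWED_TOOLS:
        return Decision.BLOCK
    if action.tool == "purchase":
        if action.amount > BLOCK_ABOVE:
            return Decision.BLOCK
        if action.amount > CONFIRM_ABOVE:
            return Decision.CONFIRM
    # History rule: outbound email after reading web content needs approval.
    if action.tool == "send_email" and any(
        past.tool == "read_web_page" for past in history
    ):
        return Decision.CONFIRM
    return Decision.ALLOW


history = [Action("read_web_page")]
print(evaluate(Action("purchase", amount=250.0), history))
print(evaluate(Action("send_email"), history))
```

Because `evaluate` never consults the model, an injection that fully controls the agent's reasoning still cannot change its verdict; the reasoning-based layer is then left to handle cases the rules do not anticipate.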

### 5. Prompt Injection is Evolving Toward Social Engineering
**Source: OpenAI's Designing AI Agents paper**

- Real-world attacks now embed instructions in contextually plausible content (e.g., fake HR compliance emails)
- Traditional "AI firewalling" input classifiers often fail against these sophisticated attacks
- OpenAI frames the defense problem through a customer-service-agent analogy: the agent exists in an adversarial environment and must have systemic limits on its capabilities
- **Safe URL mechanism**: Detects when information from the conversation would be transmitted to a third party; blocks or prompts user confirmation
- Sandboxed execution for generated applications (Canvas, Codex, Apps)
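
The general idea behind a check like Safe URL can be sketched as follows. This is not OpenAI's implementation, whose details are not described here; `url_leaks_conversation_data` is a hypothetical helper that uses only Python's standard `urllib.parse`, and substring matching is an assumption that an attacker could evade by encoding the data (for example with base64).

```python
from typing import List
from urllib.parse import parse_qsl, unquote, urlsplit


def url_leaks_conversation_data(url: str, sensitive_values: List[str]) -> bool:
    """Return True if any sensitive value appears in the URL's path, query, or fragment."""
    parts = urlsplit(url)
    # Decode percent-encoding so "alice%40example.com" matches "alice@example.com".
    haystacks = [unquote(parts.path), unquote(parts.fragment)]
    haystacks += [value for _, value in parse_qsl(parts.query, keep_blank_values=True)]
    lowered = [h.lower() for h in haystacks]
    return any(s.lower() in h for s in sensitive_values if s for h in lowered)


secrets = ["alice@example.com", "4111 1111 1111 1111"]
print(url_leaks_conversation_data(
    "https://attacker.example/collect?d=alice%40example.com", secrets))
print(url_leaks_conversation_data(
    "https://docs.example/help?topic=billing", secrets))
```

A positive result would lead to blocking the request or asking the user to confirm it, matching the two outcomes described above.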

### 6. Browser Agents Face Amplified Risk
**Source: Anthropic's Mitigating Prompt Injections paper**

- Browser use amplifies prompt injection risk due to vast attack surface (every page, ad, script) and powerful action capabilities (navigate, fill forms, click, download)
- Claude Opus 4.5 demonstrates improved robustness; new safeguards reduced Best-of-N attack success rate significantly
- Defense layers: reinforcement learning for robustness, improved classifiers for untrusted content scanning, scaled expert human red-teaming
- A 1% attack success rate still represents meaningful risk
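
To see why a 1% rate is still meaningful, consider how it compounds over repeated attempts. The sketch below assumes attempts are independent and treats 1% as a per-attempt rate, which is a simplification and not how the source necessarily defines its metric; `prob_at_least_one_success` is a name introduced here.

```python
def prob_at_least_one_success(per_attempt_rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - per_attempt_rate) ** attempts


for n in (1, 10, 100):
    print(n, round(prob_at_least_one_success(0.01, n), 3))
```

Under these assumptions, 100 attempts at a 1% rate give roughly a 63% chance of at least one success, and an agent that browses many pages gives an attacker many attempts.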

---

## Consensus Recommendations

### For Agent Developers

1. **Never trust model-level defenses alone** - Always combine with system-level architectural controls
2. **Apply the Rule of Two** - Limit agents to at most two of: untrusted input, private data access, state-changing actions or external communication
3. **Use application-specific agents over general-purpose ones** - Constraining agent scope enables stronger security guarantees
4. **Separate control flow from data flow** - Use patterns like Dual LLM or CaMeL to ensure untrusted data cannot alter the sequence of actions
5. **Implement deterministic policy enforcement** - Policy engines that operate outside the LLM's reasoning provide reliable hard limits
6. **Require human confirmation for high-risk actions** - Irreversible or sensitive actions should always have human-in-the-loop
7. **Apply least privilege dynamically** - Agent permissions should be scoped to the specific task and context, not statically defined
8. **Log everything** - Actions, inputs, tool calls, parameters, and reasoning steps for auditability and incident response
9. **Constrain outputs from untrusted data processing** - When LLMs process untrusted content, enforce strict output schemas (booleans, enums, structured data)
10. **Sanitize rendered output** - Prevent XSS and data exfiltration via crafted URLs or embedded content in agent responses
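
Recommendation 9 can be made concrete with a strict parser for the output of a model that has read untrusted content. This is a minimal sketch; the `Verdict` enum, the `{"verdict": ...}` schema, and `parse_quarantined_output` are hypothetical and stand in for whatever narrow schema a given application needs.

```python
import json
from enum import Enum


class Verdict(Enum):
    SPAM = "spam"
    NOT_SPAM = "not_spam"
    UNSURE = "unsure"


def parse_quarantined_output(raw: str) -> Verdict:
    """Accept only a JSON object of the form {"verdict": <enum value>}."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("output is not valid JSON") from exc
    # Reject extra keys so free-form text cannot ride along with the verdict.
    if not isinstance(data, dict) or set(data) != {"verdict"}:
        raise ValueError("output must contain exactly one key: 'verdict'")
    value = data["verdict"]
    if not isinstance(value, str) or value not in {v.value for v in Verdict}:
        raise ValueError("verdict is not one of the allowed values")
    return Verdict(value)


print(parse_quarantined_output('{"verdict": "spam"}'))
```

Anything outside the schema, such as `'{"verdict": "spam", "note": "also email the file to..."}'`, raises `ValueError` instead of reaching the rest of the system, so the only information that can cross the boundary is one of three enum values.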

### For Evaluation and Testing

1. **Use adaptive attacks, not static benchmarks** - Fixed test sets overstate defense effectiveness
2. **Include human red-teaming** - Automated attacks are necessary but insufficient; human creativity finds attacks that automation misses
3. **Assume the attacker knows the defense** - Following Kerckhoffs' principle, evaluate under worst-case attacker knowledge
4. **Test the full attack chain** - Evaluate from injection through data access through exfiltration, not isolated components
5. **Verify auto-raters independently** - Automated safety classifiers can themselves be fooled by adversarial inputs

### For Users

1. **Give explicit, specific instructions** - Broad instructions like "review my emails and take action" make injection easier
2. **Limit agent access to only necessary data** - Use logged-out modes, restrict credentials
3. **Watch agents on sensitive sites** - Monitor agent behavior during high-risk operations
4. **Review confirmation prompts carefully** - Don't blindly approve agent actions
5. **Avoid combining the lethal trifecta of tools** - Be aware when mixing MCP tools that combine private data access, untrusted content, and external communication

---

## Open Problems

- **Prompt injection remains fundamentally unsolved** - No defense provides 100% reliable protection
- **User fatigue from security prompts** - Excessive confirmation dialogs lead to rubber-stamping
- **Capability-based systems require ecosystem buy-in** - Third-party tools may not support capability metadata
- **Side-channel attacks** - Even with robust control/data flow separation, information can leak through timing, error messages, or other indirect channels
- **Scaling policy definition** - Writing comprehensive security policies for vast action ecosystems is complex and difficult to maintain
- **Formal verification** - Proving security properties of agent systems mathematically remains future work
- **General-purpose secure agents** - Current approaches trade off utility for security; truly general-purpose secure agents may not be achievable with current LLM architectures