rho-cli 0.1.22

Rho CLI tools for encrypted agent collaboration, dataset publishing, controlled runs, and result release workflows
# Rho Architecture Analysis

## What Rho Is

Rho is a multi-agent harness for running LLM agents against private data without exposing that data to the agents directly. It coordinates two (or more) agents across trust boundaries, using a shared filesystem as the transport layer, sandbox isolation via Gondolin (a microVM runtime), and explicit human-in-the-loop approval gates for all protected operations.

The system is designed for a specific use case: **Agent2 writes code against mock data; Agent1 (the data owner) reviews that code and has it executed against real private data through an approval-gated, host-controlled path; and only approved aggregate results are released back to Agent2.**

---

## System Components

### Entry Points (Shell Scripts)

| Script | Purpose |
|--------|---------|
| `rho` | Main CLI dispatcher. Routes to `rho run` (sandbox), `rho request` (approval), `rho approve`, or `rho agent-run` (declarative agent steps). Can also run Pi (the underlying LLM agent) directly for a given user and prompt. |
| `rho-chat` | Opens an interactive Pi session for a given user. |
| `rho-dataset` | Creates twin dataset bundles: a mock variant for sharing and a real variant kept private. Linked by UUID. |
| `rho-reset` | Wipes per-user state. |
| `rho-telegram` | Telegram bot bridge for human-in-the-loop approval (Rust binary in `telegram-bridge/`). |
| `rho-gondolin` | Runs a Gondolin sandbox configuration. |
| `publish` | Copies a shareable dataset bundle from a user's local staging area into the shared folder. Only the mock variant is published. |

### Core Python Scripts

| Script | Purpose |
|--------|---------|
| `scripts/rho-agent-run.py` | Declarative agent step runner. Takes a YAML config specifying user, prompt, extensions, sandbox settings, and fixture mode. Can run on host, in Gondolin sandbox, or replay from recorded fixtures. |
| `scripts/rho-controlled-action-relay.py` | Host-side relay for controlled actions. Validates action payloads from a sandboxed agent's outbox, routes approval through middleware (auto/telegram/fixture), executes the protected action deterministically on the host, and writes status back to the agent's inbox. |
| `scripts/rho-request.py` | Request lifecycle management. Creates, inspects, and resolves approval state for data access requests. Writes approval manifests and notification messages to agent inboxes. |

### TypeScript Extension (Pi Plugin)

| File | Purpose |
|------|---------|
| `extensions/rho-controlled-action-bridge.ts` | Pi extension that exposes two tools to sandboxed agents: `request_controlled_action` (writes a validated intent file to `/control/outbox/`) and `get_controlled_action_status` (reads host-written status from `/control/inbox/`). Performs strict input validation on both sides. |

### Infrastructure

| Component | Purpose |
|-----------|---------|
| `repos/pi/` | The underlying LLM agent runtime (Pi). Provides tool-use, sessions, extensions, and model access. |
| `repos/gondolin/` | MicroVM sandbox runtime. Provides filesystem isolation via mounts, network control with default-deny and host allowlists, and HTTP audit logging. |
| `telegram-bridge/` | Rust-based Telegram bot for human approval interactions. Supports inline approve/deny buttons and command-based approval. |

---

## How It Works: End-to-End Flow

### Phase 1: Data Staging

1. Data owner (Agent1) creates a **twin dataset** via `rho-dataset`: a mock CSV for sharing and a real CSV kept private, linked by UUID.
2. Agent1 **publishes** only the mock variant into a shared folder using `publish`.
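
The staging phase can be sketched in Python as follows. This is a minimal illustration, not the actual `rho-dataset` or `publish` implementation: the directory layout (`<root>/<variant>/<uuid>/`), the function names, and the parameters are all assumptions made for the example. It shows the two properties the text describes: both variants share one UUID and one header, and only the mock variant is ever copied into the shared folder.

```python
import csv
import shutil
import uuid
from pathlib import Path


def create_twin_dataset(root: Path, name: str, header, real_rows, mock_rows):
    """Write a mock and a real CSV that share one UUID (hypothetical layout)."""
    dataset_id = str(uuid.uuid4())
    paths = {}
    for variant, rows in (("mock", mock_rows), ("real", real_rows)):
        bundle = root / variant / dataset_id
        bundle.mkdir(parents=True, exist_ok=True)
        target = bundle / f"{name}.csv"
        with target.open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)  # identical header keeps the two schemas aligned
            writer.writerows(rows)
        paths[variant] = target
    return dataset_id, paths


def publish_mock(paths: dict, shared_dir: Path, dataset_id: str) -> Path:
    """Copy only the mock variant into the shared folder; the real file is never touched."""
    target_dir = shared_dir / dataset_id
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(paths["mock"], target_dir))
```

Keeping the publish step unaware of the real variant's path is one simple way to make accidental publication of private data harder.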

### Phase 2: Code Development

3. Agent2 (the compute agent) reads the shared mock dataset and its schema.
4. Agent2 writes analysis code (e.g., `sum_prices.py`) against the mock data.
5. Agent2 tests the code locally against mock data.

### Phase 3: Controlled Execution Request

6. Agent2 creates a **run request** specifying: dataset UUID, code paths, command to execute, and requested tier (`real`).
7. The request is written as a YAML manifest into the shared `.rho/requests/` directory.
8. An inbox message notifies Agent1.

### Phase 4: Review and Approval

9. Agent1 reviews the code and request.
10. The request is routed through **approval middleware** (auto-approve for tests, Telegram for live operation, fixture policy for CI).
11. Human operator approves or denies via Telegram buttons, `/approve`, or `/deny` commands.

### Phase 5: Controlled Execution

12. If approved, **host-side relay code** (not the LLM) executes the protected action deterministically:
    - For `run_real_data`: runs the script against the real dataset via `python3 <script> <input>`
    - For `release_results`: copies approved outputs into shared results directory
13. Execution happens on the host, outside the sandbox, with full access to private data.
14. Results are staged locally first, never written directly to shared space.
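
A rough sketch of the `run_real_data` step, assuming the relay shells out roughly as `python3 <script> <input>` and stages standard output in a local directory. The function name, the `result.txt` file name, and the timeout are hypothetical; the real relay may capture and stage outputs differently.

```python
import subprocess
import sys
from pathlib import Path


def run_real_data(script: Path, real_input: Path, staging_dir: Path,
                  timeout_s: int = 60) -> Path:
    """Run an approved script against the real dataset and stage its stdout locally."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    completed = subprocess.run(
        [sys.executable, str(script), str(real_input)],  # argv list, no shell involved
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,  # a non-zero exit raises CalledProcessError instead of staging output
    )
    staged = staging_dir / "result.txt"  # hypothetical file name
    staged.write_text(completed.stdout)
    return staged  # nothing is written to shared space at this point
```

Passing an argument list rather than a shell string means script and input paths are never interpreted by a shell, which keeps the attack surface in the realm of ordinary input validation.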

### Phase 6: Result Release

15. A second approval gate controls whether staged results can be released.
16. If approved, results and a release manifest are copied into the shared results directory.
17. Agent2 receives a notification message and reads the released output.

---

## Security Architecture in Research Context

### What Rho Gets Right

#### 1. Breaking the Lethal Trifecta (Willison) / Satisfying the Rule of Two (Meta)

The lethal trifecta becomes exploitable when three capabilities coincide in one agent: [A] untrusted input, [B] private data access, and [C] external communication.

Rho's architecture **structurally prevents any single LLM from holding all three simultaneously**:

- **Agent2 (compute agent)**: Has [A] untrusted input (it writes and reads code) and [C] external communication (it can message Agent1). But it **never sees private data [B]** — only mock data and released aggregates.
- **Agent1 (data owner)**: Has [B] private data access and [A] untrusted input (it reviews Agent2's code/requests). But its **protected actions are not LLM-executed** — host-side deterministic code performs the actual data operations, breaking [C] for the LLM.
- **Host relay**: Has [B] private data access and [C] can write to shared space. But it **interprets no untrusted natural-language input**: it executes only typed action payloads that have passed strict schema validation.

This is a **[BC] configuration with controlled [A]** in Meta's terminology — the strongest practical configuration for this use case.
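
The capability assignment above can be written down as a small check. This is only a bookkeeping sketch: the component names and the set encoding are illustrative assumptions, and the check says nothing about whether the real system enforces these assignments.

```python
# A: untrusted input, B: private data access, C: external communication
TRIFECTA = frozenset({"A", "B", "C"})

# Capability sets as described in the list above (names are illustrative).
COMPONENTS = {
    "agent2_llm": {"A", "C"},
    "agent1_llm": {"A", "B"},
    "host_relay": {"B", "C"},
}


def holds_trifecta(capabilities) -> bool:
    """True when a single component holds all three capabilities at once."""
    return TRIFECTA <= set(capabilities)


def violations(components) -> list[str]:
    """Names of components that would violate the Rule of Two."""
    return sorted(name for name, caps in components.items() if holds_trifecta(caps))
```

With the assignment above, `violations(COMPONENTS)` returns an empty list; granting any one component its missing capability makes that component appear in the result.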

#### 2. Deterministic Policy Enforcement (Google's Layer 1)

The controlled-action relay (`rho-controlled-action-relay.py`) functions as a **deterministic policy engine** operating outside the LLM's reasoning process, in line with Google's hybrid defense-in-depth recommendation:

- Actions are intercepted before execution and validated against strict schemas
- Only two action types are permitted: `run_real_data` and `release_results`
- All paths are validated against allowlisted prefixes (`/input/`, `/output/`, `/messages/`, etc.)
- Path traversal is blocked with segment-level validation
- The relay deterministically decides: allow, block, or require human confirmation
- The LLM cannot self-approve or escalate privileges
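
The checks listed above might look roughly like this. It is a sketch under assumptions: the payload field names (`action`, `paths`) are hypothetical, and the prefix list contains only the three prefixes the text names, since the full allowlist is not shown.

```python
ALLOWED_ACTIONS = {"run_real_data", "release_results"}
# The text lists these prefixes followed by "etc."; the full list is not shown.
ALLOWED_PREFIXES = ("/input/", "/output/", "/messages/")


def is_safe_path(raw) -> bool:
    """Accept only absolute paths under an allowlisted prefix with clean segments."""
    if not isinstance(raw, str) or not raw.startswith(ALLOWED_PREFIXES):
        return False
    segments = raw.split("/")[1:]  # skip the empty string before the leading slash
    # Reject rather than normalise: empty, ".", ".." and NUL-bearing segments all fail.
    return all(seg not in ("", ".", "..") and "\x00" not in seg for seg in segments)


def decide(payload: dict) -> str:
    """Return "block" or "require_confirmation"; the LLM cannot approve its own request."""
    action = payload.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return "block"
    paths = payload.get("paths")
    if not isinstance(paths, list) or not paths:
        return "block"
    if not all(is_safe_path(p) for p in paths):
        return "block"
    return "require_confirmation"
```

Rejecting suspicious segments outright, instead of normalising the path and re-checking it, avoids a class of bugs where normalisation and the later file access disagree about what a path means.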

#### 3. Control Flow Integrity (CaMeL Pattern)

Rho implements a variant of the **Code-Then-Execute** and **Dual LLM** patterns from the Design Patterns paper:

- The LLM (inside the sandbox) acts as the **planning/proposing layer** — it can write code, compose requests, and reason about schemas
- The **host-side relay** acts as the execution layer — it runs validated scripts deterministically against real data
- The sandboxed agent writes structured intent files to an outbox; the host reads, validates, and executes them
- This is analogous to CaMeL's separation of the Privileged LLM (planning) from the Quarantined LLM (data processing), with the added benefit that the execution layer is not an LLM at all — it's deterministic code
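
The outbox/inbox exchange can be illustrated with a short sketch. The real sandbox-side tools live in the TypeScript extension; this Python version only models the protocol shape. The file naming (`<action_id>.json`), the status fields, and the `handler` callback are assumptions made for the example.

```python
import json
import os
import re
from pathlib import Path

ACTION_ID = re.compile(r"[A-Za-z0-9_-]+")  # hypothetical ID format


def write_intent(outbox: Path, action_id: str, payload: dict) -> Path:
    """Sandbox side: publish an intent atomically so the host never sees a partial file."""
    if not ACTION_ID.fullmatch(action_id):
        raise ValueError("invalid action id")
    outbox.mkdir(parents=True, exist_ok=True)
    tmp = outbox / f".{action_id}.tmp"
    final = outbox / f"{action_id}.json"
    tmp.write_text(json.dumps(payload, sort_keys=True))
    os.replace(tmp, final)  # atomic rename within one directory
    return final


def process_outbox(outbox: Path, inbox: Path, handler) -> list[str]:
    """Host side: hand each new intent to deterministic code and write status back."""
    inbox.mkdir(parents=True, exist_ok=True)
    handled = []
    for intent_file in sorted(outbox.glob("*.json")):
        status_file = inbox / intent_file.name
        if status_file.exists():
            continue  # already answered; keeps the loop idempotent
        try:
            status = handler(json.loads(intent_file.read_text()))
        except ValueError as exc:  # includes malformed JSON
            status = {"state": "rejected", "error": str(exc)}
        status_file.write_text(json.dumps(status))
        handled.append(intent_file.stem)
    return handled


def read_status(inbox: Path, action_id: str):
    """Sandbox side: return the host-written status, or None if not yet available."""
    status_file = inbox / f"{action_id}.json"
    return json.loads(status_file.read_text()) if status_file.exists() else None
```

The sandboxed side only ever writes intents and reads statuses; every decision and every side effect happens in `handler`, which runs on the host.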

#### 4. Agent Observability (Google's Principle 3)

- Gondolin provides HTTP audit logging (domains, methods, URLs, headers, bodies)
- The controlled-action outbox/inbox provides a complete machine-readable trace of all protected operations
- Request/approval/release manifests create an append-only audit trail
- Fixture capture and replay enables offline forensic analysis

#### 5. Human Controllers (Google's Principle 1)

- Every protected action passes through explicit human approval (Telegram, CLI, or fixture policy)
- The approval middleware is transport-agnostic: Telegram can be swapped for Slack, a webhook, or a desktop UI
- Approvals bind to specific request IDs and action IDs, not blanket permissions
- The system is fail-closed: no approval = no execution
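
The last two points, ID-bound approvals and fail-closed behaviour, reduce to a small predicate. The `Approval` record and its field names are hypothetical; the sketch only shows the decision rule, not how approvals are collected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approval:
    request_id: str
    action_id: str
    decision: str  # "approve" or "deny"


def is_authorized(approvals, request_id: str, action_id: str) -> bool:
    """Fail closed: execute only if this exact request/action pair was approved."""
    matching = [
        a for a in approvals
        if (a.request_id, a.action_id) == (request_id, action_id)
    ]
    # No record means no execution; any non-approve record for the pair also blocks it.
    return bool(matching) and all(a.decision == "approve" for a in matching)
```

An approval for one action ID does not carry over to another action under the same request, which is what distinguishes this from a blanket permission.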

#### 6. Twin Dataset / Mock-First Development

This is a practical implementation of **data minimization for agent development**:
- Agents develop and test against structurally identical but non-sensitive mock data
- Real data is only touched inside controlled, audited, approved execution paths
- The mock/real linkage via UUID ensures schema compatibility without data exposure

### Where Rho Has Gaps

#### 1. Data Flow Manipulation Within the Sandbox

CaMeL's key insight is that even with control flow integrity, an attacker can manipulate the **data flow** — changing the arguments to tool calls without changing the plan. Rho's sandboxed agent writes controlled-action payloads, and while the host validates the schema, the **content values** (paths, summaries, reasons) are still LLM-generated.

A prompt injection inside the sandbox could potentially:
- Reference a different input path than the one the operator expects (if multiple valid paths exist under allowed prefixes)
- Craft misleading `summary` and `reason` fields to trick human approval
- Manipulate `template_vars` values within their validated constraints

Note that prompt injection in shared inter-agent messages (YAML files in `.rho/inbox/`) is **not a separate vulnerability** — the LLM reading those messages runs inside the same sandbox with no private data mounted. Even a fully compromised LLM can only write controlled-action payloads to `/control/outbox/`, which are validated by host-side code. The message content is untrusted input, but the sandbox is already the quarantine boundary (the Dual LLM pattern). The defense is the same: host-side validation of the outbox payload.

**Mitigation: Host-side action manifest binding.** The relay should cross-validate every controlled-action payload against a pre-authorized **expected action manifest** written by the scenario/request setup (outside the sandbox). This manifest specifies the exact allowed values for paths, identities, and action types for a given `request_id`. The relay rejects any payload that deviates. Additionally, code digests should be verified before execution to prevent time-of-check to time-of-use (TOCTOU) attacks, and the human approval display should be built from the host-side manifest rather than from LLM-generated summary/reason fields.
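
One way such a binding could look is sketched below. The bound field names (`dataset_uuid`, `script_path`, `input_path`, `script_sha256`) and the manifest shape are assumptions for illustration; they are not taken from the Rho codebase.

```python
import hashlib
from pathlib import Path

# Hypothetical set of payload fields that must match the manifest exactly.
BOUND_FIELDS = ("action", "dataset_uuid", "script_path", "input_path")


def sha256_file(path: Path) -> str:
    """Hex SHA-256 of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_against_manifest(payload: dict, expected: dict, script_file: Path) -> list[str]:
    """Return a list of deviations; an empty list means the payload matches the manifest."""
    problems = []
    for field in BOUND_FIELDS:
        # A field missing from the manifest counts as a deviation (fail closed).
        if field not in expected or payload.get(field) != expected[field]:
            problems.append(f"{field} deviates from the expected manifest")
    if sha256_file(script_file) != expected.get("script_sha256"):
        problems.append("script digest does not match the approved code")
    return problems
```

A digest check narrows the TOCTOU window only if the relay then executes the same bytes it hashed, for example by running a private copy made before hashing rather than the file in the shared location.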

#### 2. Human Approval Fatigue

Multiple papers (CaMeL, Google, Design Patterns) flag **user fatigue** as a critical weakness of approval-gated systems. If an operator routinely approves actions (especially in high-volume scenarios), they may rubber-stamp a malicious or manipulated request.

Rho's fixture-based auto-approve mode is explicitly designed for testing, but in production use with Telegram, the operator is the single point of failure. The system provides no:
- Automated anomaly detection on requests
- Comparison of current request against historical baselines
- Rate limiting on approval requests
- Progressive trust (auto-approve after N consistent approvals for the same code)

#### 3. Credential Isolation Is Incomplete

As documented in `docs/security.md`, the current `--auth-json-source` path exposes real credentials to the guest process. Gondolin's placeholder secret substitution can handle header-based auth but not the OAuth refresh token flow used by Pi's `openai-codex` provider (which sends the token in a POST body).

This means the sandboxed agent currently has access to credentials that it could, in principle, exfiltrate to any allowlisted host, which violates the principle of least privilege.

#### 4. No Formal Output Filtering

When results are released, there is no automated check that the output doesn't contain private data. The `release_results` action simply copies the file. The system relies entirely on:
- Human review at the release approval gate
- The assumption that aggregate outputs (e.g., a single sum) are safe

The red team document (`docs/redteam.md`) explicitly identifies this gap and plans for attack scenarios involving:
- Result laundering through human-readable summaries
- Differencing attacks across multiple approved queries
- Membership inference from aggregate statistics

The corresponding defenses are not yet implemented.
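
A worked example shows why a single aggregate is not automatically safe. The data below is made up, and `approved_sum` is a stand-in for any approved script that computes a sum over a caller-chosen subset of rows. Two individually harmless releases combine into one person's exact value:

```python
# Made-up private rows; in Rho these would live only in the real dataset.
prices = {"alice": 120, "bob": 75, "carol": 310}


def approved_sum(rows: dict, exclude=frozenset()) -> int:
    """Stand-in for an approved aggregate query over a subset of rows."""
    return sum(value for key, value in rows.items() if key not in exclude)


total = approved_sum(prices)                 # first release: 505
without_bob = approved_sum(prices, {"bob"})  # second release: 430
recovered = total - without_bob              # 75, which is bob's private value
```

Each release looks like an aggregate to a human reviewer, so catching this requires comparing a request against previously released results, not inspecting it in isolation.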

#### 5. Static Evaluation Limitations

Per "The Attacker Moves Second," any defense evaluated only against fixed scenarios provides a false sense of security. Rho's current test suite (`tests/test-two-console-demo-agent-approval.sh`) runs predetermined scenarios with known-good agents. There is no:
- Adversarial agent testing with adaptive attackers
- Automated red-teaming infrastructure
- Benchmark against established frameworks like AgentDojo

The red team document designs this capability but it is not yet built.

### What Is Novel

#### 1. Filesystem-as-Protocol with Explicit Trust Zones

Most agent security research focuses on in-process isolation (CaMeL's interpreter, Dual LLM patterns, guard classifiers). Rho takes a different approach: **the trust boundary is the filesystem mount boundary.**

- Private data lives in directories that are never mounted into agent sandboxes
- Shared state lives in a directory both agents can access
- Control channels (outbox/inbox) are mounted directories with strict path validation
- The sandbox (Gondolin microVM) enforces mount permissions at the hypervisor level, not the application level

This is closer to a **multi-party computation** model than a traditional agent security pattern. The filesystem protocol is inspectable, replayable, and auditable without any LLM infrastructure.

#### 2. Controlled Action Relay as a Non-LLM Execution Layer

Most frameworks discussed in the literature (CaMeL, Dual LLM, Design Patterns) still use LLMs somewhere in the execution path — even CaMeL uses a Quarantined LLM to extract structured data. Rho's controlled-action relay is **entirely deterministic**: `python3 <script> <input>`, file copy, manifest write. No LLM is involved in the protected execution path.

This means prompt injection against the relay is structurally impossible — it processes JSON, not natural language. The attack surface is traditional software security (input validation, path traversal), not LLM security.

#### 3. Twin Dataset Abstraction

The mock/real twin dataset pattern is not discussed in the surveyed literature but addresses a practical gap: **how do you let an untrusted agent develop useful code without ever seeing real data?** The UUID-linked twin ensures schema fidelity between development and production data while maintaining a clean data boundary.

#### 4. Transport-Agnostic Approval Middleware

The separation of approval logic from approval transport (Telegram, CLI, fixture, future Slack/webhook) is architecturally clean. The fixture adapter is particularly notable: it enables deterministic CI testing of the full controlled-action flow without any human or network dependency, while using the exact same relay code path as production.

#### 5. Integrated Red Team Framework Design

The `docs/redteam.md` document designs a structured adversarial testing harness specifically for the statistical inference attacks that aggregate-release systems are vulnerable to (differencing, averaging, membership inference). This goes beyond prompt injection testing to address **the information-theoretic leakage** inherent in releasing any derived output from private data — a concern that the surveyed papers largely ignore in favor of prompt injection alone.

---

## Mapping to Paper Recommendations

| Recommendation | Rho Status |
|---------------|------------|
| Never trust model-level defenses alone | **Met.** Host-side deterministic execution for all protected ops. |
| Apply the Rule of Two | **Met.** No LLM has all three of: untrusted input + private data + external communication. |
| Use application-specific agents over general-purpose | **Met.** Agents are scoped to specific roles (data-owner, compute-agent) with constrained tool surfaces. |
| Separate control flow from data flow | **Partial.** Control flow is separated (the host relay executes, not the LLM). Data flow within the sandbox is not tracked with CaMeL-style capabilities; the host-side action manifest binding described under the gaps above would add semantic validation of LLM-proposed values. |
| Implement deterministic policy enforcement | **Met.** Controlled-action relay validates schemas, paths, action types deterministically. |
| Require human confirmation for high-risk actions | **Met.** All protected actions require approval via middleware. |
| Apply least privilege dynamically | **Partial.** Sandbox mounts are static per-run configuration, not dynamically scoped to task context. |
| Log everything | **Met.** HTTP audit logs, controlled-action traces, request/approval manifests, fixture capture. |
| Constrain outputs from untrusted data processing | **Partial.** Controlled actions have typed schemas. The proposed host-side manifest binding would constrain LLM-generated content values to pre-authorized sets, and code digest verification would prevent TOCTOU attacks. |
| Sanitize rendered output | **Not applicable** (no browser rendering). Shared messages are untrusted input but read inside a sandboxed LLM with no private data — the sandbox is the quarantine boundary. |
| Use adaptive attacks, not static benchmarks | **Planned** (redteam.md) but **not yet implemented.** |
| Include human red-teaming | **Planned** but **not yet implemented.** |

---

## Summary

Rho is a practical system that enforces the core security principles identified across the surveyed literature — primarily by keeping LLMs out of the protected execution path entirely and using filesystem-level isolation rather than in-process language model sandboxing. Its main architectural strength is that it treats the LLM as an untrusted proposer rather than a trusted executor, which sidesteps most prompt injection attack vectors by design.

Its main gaps are in the areas that the literature also identifies as unsolved: human approval fatigue, output privacy guarantees, and adversarial evaluation. The proposed host-side action manifest binding would address semantic data flow validation by cross-checking LLM-proposed actions against pre-authorized manifests. The red team framework design shows awareness of the remaining gaps, but the implementation is still in progress.

The twin-dataset and filesystem-as-protocol patterns are practical innovations not directly addressed in the surveyed papers, and the fully deterministic (non-LLM) execution layer for protected actions is a stronger guarantee than most proposed frameworks achieve.