cellos-supervisor 0.5.0

CellOS execution-cell runner — boots cells in Firecracker microVMs or gVisor, enforces narrow typed authority, emits signed CloudEvents.
# cellos-supervisor

The CellOS runner. Takes an `ExecutionCellDocument`, brings up a cell on a
host backend, enforces network/identity/secret policy, and emits a typed
CloudEvent for every step it takes.

## What it is

`cellos-supervisor` is primarily a binary (`src/main.rs`) — the composition
root that wires a host backend, a secret broker, an event sink, and one or
more export sinks into a single [`Supervisor`] (`src/supervisor.rs:38`)
that owns the cell lifecycle:

```
network_scope → trust_plane_observability (optional) → secrets →
lifecycle.started → run → export → destroy → revoke → lifecycle.destroyed
```

It sits across L4–L6 of the layer model: above the L1 ports defined by
`cellos-core`, above the L2/L3 host-backend abstractions (`cellos-host-*`),
sinks (`cellos-sink-*`), brokers (`cellos-broker-*`), and exports
(`cellos-export-*`), and below `cellos-server` and `cellos-cortex` which
treat the supervisor as an opaque emitter of CloudEvents. With ~15,000 lines
of Rust across two dozen modules, this is the largest crate in the
workspace. This README is a map, not a manual — for any module you care
about, the linked source is the source of truth.

What `cellos-supervisor` deliberately does NOT do:

- It does not own the spec vocabulary. Every type (`ExecutionCellDocument`,
  `AuthorityBundle`, `PolicyPackSpec`, `RunSpec`) comes from `cellos-core`.
- It does not own the wire format of CloudEvents. Every event is built by a
  `*_data_v1` / `cloud_event_v1_*` function from `cellos-core::events`.
- It does not host an HTTP server. Operators talk to `cellos-server`; the
  supervisor only publishes events.
- It does not depend on Cortex. The Cortex bridge lives in `cellos-cortex`
  and links *to* the supervisor, never the other way around (ADR-0008).
- It does not run on a `cellos-lite` build with local LLM dependencies —
  the inference broker port is implemented by external crates.

## Public API surface

The crate is a binary; `lib.rs` (`src/lib.rs:1`) exposes only the modules
integration tests need:

- `dns_proxy` — the SEAM-1 / L2-04 DNS proxy. Forward-only UDP, enforces
  `dnsAuthority.hostnameAllowlist` at the protocol layer, emits one
  `dns_query` CloudEvent per query. `src/dns_proxy/mod.rs:1`.
  Submodules: `parser`, `upstream`, `spawn`, `dnssec`.
- `sni_proxy` — TLS SNI / H2 `:authority` evaluator. `src/sni_proxy/`.
- `resolver_refresh` — host-controlled DNS resolver refresh with TTL
  watchdog and drift CloudEvent emission. `src/resolver_refresh/`.
- `ebpf_flow` — scaffolding for the eBPF/nflog per-flow listener (Phase 2).
  `src/ebpf_flow.rs:1`.
- `event_signing` — Ed25519/HMAC per-event signing wrapper. The public
  posture mirror (`event_signing_posture::SigningConfig`,
  `src/lib.rs:62`) is `doc(hidden)` and exists only so integration tests
  can pin the `Zeroizing<Vec<u8>>` invariant on key material.
- `linux_cgroup` — cgroup v2 helpers (target_os = "linux").
- `nft_counters` — nftables counter readers for network-enforcement events.
- `per_flow` — real-time per-flow nflog listener.
- `destruction_evidence` — terminal-state evidence builder for the
  `lifecycle.destroyed` event.
- `spec_input` — read `ExecutionCellDocument` from stdin/file +
  spec-hash computation. `src/spec_input.rs`.
- `trust_keyset_load` — load `SignedTrustKeysetEnvelope` from
  `CELLOS_TRUST_KEYSET_PATH` and verify against the operator-supplied
  keyring.
- `host_telemetry` (re-export of `cellos_host_telemetry`) — F1a Path B
  host-side probes + F3b vsock listener (per ADR-0006 §5.4).
- `__a2_02::resolve_caller_identity` — `doc(hidden)` mirror of
  `composition::resolve_caller_identity` for an integration test that
  pins the `CELLOS_CALLER_IDENTITY` → trim → `"default"` fallback
  contract.
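
The `CELLOS_CALLER_IDENTITY` → trim → `"default"` contract pinned by that test can be sketched as a pure function. This is a minimal illustration, not the crate's code: `caller_identity_from` is a hypothetical name, and it takes the raw value as an argument rather than reading the environment so the fallback rule is easy to check in isolation.

```rust
/// Hypothetical helper (not the crate's real signature): trims the raw
/// value and falls back to "default" when it is missing, empty, or
/// whitespace-only.
fn caller_identity_from(raw: Option<&str>) -> String {
    match raw.map(str::trim) {
        // A non-empty value after trimming is used as the RBAC subject.
        Some(trimmed) if !trimmed.is_empty() => trimmed.to_string(),
        // Unset, empty, or whitespace-only values take the fallback.
        _ => "default".to_string(),
    }
}
```

A caller would typically pass `std::env::var("CELLOS_CALLER_IDENTITY").ok().as_deref()` as the argument.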

Everything else lives in the binary's private module tree:

- `composition` — env-driven wiring (host backend, broker, sinks, exports,
  policy/authz/authority/trust keys). `src/composition.rs`.
- `supervisor` — the lifecycle orchestrator. `src/supervisor.rs:38`.
- `supervisor_helpers` — helpers for cell-spec destructuring, target
  resolution, redaction. `src/supervisor_helpers.rs`.
- `network_policy`, `linux_isolation`, `linux_mount`, `linux_net`,
  `linux_seccomp` — the Linux dataplane and isolation primitives.
- `runtime_secret`, `proxy_activation`, `command_runner`,
  `trust_plane_observability` — the rest of the run-phase machinery.

## Architecture / how it works

```
                ┌───────────────────────────────────────────────────────┐
                │ main.rs:                                              │
                │  - parse argv / stdin spec                            │
                │  - validate_execution_cell_document                   │
                │  - verify_authority_derivation                        │
                │  - enforce_derivation_scope_policy                    │
                │  - build_supervisor (composition.rs)                  │
                │  - emit_startup_banner                                │
                │  - Supervisor::run(spec)                              │
                └─────────────────────────┬─────────────────────────────┘
       ┌──────────────────────────────────────────────────────────────┐
       │ Supervisor (src/supervisor.rs)                               │
       │                                                              │
       │   host:        Arc<dyn CellBackend>      ← cellos-host-*     │
       │   broker:      Arc<dyn SecretBroker>     ← cellos-broker-*   │
       │   event_sink:  Arc<dyn EventSink>        ← cellos-sink-*     │
       │   jsonl_sink:  Option<Arc<dyn EventSink>>← cellos-sink-jsonl │
       │   exports:     HashMap<String, Arc<dyn ExportSink>>          │
       │   policy_pack: Option<PolicyPackSpec>     (admission gate)   │
       │   authz_policy:Option<AuthorizationPolicy>(RBAC, ADR-0007)   │
       │   authority_keys, trust_verify_keys:Arc<HashMap<...>>        │
       │                                                              │
       │   lifecycle:                                                 │
       │      network_scope                                           │
       │      trust_plane_observability                               │
       │      secrets (mount / env / runtime lease)                   │
       │      lifecycle.started                                       │
       │      run → command_completed / observability.*               │
       │      export → export_completed_v2 / export_failed_v2         │
       │      destroy → revoke → lifecycle.destroyed (always)         │
       └──────────────────────────────────────────────────────────────┘
                  ┌───────────────────────────────────────────┐
                  │  CloudEventV1 → primary event_sink →      │
                  │  optional jsonl_sink (mirror)             │
                  │  → JetStream / file / DLQ / redacted      │
                  └───────────────────────────────────────────┘
```
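
The bottom box of the diagram (primary sink first, then the optional JSONL mirror) can be illustrated with a small fan-out wrapper. This is a sketch under stated assumptions: the `EventSink` trait below is a hypothetical synchronous stand-in for the real port in `cellos-core`, events are reduced to plain strings, and `FanOut` and `MemorySink` are illustrative names. Whether a mirror failure should fail the whole emit is also an assumption made here.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical, synchronous stand-in for the `EventSink` port.
trait EventSink {
    fn emit(&self, event: &str) -> Result<(), String>;
}

/// In-memory sink used only for illustration.
#[derive(Default)]
struct MemorySink {
    events: Mutex<Vec<String>>,
}

impl EventSink for MemorySink {
    fn emit(&self, event: &str) -> Result<(), String> {
        self.events
            .lock()
            .map_err(|e| e.to_string())?
            .push(event.to_string());
        Ok(())
    }
}

/// Sends every event to the primary sink, then to the mirror if one is set.
struct FanOut {
    primary: Arc<dyn EventSink>,
    mirror: Option<Arc<dyn EventSink>>,
}

impl FanOut {
    fn emit(&self, event: &str) -> Result<(), String> {
        self.primary.emit(event)?;
        if let Some(mirror) = &self.mirror {
            // Assumption: a mirror failure is surfaced to the caller.
            mirror.emit(event)?;
        }
        Ok(())
    }
}
```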

Teardown semantics: `destroy` and `revoke_for_cell` are called
*unconditionally* even when a phase error has already been captured
(`src/supervisor.rs:5`). Residue classes (`ResidueClass`, `LifecycleResidueClass`)
on the terminal event let the projector and operator audit what was left
behind.
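
The teardown rule above can be sketched as follows. This is an illustration of the control flow only: `PhaseResult`, `run_lifecycle`, and the string trace are hypothetical, and the real supervisor works with its own error types and emits CloudEvents instead of pushing strings.

```rust
/// Hypothetical phase result; the real supervisor uses its own error types.
type PhaseResult = Result<(), String>;

/// Example phases used to exercise the sketch.
fn ok_phase() -> PhaseResult {
    Ok(())
}

fn failing_phase() -> PhaseResult {
    Err("boom".into())
}

/// Runs the phases in order and stops at the first failure, then performs
/// teardown whether or not a failure was captured.
fn run_lifecycle(
    phases: &[(&str, fn() -> PhaseResult)],
    trace: &mut Vec<String>,
) -> PhaseResult {
    let mut first_error: Option<String> = None;
    for (name, phase) in phases {
        trace.push(name.to_string());
        if let Err(e) = phase() {
            // Capture the error and skip the remaining phases.
            first_error = Some(format!("{name}: {e}"));
            break;
        }
    }
    // Teardown is not gated on `first_error`.
    trace.push("destroy".to_string());
    trace.push("revoke".to_string());
    trace.push("lifecycle.destroyed".to_string());
    match first_error {
        Some(e) => Err(e),
        None => Ok(()),
    }
}
```

The captured error is returned only after teardown has run, so a failed `secrets` phase still ends with `destroy`, `revoke`, and the terminal event.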

The DNS proxy (`src/dns_proxy/`) is forward-only UDP. It parses each query,
evaluates `dnsAuthority.hostnameAllowlist` (literal or single-leading-`*.`
wildcard), and either forwards verbatim to the declared upstream or builds
a REFUSED response. Every observed query is emitted as a CloudEvent built
by `cellos_core::cloud_event_v1_dns_query`. SERVFAIL is synthesized on
upstream timeout so workloads see deterministic failure
(`src/dns_proxy/mod.rs:23`).
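
The allowlist check described above (literal or single leading `*.` wildcard, otherwise REFUSED) might look like the sketch below. It is not taken from `src/dns_proxy/`: the names `Decision`, `pattern_matches`, and `evaluate` are hypothetical, and three behaviours are assumptions made for the example: names are compared case-insensitively, one trailing dot is ignored, and a wildcard matches one or more labels below the suffix but never the bare suffix itself.

```rust
/// Outcome of evaluating one query name against the allowlist.
#[derive(Debug, PartialEq, Eq)]
enum Decision {
    Forward,
    Refused,
}

/// Normalizes a DNS name for comparison: drops one trailing dot, lowercases.
fn normalize(name: &str) -> String {
    name.strip_suffix('.').unwrap_or(name).to_ascii_lowercase()
}

/// Returns true when `pattern` (literal or single leading "*.") admits `host`.
fn pattern_matches(pattern: &str, host: &str) -> bool {
    let pattern = normalize(pattern);
    let host = normalize(host);
    match pattern.strip_prefix("*.") {
        // Wildcard: no further '*', and at least one label before the suffix.
        Some(suffix) => {
            !suffix.is_empty()
                && !suffix.contains('*')
                && host.len() > suffix.len() + 1
                && host.ends_with(suffix)
                && host.as_bytes()[host.len() - suffix.len() - 1] == b'.'
        }
        // Literal: exact match only; a '*' elsewhere never matches.
        None => !pattern.contains('*') && pattern == host,
    }
}

/// Forwards the query if any allowlist entry admits it, otherwise refuses.
fn evaluate(allowlist: &[&str], qname: &str) -> Decision {
    if allowlist.iter().any(|p| pattern_matches(p, qname)) {
        Decision::Forward
    } else {
        Decision::Refused
    }
}
```

The byte check before the suffix is what keeps `evilinternal.test` from matching `*.internal.test`; a plain `ends_with` would admit it.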

## Configuration

The supervisor is configured almost entirely through environment variables.
This is a representative slice (run `cargo doc --open -p cellos-supervisor`
for the full list):

| Env var | Default | Effect |
|---|---|---|
| `CELLOS_STRICT_CONFIG` | unset | When truthy, refuse to start if any env var fell back to a default. Useful in CI. `src/composition.rs:96`. |
| `CELLOS_CALLER_IDENTITY` | `"default"` | RBAC subject used by `authz_policy`. Empty/whitespace → `"default"`. `src/composition.rs:148`. |
| `CELLOS_CELL_BACKEND` | host-cellos | Selects host backend: `cellctl`, `firecracker`, or `stub`. `src/composition.rs:157`. |
| `CELLOS_BROKER` | `env` | Secret broker: `env`, `file`, `oidc`, `vault`. `src/composition.rs:170`. |
| `CELLOS_EXPORT_DIR` | unset | When set, mount the local-FS export sink. `src/composition.rs:183`. |
| `CELLOS_EXPORT_HTTP_BASE_URL` | unset | When set, mount the HTTP export sink. |
| `CELLOS_DEPLOYMENT_PROFILE` | `hardened` | Deployment profile (`hardened` / `permissive`). Hardened auto-sets several `REQUIRE_*` flags. `src/composition.rs:231`. |
| `CELLOS_POLICY_PACK_PATH` | unset | Path to the policy-pack JSON; loaded into `Supervisor.policy_pack`. |
| `CELLOS_AUTHZ_POLICY_PATH` | unset | Path to the authorization-policy JSON (ADR-0007). |
| `CELLOS_AUTHORITY_KEYS_PATH` | unset (required in `hardened`) | Operator-supplied role → Ed25519 verifying-key map. |
| `CELLOS_TRUST_VERIFY_KEYS_PATH` | unset | Trust-keyset signer kid → verifying-key map. |
| `CELLOS_TRUST_KEYSET_PATH` | unset | Signed trust-keyset envelope. |
| `CELLOS_REQUIRE_AUTHORITY_DERIVATION` | unset (auto in hardened) | Reject specs without a derivation token. |
| `CELLOS_REQUIRE_SCOPED_DERIVATION_TOKENS` | unset (auto in hardened) | Refuse non-scoped derivation tokens. |
| `CELLOS_REQUIRE_TELEMETRY_DECLARED` | unset (auto in hardened) | Require `telemetry.declared` in every spec. |
| `CELL_OS_USE_NOOP_SINK` | unset | Force the noop event sink (debugging). |
| `CELL_OS_JSONL_EVENTS` | unset | Mirror every event to a JSONL file (path-valued). |
| `CELL_OS_REQUIRE_JETSTREAM` | unset (auto in hardened) | Refuse to start if the JetStream sink can't connect. |
| `CELLOS_RUN_ID` | `run-local-001` (`validate` for `--validate`) | Stamped onto every event in this run. |
| `CELLOS_EVENT_SIGNING_*` | unset | Configure the I5 per-event signing wrapper. See `src/event_signing.rs`. |

Hardened profile defaults are documented at `src/composition.rs:253`.
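
As a rough illustration of the "unset (auto in hardened)" rows in the table, the sketch below resolves a `REQUIRE_*` flag from the deployment profile and the raw env value. Everything here is an assumption made for the example rather than the behaviour of `composition.rs`: the function names, the set of values treated as truthy, the rejection of unknown profile names, and the choice that `hardened` turns a flag on regardless of the explicit value.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Profile {
    Hardened,
    Permissive,
}

/// Parses the profile; unset or empty falls back to the documented default.
fn parse_profile(raw: Option<&str>) -> Result<Profile, String> {
    match raw.map(str::trim) {
        None | Some("") => Ok(Profile::Hardened),
        Some("hardened") => Ok(Profile::Hardened),
        Some("permissive") => Ok(Profile::Permissive),
        // Assumption: unknown names are rejected instead of defaulted.
        Some(other) => Err(format!("unknown deployment profile: {other}")),
    }
}

/// Assumed truthy spellings; the real parser may accept a different set.
fn is_truthy(raw: Option<&str>) -> bool {
    matches!(
        raw.map(|v| v.trim().to_ascii_lowercase()).as_deref(),
        Some("1" | "true" | "yes" | "on")
    )
}

/// A flag is on when the profile is hardened or the env value is truthy.
fn require_flag(profile: Profile, raw: Option<&str>) -> bool {
    profile == Profile::Hardened || is_truthy(raw)
}
```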

## Examples

Run a spec under the stub backend with JSONL output:

```bash
CELLOS_CELL_BACKEND=stub \
CELLOS_BROKER=env \
CELL_OS_USE_NOOP_SINK=1 \
CELL_OS_JSONL_EVENTS=/tmp/events.jsonl \
CELLOS_DEPLOYMENT_PROFILE=permissive \
cargo run -p cellos-supervisor --bin cellos-supervisor -- /path/to/spec.yaml
```

Validate a spec without running it:

```bash
cargo run -p cellos-supervisor --bin cellos-supervisor -- --validate /path/to/spec.yaml
```

Project the resulting JSONL into a state snapshot:

```bash
cargo run -p cellos-projector --bin cellos-projector -- /tmp/events.jsonl --pretty
```

The `cellos-supervisor` binary's argv shape is contracted; see
`crates/cellos-supervisor/tests/argv_invariants.rs` for the typed
guarantees.

## Testing

```bash
cargo test -p cellos-supervisor
```

`crates/cellos-supervisor/tests/` carries ~80 integration tests covering
break-attempt scenarios (DNS rebinding, DNSSEC downgrade, kernel UDP 443,
SNI mismatch H2c, H2 CONTINUATION flood, post-isolation residue,
capability drop/grant), event invariants (lifecycle reason typed,
terminal state naming, manifest_failed, forced terminal exit code),
secret hygiene (zeroization, debug redaction, per-backend delivery
defaults), and trust-keyset behaviour.

Several tests are gated:

- `firecracker_e2e.rs` requires a local Firecracker binary and host root
  caps; it is `#[ignore]`d in default runs.
- The break-attempt tests that use nftables / nflog require Linux and
  run only on `target_os = "linux"`.

To run everything including ignored tests:

```bash
cargo test -p cellos-supervisor -- --include-ignored
```

The supervisor crate has its own preflight skill, `CellPreflight`, that
catches common Docker / Firecracker build mistakes before a 12-minute
rebuild.

## Related crates

- [`cellos-core`](../cellos-core/README.md) — owns every spec/event type
  this crate consumes and emits.
- [`cellos-server`](../cellos-server/README.md) — projects the
  CloudEvents this supervisor publishes; never imports this crate.
- [`cellos-host-cellos`](../cellos-host-cellos), `cellos-host-firecracker`,
  `cellos-host-stub` — the three `CellBackend` implementations selected
  by `CELLOS_CELL_BACKEND`.
- [`cellos-sink-jetstream`](../cellos-sink-jetstream),
  `cellos-sink-jsonl`, `cellos-sink-redact`, `cellos-sink-dlq` — the
  `EventSink` implementations layered behind the primary sink.
- [`cellos-broker-env`](../cellos-broker-env), `cellos-broker-file`,
  `cellos-broker-oidc`, `cellos-broker-vault` — `SecretBroker`
  implementations.
- [`cellos-export-local`](../cellos-export-local),
  `cellos-export-http`, `cellos-export-s3` — `ExportSink`
  implementations.
- [`cellos-host-telemetry`](../cellos-host-telemetry) — Path B host-side
  probes + F3b vsock receiver, re-exported as `host_telemetry`.
- [`cellos-cortex`](../cellos-cortex/README.md) — only crate allowed to
  import this one *across* the Cortex boundary.

## ADRs

- [ADR-0001](../../docs/adr/0001-rust-nats-jetstream-proprietary-host.md)
  — NATS JetStream as the proprietary host substrate.
- [ADR-0004](../../docs/adr/0004-tls-termination-fronting-trust-boundary.md)
  — TLS termination + fronting trust boundary (sni_proxy).
- [ADR-0005](../../docs/adr/0005-tls-termination-design.md) — typed
  authority enforcement (the four variants admitted here).
- [ADR-0006](../../docs/adr/0006-in-vm-observability-runner-evidence.md)
  — in-VM observability evidence + the F3b host-side vsock receiver.
- [ADR-0007](../../docs/adr/0007-rbac-secret-ref-admission.md)
  — authorization policy + secret-ref admission.
- [ADR-0009](../../docs/adr/0009-cortex-doctrine-to-cellos-authority-mapping.md)
  — doctrine → authority mapping, consumed by callers via `cellos-cortex`.
- [ADR-0010](../../docs/adr/0010-formation-authority-invariant.md)
  — formation authority invariant.