cellos-host-telemetry 0.5.0

Host-side telemetry receiver for CellOS — vsock listener that host-stamps and signs CloudEvents emitted by the in-guest cellos-telemetry agent.
# cellos-host-telemetry

The host side of the in-VM observability pipeline — per-cell vsock UDS
listener, host-stamping, agent-silenced detection, host-side probes,
and supervisor-side per-event signing of outbound CloudEvents.

## What it is

`cellos-host-telemetry` is the supervisor-side receiver for the
observability path defined in
[ADR-0006 — in-VM observability runner evidence](../../docs/adr/0006-in-vm-observability-runner-evidence.md).
It is **not** the in-guest agent — that one lives in
[`cellos-telemetry`](../cellos-telemetry) — and the split is deliberate
(see "Channel-authenticity model" below).

The crate does five jobs:

1. **Bind a per-cell UDS** at `<vsock_uds_base>_9001` *before* the VM
   boots, mirroring the `_9000` exit-code listener pattern in
   `cellos-host-firecracker::listen_for_exit_code`
   (`crates/cellos-host-firecracker/src/lib.rs:1669`). Bind-before-boot
   is what makes the channel-authenticity primitive hold: the host
   trusts WHICH UDS path the bytes arrived on, not anything in the
   payload.
2. **Decode CBOR-framed guest declarations** (`src/listener.rs:199`)
   with a `content_version` major-version gate
   (ADR-0006 §12) — unknown majors are rejected with
   `TelemetryError::UnsupportedVersion`.
3. **Host-stamp every frame** (`src/host_stamp.rs:28`) so `cell_id`,
   `run_id`, `host_received_at`, and `spec_signature_hash` come
   exclusively from the supervisor. The `GuestDeclaration` type has no
   attribution fields at all, so a compromised guest cannot forge them
   (`src/lib.rs:96-108`).
4. **Detect agent silence** (`src/keepalive.rs`) — a `KeepAlive` tracker
   the listener pokes per-frame, a fire-once `AgentSilencedTrigger`, and
   a `watch_for_silence` watcher loop that fires
   `cell.observability.guest.agent_silenced` exactly once per run.
5. **Sign outbound envelopes** (`src/sign_outbound.rs`) — supervisor-side
   per-event signing using the canonical-JSON payload from
   `cellos_core::trust_keys::canonical_event_signing_payload`. Three
   modes (`Off`, `Hmac` HMAC-SHA256 FIPS 198, `Ed25519` via
   `ed25519-dalek`), driven by env vars.

It additionally exposes a small **host-probe** surface (`src/probes/`,
Slot F1a / Path B) — `HostProbe` trait + four built-in probes
(`fc_metrics`, `cgroup`, `nftables`, `tap_link`) that watch the cell
from outside the guest using primitives the supervisor already controls
(VMM `/metrics` endpoint, cgroup-v2 files, nftables counters, TAP link
state). These are the cross-witness for the guest's Path A
declarations (`src/probes/mod.rs:1-54`).

L2 sits in the [layer model](../../LAYERS.md) at "host runtime /
isolation"; this crate is the host half of the observability spine that
runs *next to* L2 and feeds the L3 supervisor's event sink.

What it deliberately does **not** do:

- It does **not** accept signing primitives the guest could use. Per
  ADR-0006 §5, the guest holds no key material — the supervisor signs.
  This is different from [`cellos-telemetry`](../cellos-telemetry),
  which is forbidden from depending on a signer (`src/lib.rs:18-23`).
- It does **not** trust ANY attribution field the guest may have
  stuffed into the wire payload. Unknown CBOR keys are silently dropped
  at decode (`src/listener.rs:181-217`, structurally enforced by
  `WireFrame` having only the five permitted fields).
- It does **not** write to disk, NATS, or any other sink directly. The
  outputs are values (`StampedDeclaration`, `AgentSilencedSignal`,
  `CloudEventV1`, `SignedEventEnvelopeV1`); the supervisor projects
  them onto the configured `EventSink`.

## Public API surface

Top-level (`src/lib.rs`):

| Item | Where |
|---|---|
| `pub const VSOCK_TELEMETRY_PORT: u32 = 9001` | `src/lib.rs:61` |
| `pub const WIRE_CONTENT_VERSION_MAJOR: u16 = 1` | `src/lib.rs:68` |
| `pub enum TelemetryError { Bind, Wire, UnsupportedVersion }` | `src/lib.rs:72` |
| `pub struct GuestDeclaration { probe_source, guest_pid, guest_comm, guest_monotonic_ns }` | `src/lib.rs:97` |
| `pub struct HostStamp { cell_id, run_id, host_received_at, spec_signature_hash }` | `src/lib.rs:113` |
| `pub struct HostProbeReading { probe, value_json, timestamp_ms }` | `src/lib.rs:134` |
| `pub struct StampedDeclaration { cell_id, run_id, host_received_at, spec_signature_hash, probe_source, guest_pid, guest_comm, guest_monotonic_ns }` | `src/lib.rs:150` |

Listener (`src/listener.rs`):

| Item | Where |
|---|---|
| `pub const MAX_FRAME_BYTES: u32 = 64 * 1024` | `src/listener.rs:50` |
| `pub struct VsockUdsListener` | `src/listener.rs:60` |
| `VsockUdsListener::bind_for_cell(&Path)`, `socket_path()`, `accept()` | `src/listener.rs:65-112` |
| `pub struct VsockUdsStream` | `src/listener.rs:115` |
| `VsockUdsStream::recv_stamped(&HostStamp, &KeepAlive)` | `src/listener.rs:128` |
| `VsockUdsStream::recv_guest_declaration()` | `src/listener.rs:153` |
| `pub fn decode_frame(body: &[u8]) -> Result<GuestDeclaration, TelemetryError>` | `src/listener.rs:199` |

Host-stamping (`src/host_stamp.rs`):

| Item | Where |
|---|---|
| `pub fn stamp(GuestDeclaration, HostStamp) -> StampedDeclaration` | `src/host_stamp.rs:28` |
| `pub fn stamp_now(GuestDeclaration, cell_id, run_id, spec_signature_hash)` | `src/host_stamp.rs:51` |

Keep-alive (`src/keepalive.rs`):

| Item | Where |
|---|---|
| `pub const DEFAULT_KEEPALIVE_WINDOW: Duration = Duration::from_secs(10)` | `src/keepalive.rs:36` |
| `pub struct KeepAlive` (`new`, `window`, `notify_frame`, `is_silenced`) | `src/keepalive.rs:41-79` |
| `pub struct AgentSilencedSignal` | `src/keepalive.rs:85` |
| `AgentSilencedSignal::CLOUDEVENT_TYPE = "dev.cellos.events.cell.observability.v1.guest.agent_silenced"` | `src/keepalive.rs:105` |
| `pub struct AgentSilencedTrigger` (`new`, `fire`, `has_fired`) | `src/keepalive.rs:138-179` |
| `pub async fn watch_for_silence(KeepAlive, Arc<AgentSilencedTrigger>, poll_interval: Duration)` | `src/keepalive.rs:190` |

Signing (`src/sign_outbound.rs`):

| Item | Where |
|---|---|
| `pub struct StampedDeclaration { guest, host }` (F4b local) | `src/sign_outbound.rs:73` |
| `pub const PROVENANCE_DECLARED: &str = "declared"` | `src/sign_outbound.rs:93` |
| `pub const ENV_SIGN_ALG / ENV_SIGN_KID / ENV_SIGN_HMAC_KEY / ENV_SIGN_ED25519_SK` | `src/sign_outbound.rs:98-104` |
| `pub enum SignOutboundError { InvalidConfig, Signer, Serialize }` | `src/sign_outbound.rs:108` |
| `pub enum SigningKeyMaterial { Off, Hmac { kid, key }, Ed25519 { kid, signing_key } }` | `src/sign_outbound.rs:127` |
| `pub enum SigningOutcome` | `src/sign_outbound.rs:283` |
| `pub fn host_stamped_envelope(...)` | `src/sign_outbound.rs:306` |
| `pub fn sign_host_stamped_envelope(...)` | `src/sign_outbound.rs:354` |
| `pub fn host_stamp_and_sign(...)` | `src/sign_outbound.rs:374` |

Probes (`src/probes/`):

| Item | Where |
|---|---|
| `pub const HOST_PROBE_EVENT_SOURCE = "cellos-host-telemetry/probes"` | `src/probes/mod.rs:78` |
| `pub const HOST_PROBE_EVENT_TYPE_PREFIX = "dev.cellos.events.cell.observability.host.v1"` | `src/probes/mod.rs:84` |
| `pub struct ProbeContext { cell_id, run_id, spec_signature_hash }` | `src/probes/mod.rs:93` |
| `pub struct ProbeReading` | `src/probes/mod.rs:125` |
| `pub enum ProbeError` | `src/probes/mod.rs:158` |
| `pub trait HostProbe` | `src/probes/mod.rs:191` |
| `pub fn build_host_probe_envelope(...) -> CloudEventV1` | `src/probes/mod.rs:233` |
| `pub fn emit_reading(...)` | re-exported from `src/lib.rs:50` |
| Built-in probes: `FcMetricsProbe`, `CgroupProbe`, `NftablesProbe`, `TapLinkProbe` | `src/probes/{fc_metrics,cgroup,nftables,tap_link}.rs` |

`#![deny(unsafe_code)]` and `#![warn(missing_docs)]` are enforced at
crate root (`src/lib.rs:36-37`).

## Architecture / how it works

**Channel-authenticity model (ADR-0006 §5).** Firecracker proxies the
guest's vsock connection to a per-cell UDS at
`<vsock_uds_base>_<port>`. The supervisor passes the *same* base path
through to this crate (so the telemetry UDS sits alongside the exit-code
UDS in one socket dir, making teardown a single `remove_dir_all`), and
the listener binds at `<base>_9001` before the workload's first
instruction. The host then trusts whichever stream the bytes arrived on
— payloads carry no attribution. This is structural: `GuestDeclaration`
literally has no `cell_id` / `run_id` / `spec_signature_hash` fields, and
a compile-time witness test in `src/lib.rs:186-206` keeps it that way.
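
As a rough illustration of the `<vsock_uds_base>_<port>` naming described above, the sketch below derives a per-port socket path from a shared base. The helper name `uds_path_for_port` and the example base path are hypothetical; the crate's real entry point is `VsockUdsListener::bind_for_cell`, whose internals are not shown here.

```rust
use std::ffi::OsString;
use std::path::{Path, PathBuf};

/// Hypothetical helper (not part of the crate's API): append `_<port>` to the
/// shared base path so every per-cell socket lands in the same directory.
fn uds_path_for_port(base: &Path, port: u32) -> PathBuf {
    // Work on the OS string so non-UTF-8 base paths survive unchanged.
    let mut name: OsString = base.as_os_str().to_owned();
    name.push(format!("_{}", port));
    PathBuf::from(name)
}
```

With a base of `/run/cellos/cell-42/v.sock`, port 9000 and port 9001 map to sibling files in one directory, which is what makes single-directory teardown possible.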

**Wire format (ADR-0006 §12).** Each frame is `u32 LE length` + CBOR map
body. The map must contain `content_version: u16` (high byte = major,
low byte = minor) — only `WIRE_CONTENT_VERSION_MAJOR = 1` is accepted
today, and unknown majors are rejected with
`TelemetryError::UnsupportedVersion` (`src/listener.rs:203-209`). The
remaining permitted fields are `probe_source`, `guest_pid`,
`guest_comm`, `guest_monotonic_ns`; anything else is dropped at decode
because `WireFrame` (`src/listener.rs:186-193`) has nowhere to put it.
Frames whose body length is 0 or exceeds 64 KiB (`MAX_FRAME_BYTES`) are
rejected with `TelemetryError::Wire`.
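
The framing rules above can be sketched without a CBOR dependency. This is a minimal illustration under stated assumptions, not the crate's decoder: `FrameError`, `frame_body`, and `check_content_version` are hypothetical names, and only the length prefix and the `content_version` split are modelled.

```rust
const MAX_FRAME_BYTES: u32 = 64 * 1024;
const WIRE_CONTENT_VERSION_MAJOR: u16 = 1;

/// Hypothetical error type standing in for `TelemetryError`.
#[derive(Debug, PartialEq)]
enum FrameError {
    Truncated,
    BadLength(u32),
    UnsupportedMajor(u16),
}

/// Split one frame off the front of `buf`: a little-endian u32 length
/// followed by that many body bytes.
fn frame_body(buf: &[u8]) -> Result<&[u8], FrameError> {
    if buf.len() < 4 {
        return Err(FrameError::Truncated);
    }
    let len = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]);
    // Zero-length and oversized frames are rejected before reading the body.
    if len == 0 || len > MAX_FRAME_BYTES {
        return Err(FrameError::BadLength(len));
    }
    buf.get(4..4 + len as usize).ok_or(FrameError::Truncated)
}

/// Split `content_version` into (major, minor) and gate on the major.
fn check_content_version(version: u16) -> Result<(u16, u16), FrameError> {
    let (major, minor) = (version >> 8, version & 0x00ff);
    if major != WIRE_CONTENT_VERSION_MAJOR {
        return Err(FrameError::UnsupportedMajor(major));
    }
    Ok((major, minor))
}
```

Checking the length before touching the body means a hostile guest cannot make the host allocate or read more than 64 KiB per frame.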

**Per-frame stamping.** `VsockUdsStream::recv_stamped` re-stamps
`host_received_at` on every frame so the receive instant is accurate per
event, while `cell_id` / `run_id` / `spec_signature_hash` are per-run
and constant (`src/listener.rs:138-148`). The keep-alive tracker is
poked on every successful receive (`src/listener.rs:137`).
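
To make the per-run versus per-frame split concrete, here is a simplified, self-contained sketch. The types below are stand-ins invented for illustration (field types such as `u32` for `guest_pid` are assumptions); the crate's real types are `GuestDeclaration`, `HostStamp`, and `StampedDeclaration`.

```rust
use std::time::SystemTime;

/// Stand-in for the guest payload: note the absence of attribution fields.
#[derive(Debug, Clone, PartialEq)]
struct GuestDeclarationSketch {
    probe_source: String,
    guest_pid: u32,
    guest_comm: String,
    guest_monotonic_ns: u64,
}

/// Attribution that stays constant for the whole run.
#[derive(Debug, Clone)]
struct RunAttribution {
    cell_id: String,
    run_id: String,
    spec_signature_hash: String,
}

/// Stand-in for the stamped output handed to the supervisor.
#[derive(Debug)]
struct StampedSketch {
    cell_id: String,
    run_id: String,
    spec_signature_hash: String,
    host_received_at: SystemTime,
    guest: GuestDeclarationSketch,
}

/// Combine one guest frame with host-owned attribution and a receive instant
/// taken by the host for this specific frame.
fn stamp_at(
    guest: GuestDeclarationSketch,
    run: &RunAttribution,
    received_at: SystemTime,
) -> StampedSketch {
    StampedSketch {
        cell_id: run.cell_id.clone(),
        run_id: run.run_id.clone(),
        spec_signature_hash: run.spec_signature_hash.clone(),
        host_received_at: received_at,
        guest,
    }
}
```

Because the guest type has nowhere to carry `cell_id` or `run_id`, the only source for those values in the stamped output is the host-side `RunAttribution`.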

**Silence is observable.** `watch_for_silence` polls the `KeepAlive`
tracker at a configurable interval; when `last_frame_at.elapsed() >=
window` it calls `AgentSilencedTrigger::fire` exactly once. The
trigger's fire-once invariant is structural — a second `fire` call
returns `None` (`src/keepalive.rs:161-173`). The watcher reads
`elapsed` under the same critical section that checks the silence
condition, so a frame landing at the boundary cannot yield an
`elapsed_ms < keepalive_window_ms` in the emitted signal
(`src/keepalive.rs:202-211`).
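
The two invariants described above (elapsed time read under the same lock as the silence check, and a trigger that fires at most once) can be sketched with the standard library alone. `KeepAliveSketch` and `FireOnce` are hypothetical stand-ins; the crate's own `KeepAlive` and `AgentSilencedTrigger` may be implemented differently.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Stand-in keep-alive tracker: remembers when the last frame arrived.
struct KeepAliveSketch {
    window: Duration,
    last_frame_at: Mutex<Instant>,
}

impl KeepAliveSketch {
    fn new(window: Duration) -> Self {
        Self { window, last_frame_at: Mutex::new(Instant::now()) }
    }

    /// Called by the listener on every successfully received frame.
    fn notify_frame(&self) {
        *self.last_frame_at.lock().unwrap() = Instant::now();
    }

    /// Read `elapsed` and test it against the window while holding the lock,
    /// so a returned value is never smaller than the window.
    fn silence_elapsed(&self) -> Option<Duration> {
        let last = self.last_frame_at.lock().unwrap();
        let elapsed = last.elapsed();
        (elapsed >= self.window).then_some(elapsed)
    }
}

/// Stand-in fire-once trigger: only the first caller gets `Some`.
struct FireOnce {
    fired: AtomicBool,
}

impl FireOnce {
    fn new() -> Self {
        Self { fired: AtomicBool::new(false) }
    }

    fn fire(&self, elapsed: Duration) -> Option<Duration> {
        // `swap` returns the previous value, so exactly one caller sees `false`.
        if self.fired.swap(true, Ordering::SeqCst) {
            None
        } else {
            Some(elapsed)
        }
    }

    fn has_fired(&self) -> bool {
        self.fired.load(Ordering::SeqCst)
    }
}
```

A watcher loop would poll `silence_elapsed` at its interval and pass any `Some(elapsed)` to `fire`, stopping after the first non-`None` result.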

**Signing (F4b).** The supervisor calls
[`host_stamp_and_sign`] (`src/sign_outbound.rs:374`) with a
`StampedDeclaration` and a `SigningKeyMaterial`, and gets back either
`SigningOutcome::Unsigned(CloudEventV1)` or
`SigningOutcome::Signed(SignedEventEnvelopeV1)`. The signing payload is the
canonical-JSON serialization of the FULL `CloudEventV1` per
`cellos_core::trust_keys::canonical_event_signing_payload`; mutating
any field after the signature is computed makes verification fail (I5 /
O2 doctrine). HMAC keys land in a verifier's `hmac_keys` map; Ed25519
public keys land in `verifying_keys`. `SigningKeyMaterial` implements a
custom `Debug` that NEVER prints key bytes — only the variant and the
kid — so an accidental `{:?}` in a tracing span cannot leak key material
(`src/sign_outbound.rs:147-150`).
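
The redacting `Debug` pattern mentioned above might look like the following. This is a sketch on a stand-in enum, `KeyMaterialSketch`, whose field types are assumptions; it is not copied from `src/sign_outbound.rs`.

```rust
use std::fmt;

/// Stand-in for `SigningKeyMaterial`; the field types are assumed.
enum KeyMaterialSketch {
    Off,
    Hmac { kid: String, key: Vec<u8> },
    Ed25519 { kid: String, seed: [u8; 32] },
}

impl fmt::Debug for KeyMaterialSketch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            KeyMaterialSketch::Off => f.write_str("Off"),
            // Print the variant and the kid only; `..` marks the omitted key.
            KeyMaterialSketch::Hmac { kid, .. } => f
                .debug_struct("Hmac")
                .field("kid", kid)
                .finish_non_exhaustive(),
            KeyMaterialSketch::Ed25519 { kid, .. } => f
                .debug_struct("Ed25519")
                .field("kid", kid)
                .finish_non_exhaustive(),
        }
    }
}
```

Writing the impl by hand instead of deriving it means a new key-bearing field is not printed unless someone adds it to the formatter on purpose.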

**Host probes (Slot F1a / Path B).** Each `HostProbe` implementation
reads concrete inputs from the host (a file path, an endpoint, a table
name, an interface name) and returns a `ProbeReading` carrying
`probe_source`, `inputs`, and `output` blocks — D12 doctrine ("every
probe-emitted event is attributable to a probe and its concrete
inputs/outputs"). `build_host_probe_envelope` projects readings into
`cellos_core::CloudEventV1`s with `source = "cellos-host-telemetry/probes"`
and type `dev.cellos.events.cell.observability.host.v1.<probe>`.
Per-probe `read()` implementations are `#[cfg(target_os = "linux")]`; on
other targets they return `ProbeError::PlatformUnsupported`.
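
A probe with a Linux-only `read()` and a typed fallback elsewhere could be shaped roughly as below. Everything here is illustrative: the trait, struct, and error names carry a `Sketch` suffix because they are not the crate's definitions, and reading `/sys/class/net/<iface>/operstate` is an assumption about what a TAP link probe might inspect.

```rust
const HOST_PROBE_EVENT_TYPE_PREFIX: &str = "dev.cellos.events.cell.observability.host.v1";

#[derive(Debug, PartialEq)]
enum ProbeErrorSketch {
    PlatformUnsupported,
    Io(String),
}

/// A reading names its probe, the concrete input it read, and the output.
struct ReadingSketch {
    probe_source: &'static str,
    inputs: String,
    output: String,
}

trait HostProbeSketch {
    fn name(&self) -> &'static str;
    fn read(&self) -> Result<ReadingSketch, ProbeErrorSketch>;
}

struct TapLinkSketch {
    iface: String,
}

impl HostProbeSketch for TapLinkSketch {
    fn name(&self) -> &'static str {
        "tap_link"
    }

    #[cfg(target_os = "linux")]
    fn read(&self) -> Result<ReadingSketch, ProbeErrorSketch> {
        // Assumed input: the kernel's per-interface operational state file.
        let path = format!("/sys/class/net/{}/operstate", self.iface);
        let state = std::fs::read_to_string(&path)
            .map_err(|e| ProbeErrorSketch::Io(e.to_string()))?;
        Ok(ReadingSketch {
            probe_source: self.name(),
            inputs: path,
            output: state.trim().to_string(),
        })
    }

    #[cfg(not(target_os = "linux"))]
    fn read(&self) -> Result<ReadingSketch, ProbeErrorSketch> {
        Err(ProbeErrorSketch::PlatformUnsupported)
    }
}

/// Build the CloudEvent type string for a probe name.
fn host_probe_event_type(probe: &str) -> String {
    format!("{}.{}", HOST_PROBE_EVENT_TYPE_PREFIX, probe)
}
```

Keeping the input path inside the reading is what lets a consumer attribute each event to a specific probe and a specific host-side source.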

## Configuration

| Env var | Default | Effect |
|---|---|---|
| `CELLOS_HOST_TELEMETRY_SIGN_ALG` | `off` | One of `off`, `hmac-sha256`, `ed25519`. Anything else is rejected. (`src/sign_outbound.rs:98`) |
| `CELLOS_HOST_TELEMETRY_SIGN_KID` | required when alg != off | Signer kid embedded in `SignedEventEnvelopeV1`. (`src/sign_outbound.rs:100`) |
| `CELLOS_HOST_TELEMETRY_SIGN_HMAC_KEY` | required when alg=hmac-sha256 | Base64url (no-pad, padding tolerated) of the shared HMAC key. (`src/sign_outbound.rs:102`) |
| `CELLOS_HOST_TELEMETRY_SIGN_ED25519_SK` | required when alg=ed25519 | Base64url of the 32-byte Ed25519 seed. (`src/sign_outbound.rs:104`) |

Setting both `*_HMAC_KEY` and `*_ED25519_SK` is rejected — the operator
must pick one to avoid ambiguity over which key signed the stream
(`src/sign_outbound.rs:40-43`).
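
The validation rules in the table can be sketched as a pure function over an environment lookup, which keeps the example testable without touching the process environment. `AlgSketch` and `select_alg` are hypothetical names, and rejecting the both-keys case even when signing is off is an assumption of this sketch; the crate's own loader is `SigningKeyMaterial::from_env`.

```rust
const ENV_SIGN_ALG: &str = "CELLOS_HOST_TELEMETRY_SIGN_ALG";
const ENV_SIGN_KID: &str = "CELLOS_HOST_TELEMETRY_SIGN_KID";
const ENV_SIGN_HMAC_KEY: &str = "CELLOS_HOST_TELEMETRY_SIGN_HMAC_KEY";
const ENV_SIGN_ED25519_SK: &str = "CELLOS_HOST_TELEMETRY_SIGN_ED25519_SK";

#[derive(Debug, PartialEq)]
enum AlgSketch {
    Off,
    HmacSha256,
    Ed25519,
}

/// Decide the signing mode from env-style lookups; key decoding is omitted.
fn select_alg(lookup: impl Fn(&str) -> Option<String>) -> Result<AlgSketch, String> {
    let has = |name: &str| lookup(name).is_some();
    // Assumption: both keys present is always ambiguous, whatever the alg.
    if has(ENV_SIGN_HMAC_KEY) && has(ENV_SIGN_ED25519_SK) {
        return Err("set only one of the HMAC key and the Ed25519 seed".to_string());
    }
    let raw = lookup(ENV_SIGN_ALG);
    let (alg, key_var) = match raw.as_deref().unwrap_or("off") {
        "off" => return Ok(AlgSketch::Off),
        "hmac-sha256" => (AlgSketch::HmacSha256, ENV_SIGN_HMAC_KEY),
        "ed25519" => (AlgSketch::Ed25519, ENV_SIGN_ED25519_SK),
        other => return Err(format!("unknown signing algorithm: {}", other)),
    };
    // A kid and the matching key are both required once signing is enabled.
    for required in [ENV_SIGN_KID, key_var] {
        if !has(required) {
            return Err(format!("{} is required for {:?}", required, alg));
        }
    }
    Ok(alg)
}
```

In a real binary the lookup could be `|name| std::env::var(name).ok()`.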

There is no env var for the listener itself; the UDS base path is
chosen by the calling backend (`cellos-host-firecracker`) and passed to
`VsockUdsListener::bind_for_cell`.

## Examples

Listener + host-stamping:

```rust
use std::path::Path;
use std::time::{Duration, SystemTime};
use cellos_host_telemetry::{
    keepalive::KeepAlive, listener::VsockUdsListener, HostStamp, TelemetryError,
};

// `.await` and `?` need an async fn that returns `Result`; run this on the
// supervisor's async runtime.
async fn receive_frames() -> Result<(), TelemetryError> {
    // Bind before the VM boots so the guest's first frame has a listener.
    let listener = VsockUdsListener::bind_for_cell(Path::new(
        "/tmp/cellos-vsock-cell-42.socket",
    ))?;
    let mut stream = listener.accept().await?;
    // Per-run attribution; `recv_stamped` re-stamps `host_received_at` per frame.
    let stamp = HostStamp {
        cell_id: "cell-42".into(),
        run_id: "run-7".into(),
        host_received_at: SystemTime::now(),
        spec_signature_hash: "sha256:deadbeef".into(),
    };
    let keepalive = KeepAlive::new(Duration::from_secs(10));
    while let Some(stamped) = stream.recv_stamped(&stamp, &keepalive).await? {
        // stamped: StampedDeclaration with host-stamped attribution
        let _ = stamped;
    }
    Ok(())
}
```

Silence watcher:

```rust
use std::sync::Arc;
use std::time::Duration;
use cellos_host_telemetry::keepalive::{
    AgentSilencedTrigger, KeepAlive, watch_for_silence,
};

// `.await` is only valid in an async context, so the call is wrapped in an
// async fn here.
async fn wait_for_silence() {
    let keepalive = KeepAlive::new(Duration::from_secs(10));
    let trigger = Arc::new(AgentSilencedTrigger::new(
        "cell-42",
        "run-7",
        Duration::from_secs(10),
    ));
    let signal = watch_for_silence(keepalive, trigger, Duration::from_millis(250)).await;
    // signal: Option<AgentSilencedSignal>; Some(_) on first silence detection
    let _ = signal;
}
```

Signing:

```rust
use cellos_host_telemetry::{
    sign_outbound::{host_stamp_and_sign, SigningKeyMaterial, SigningOutcome},
    GuestDeclaration, HostStamp,
};

let key_material = SigningKeyMaterial::from_env()?;
let outcome: SigningOutcome = host_stamp_and_sign(/* ...stamped declaration... */)?;
match outcome {
    SigningOutcome::Unsigned(cloudevent) => { /* emit as-is */ }
    SigningOutcome::Signed(envelope)     => { /* emit wrapped */ }
}
# Ok::<(), Box<dyn std::error::Error>>(())
```

## Testing

```
cargo test -p cellos-host-telemetry
```

In-source unit tests cover:

- Frame decode: unknown major rejected, known major accepted with
  unknown fields dropped, garbage rejected, UDS bind path, end-to-end
  round trip with attribution overwrite (`src/listener.rs:244-345`).
- Host stamping: host-stamped attribution overrides, `host_received_at`
  preserved when supplied explicitly (`src/host_stamp.rs:68-111`).
- Keep-alive: fresh tracker is not silenced, post-window is silenced,
  `notify_frame` resets timer, trigger fires exactly once, watcher fires
  after window (`src/keepalive.rs:215-277`).
- Constants pinned: vsock port 9001, wire major 1
  (`src/lib.rs:171-184`).

Integration tests under `tests/`:

| File | Scope |
|---|---|
| `smoke.rs` | Host-probe envelope builder, `emit_reading` against a no-op sink, wire-version / port constants. |
| `kill_the_agent.rs` | Agent-silenced detection end-to-end. |

No `#[ignore]` gating — the crate's tests all run on every CI leg
because the listener works against an in-process Unix Domain Socket
(no vsock required).

## Related crates

- [`cellos-telemetry`](../cellos-telemetry): the in-guest agent. Forbidden
  from depending on a signer; emits unsigned declarations over vsock.
- [`cellos-host-firecracker`](../cellos-host-firecracker): pairs the
  `_9000` exit-code UDS with the `_9001` telemetry UDS in the same
  per-cell socket dir.
- [`cellos-core`](../cellos-core): `CloudEventV1`,
  `SignedEventEnvelopeV1`, `canonical_event_signing_payload`,
  `EventSink`, the trust-key sign/verify primitives.
- [`cellos-supervisor`](../cellos-supervisor): owns the receiver loop
  and the `EventSink` the projected envelopes flow into.

## ADRs

- [ADR-0006: In-VM observability runner evidence](../../docs/adr/0006-in-vm-observability-runner-evidence.md)
  — the doctrine reference for the entire host-receiver design.
  Specifically §5 (channel-authenticity), §6 (host-stamped attribution
  is non-negotiable), §7 (`agent_silenced` is an observable signal), and
  §12 (wire-schema versioning) are all enforced in this crate.