cellos-fleet 0.5.1

S3-queue fleet dispatch agent for CellOS — pulls pending cell specs from S3, claims them, hands off to a local cellos-supervisor.
# cellos-fleet

Host-resident agent that polls an S3-backed spec queue and dispatches
execution cells to a local `cellos-supervisor`.

## What it is

`cellos-fleet` is a single binary that runs on a fleet node and turns
"specs sitting in an S3 prefix" into "cells executed by the local
supervisor". It is the simplest possible work-distribution surface:
one S3 prefix per pool, key renaming as the claim primitive, no
control-plane database, no leader election.

`cellos-fleet` belongs to layer L3 (supervisor / agent). It sits beside
`cellos-supervisor` on the host: the supervisor knows how to execute
one cell; the fleet agent knows how to fetch the next spec from a
shared bucket and hand it off.

What it isn't:
- **Not the Formations control plane.** Formations (ADR-0010, ADR-0014)
  are the typed, multi-cell control-plane resource served by
  `cellos-server` over HTTP and projected from JetStream by
  `cellos-projector`. `cellos-fleet` is orthogonal — it predates
  Formations and remains the right answer when you want pool-routed
  spec dispatch without standing up the full server + JetStream +
  projector stack. The two coexist; nothing in `cellos-fleet` depends
  on `cellos-core`'s formation types.
- **Not a scheduler.** It does no placement reasoning beyond a literal
  `poolId` equality filter.
- **Not a transactional queue.** Claiming is `aws s3 mv` (copy + delete),
  which is not atomic: two agents racing on the same key can both
  complete the copy before either delete lands. That is safe enough
  for low-concurrency single-agent deployments. A future revision can
  swap this for a DynamoDB conditional write.

## Public API surface

This crate ships one binary (`cellos-fleet`) and no library API. The
operational interface is the queue layout plus the environment.

### Queue model

An S3 prefix is treated as a four-state work queue, with key renaming
as the state transition:

```
pending/<spec-id>.json   →  claimed/<spec-id>.json  →  supervisor runs
                                                     →  completed/<spec-id>.json   (exit 0)
                                                     →  failed/<spec-id>.json      (exit ≠ 0)
```
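The transitions above can be sketched as a small key-building helper. This is an illustration only: `QueueState` and `object_key` are hypothetical names, not items exported by the crate, and the layout `<prefix>/<state>/<spec-id>.json` is an assumption inferred from the diagram and the submit example later in this README. The sketch ignores `CELLOS_FLEET_QUEUE_NAME`, since where the lane segment sits in the key is not specified here.

```rust
/// The four queue states named in the diagram above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QueueState {
    Pending,
    Claimed,
    Completed,
    Failed,
}

impl QueueState {
    /// Key segment used for this state.
    pub fn segment(self) -> &'static str {
        match self {
            QueueState::Pending => "pending",
            QueueState::Claimed => "claimed",
            QueueState::Completed => "completed",
            QueueState::Failed => "failed",
        }
    }

    /// State a claimed spec moves to once the supervisor exits.
    pub fn after_exit(exit_code: i32) -> QueueState {
        if exit_code == 0 {
            QueueState::Completed
        } else {
            QueueState::Failed
        }
    }
}

/// Hypothetical helper: object key for a spec in a given state.
/// Assumes keys look like `<prefix>/<state>/<spec-id>.json`.
pub fn object_key(prefix: &str, state: QueueState, spec_id: &str) -> String {
    format!(
        "{}/{}/{}.json",
        prefix.trim_end_matches('/'),
        state.segment(),
        spec_id
    )
}
```

Under these assumptions, claiming a spec amounts to moving the object from `object_key(prefix, QueueState::Pending, id)` to `object_key(prefix, QueueState::Claimed, id)`.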

### Environment variables

| Variable | Required | Default | Description |
|---|---|---|---|
| `CELLOS_FLEET_BUCKET` | yes | (none) | S3 bucket name. |
| `CELLOS_FLEET_PREFIX` | no | `fleet` | Key prefix inside the bucket. |
| `CELLOS_FLEET_QUEUE_NAME` | no | (empty) | Optional named lane under the prefix; lets multiple agents service distinct lanes without stepping on each other. |
| `CELLOS_FLEET_POOL_ID` | no | (empty) | Runner pool identifier (T11 placement gate). When set, the dispatcher skips specs whose `spec.placement.poolId` is set AND does not equal this value. Specs with no `poolId` constraint are accepted everywhere. |
| `CELLOS_FLEET_SUPERVISOR` | no | `cellos-supervisor` | Path to the supervisor binary the agent execs. |
| `CELLOS_FLEET_POLL_INTERVAL_MS` | no | `5000` | Queue poll cadence. |
| `CELLOS_FLEET_HEARTBEAT_INTERVAL_MS` | no | `30000` | Heartbeat cadence. |
| `CELLOS_FLEET_NODE_ID` | no | hostname | Unique node identifier for log attribution. |

The agent inherits AWS credentials from the environment (IAM role, env
vars, or instance metadata) — it does not manage its own identity.
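As a rough sketch of how the table above could map onto a config struct, the following parses the documented variables with their documented defaults. `FleetConfig`, `config_from`, and `millis` are hypothetical names and not the crate's actual types. Treating an empty value as unset is an assumption, and `CELLOS_FLEET_NODE_ID` is left out because its hostname default needs more than the standard library.

```rust
use std::time::Duration;

/// Parsed agent settings. Field names are illustrative, not the crate's own.
#[derive(Debug)]
pub struct FleetConfig {
    pub bucket: String,
    pub prefix: String,
    pub queue_name: Option<String>,
    pub pool_id: Option<String>,
    pub supervisor: String,
    pub poll_interval: Duration,
    pub heartbeat_interval: Duration,
}

/// Read an optional millisecond variable, falling back to `default_ms`.
fn millis(
    lookup: &impl Fn(&str) -> Option<String>,
    name: &str,
    default_ms: u64,
) -> Result<Duration, String> {
    match lookup(name) {
        None => Ok(Duration::from_millis(default_ms)),
        Some(raw) => raw
            .trim()
            .parse::<u64>()
            .map(Duration::from_millis)
            .map_err(|e| format!("{name}: invalid value {raw:?}: {e}")),
    }
}

/// Build a config from any variable lookup, so tests need no real environment.
pub fn config_from(lookup: impl Fn(&str) -> Option<String>) -> Result<FleetConfig, String> {
    // Assumption: an empty value is treated the same as an unset one.
    let get = |name: &str| lookup(name).filter(|v| !v.is_empty());
    Ok(FleetConfig {
        bucket: get("CELLOS_FLEET_BUCKET").ok_or("CELLOS_FLEET_BUCKET is required")?,
        prefix: get("CELLOS_FLEET_PREFIX").unwrap_or_else(|| "fleet".to_string()),
        queue_name: get("CELLOS_FLEET_QUEUE_NAME"),
        pool_id: get("CELLOS_FLEET_POOL_ID"),
        supervisor: get("CELLOS_FLEET_SUPERVISOR")
            .unwrap_or_else(|| "cellos-supervisor".to_string()),
        poll_interval: millis(&lookup, "CELLOS_FLEET_POLL_INTERVAL_MS", 5000)?,
        heartbeat_interval: millis(&lookup, "CELLOS_FLEET_HEARTBEAT_INTERVAL_MS", 30000)?,
    })
}
```

In a real process the lookup would be `config_from(|k| std::env::var(k).ok())`; passing a closure keeps the parsing testable without mutating process-wide state.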

## Architecture

The agent is a single tokio runtime running two tasks:

1. **Poll loop.** Every `CELLOS_FLEET_POLL_INTERVAL_MS` it lists
   `pending/`, filters by `poolId` if `CELLOS_FLEET_POOL_ID` is set,
   then attempts to claim one spec by renaming its key into
   `claimed/`. On successful claim it execs `cellos-supervisor` with
   the spec on stdin and waits for exit.
2. **Heartbeat.** Every `CELLOS_FLEET_HEARTBEAT_INTERVAL_MS` it emits
   a structured log line tagged with the node ID. This is the signal
   operators watch to distinguish "agent running with nothing to do"
   from "agent stuck or dead".
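The `poolId` filter in the poll loop follows the rule stated for `CELLOS_FLEET_POOL_ID`. A minimal sketch of that rule is below; `accepts` and `next_candidate` are hypothetical names, and picking the first accepted entry in listing order is an assumption about selection that this README does not specify.

```rust
/// Placement gate: an agent skips a spec only when both sides name a pool
/// and the two values differ.
pub fn accepts(agent_pool: Option<&str>, spec_pool: Option<&str>) -> bool {
    match (agent_pool, spec_pool) {
        // No CELLOS_FLEET_POOL_ID on the agent: nothing is filtered.
        (None, _) => true,
        // Spec carries no poolId constraint: accepted everywhere.
        (Some(_), None) => true,
        // Both set: literal equality, no other placement reasoning.
        (Some(agent), Some(spec)) => agent == spec,
    }
}

/// Pick the first pending key this agent may claim.
/// Each entry is `(key, spec.placement.poolId)`; the order is whatever the
/// listing returned (an assumption for this sketch).
pub fn next_candidate<'a>(
    agent_pool: Option<&str>,
    pending: &[(&'a str, Option<&str>)],
) -> Option<&'a str> {
    pending
        .iter()
        .find(|(_, pool)| accepts(agent_pool, *pool))
        .map(|(key, _)| *key)
}
```

Note that an agent with a pool ID still picks up unconstrained specs, so a dedicated pool does not imply a dedicated queue.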

On `SIGTERM` the poll loop stops accepting new work; any in-flight
cell finishes normally before the process exits. A clean drain log
line is emitted so operators can distinguish graceful shutdown from a
crash.
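The drain behaviour can be pictured as a shutdown flag that is checked only between cells. The sketch below is synchronous and uses a plain `AtomicBool`; the real agent runs on tokio and its signal handling is not shown in this README, so `poll_until_drained` and its parameters are hypothetical.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Run specs one at a time until the queue is empty or shutdown is requested.
/// Returns how many cells were run.
pub fn poll_until_drained<'a>(
    pending: impl IntoIterator<Item = &'a str>,
    shutdown: &AtomicBool,
    mut run_cell: impl FnMut(&str),
) -> usize {
    let mut ran = 0;
    for spec in pending {
        // Checked only between cells: once signalled, accept no new work.
        if shutdown.load(Ordering::SeqCst) {
            break;
        }
        // An in-flight cell is never interrupted; it runs to completion.
        run_cell(spec);
        ran += 1;
    }
    ran
}
```

A signal handler would set the flag; the cell that is already running when the flag flips still finishes, which matches the "in-flight cell finishes normally" guarantee described above.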

The supervisor binary is invoked, not linked. This keeps the fleet
agent decoupled from the supervisor's transitive deps (Linux-only
host backends, eBPF, jailer) and lets the agent ship as a slim binary
on any host that has AWS CLI access.
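Invoking rather than linking means the hand-off is an ordinary child process with the spec piped to stdin. A minimal sketch using only `std::process` follows; `run_supervisor` is a hypothetical name, the supervisor is assumed to take no extra arguments, and the real agent may well use tokio's process API instead.

```rust
use std::io::{self, Write};
use std::process::{Command, Stdio};

/// Run `supervisor` with `spec` on stdin and wait for it to exit.
/// Returns Ok(true) when the exit status is 0.
pub fn run_supervisor(supervisor: &str, spec: &[u8]) -> io::Result<bool> {
    let mut child = Command::new(supervisor).stdin(Stdio::piped()).spawn()?;

    // Take the pipe so it is dropped (closed) right after the write;
    // the child then sees EOF on stdin.
    let write_result = match child.stdin.take() {
        Some(mut stdin) => stdin.write_all(spec),
        None => Ok(()),
    };

    // Always reap the child, even if the write failed.
    let status = child.wait()?;

    match write_result {
        // The child may exit without reading its input; its status still counts.
        Err(e) if e.kind() != io::ErrorKind::BrokenPipe => Err(e),
        _ => Ok(status.success()),
    }
}
```

The boolean result maps directly onto the queue transition: `true` moves the spec to `completed/`, `false` to `failed/`.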

## Configuration

All configuration is via the environment variables above. There is no
config file. The agent is meant to be deployed with a systemd unit (or
equivalent) whose `[Service]` section sets the env vars from your
secret + config manager of choice.

## Examples

Run an agent that services every pool in `s3://cellos-prod/fleet/`:

```
export CELLOS_FLEET_BUCKET=cellos-prod
export AWS_REGION=us-west-2
cellos-fleet
```

Run an agent dedicated to a single pool, with a custom supervisor
path:

```
export CELLOS_FLEET_BUCKET=cellos-prod
export CELLOS_FLEET_PREFIX=fleet
export CELLOS_FLEET_QUEUE_NAME=gpu-pool
export CELLOS_FLEET_POOL_ID=gpu-a100
export CELLOS_FLEET_SUPERVISOR=/opt/cellos/bin/cellos-supervisor
cellos-fleet
```

Submit a spec into the queue (operator side):

```
aws s3 cp ./my-spec.json s3://cellos-prod/fleet/pending/$(uuidgen).json
```

## Testing

Integration tests live in `tests/`:

- `tests/cellos_fleet_happy_path.rs`: full pending → claimed →
  completed transition.
- `tests/cellos_fleet_status_filtering.rs`: `poolId` placement gate.
- `tests/cellos_fleet_invariants.rs`: drain semantics, key naming.
- `tests/config_from_env.rs`: environment parsing.
- `tests/smoke.rs`: binary build + `--help` smoke.

Run:

```
cargo test -p cellos-fleet
```

The tests use bench-local stubs and `tempfile`-backed fixtures; no
real S3 or supervisor is required.

## Related crates

- [`cellos-supervisor`](../cellos-supervisor/): the binary the fleet
  agent execs per claimed spec.
- [`cellos-core`](../cellos-core/README.md): defines the spec types
  the supervisor validates after the fleet agent hands off.
- [`cellos-server`](../cellos-server/): the alternative control
  plane (HTTP + JetStream + projections) for deployments that want
  typed Formations instead of an S3 prefix.

## ADRs

- [ADR-0001](../../docs/adr/0001-rust-nats-jetstream-proprietary-host.md):
  workspace decision (Rust + JetStream + proprietary host backend).
  The fleet agent is the "without-JetStream" entry path; it dispatches
  to the same supervisor.
- [ADR-0010](../../docs/adr/0010-formation-authority-invariant.md):
  Formation authority invariant. Explains the control-plane shape
  `cellos-fleet` is NOT, which is useful when deciding which dispatch
  model fits a deployment.
- [ADR-0011](../../docs/adr/0011-cellos-server-http-control-plane.md):
  the HTTP control-plane alternative.

See also `docs/runner-first-class.md` and `docs/runner-stack.md` for
the runner-pool placement model (`poolId`).