# cellos-fleet
Host-resident agent that polls an S3-backed spec queue and dispatches
execution cells to a local `cellos-supervisor`.
## What it is
`cellos-fleet` is a single binary that runs on a fleet node and turns
"specs sitting in an S3 prefix" into "cells executed by the local
supervisor". It is the simplest possible work-distribution surface:
one S3 prefix per pool, key renaming as the claim primitive, no
control-plane database, no leader election.
The agent belongs to layer L3 (supervisor / agent). It sits beside
`cellos-supervisor` on the host: the supervisor knows how to execute
one cell; the fleet agent knows how to fetch the next spec from a
shared bucket and hand it off.
What it isn't:
- **Not the Formations control plane.** Formations (ADR-0010, ADR-0014)
are the typed, multi-cell control-plane resource served by
`cellos-server` over HTTP and projected from JetStream by
`cellos-projector`. `cellos-fleet` is orthogonal — it predates
Formations and remains the right answer when you want pool-routed
spec dispatch without standing up the full server + JetStream +
projector stack. The two coexist; nothing in `cellos-fleet` depends
on `cellos-core`'s formation types.
- **Not a scheduler.** It does no placement reasoning beyond a literal
`poolId` equality filter.
- **Not a transactional queue.** Claiming is `aws s3 mv` (copy + delete),
which is not atomic: two agents racing for the same key can both
complete the copy. That is safe enough for low-concurrency,
single-agent deployments. A future revision can swap this for a
DynamoDB conditional write.
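The claim step named in the last bullet can be sketched as the argument vector handed to the AWS CLI. This is a minimal illustration, not the crate's code: the helper name `claim_args` is hypothetical, and the key layout `<prefix>/<state>/<spec-id>.json` is assumed from the submit example later in this README (no queue-name lane).

```rust
/// Build the arguments for `aws s3 mv` that move a spec from `pending/`
/// to `claimed/`. Hypothetical helper; the layout is an assumption.
fn claim_args(bucket: &str, prefix: &str, spec_id: &str) -> Vec<String> {
    let src = format!("s3://{bucket}/{prefix}/pending/{spec_id}.json");
    let dst = format!("s3://{bucket}/{prefix}/claimed/{spec_id}.json");
    // `mv` is a copy followed by a delete, so this is a best-effort claim,
    // not a compare-and-swap.
    vec!["s3".to_string(), "mv".to_string(), src, dst]
}
```

Passing the result to `std::process::Command::new("aws").args(...)` would run the move; a non-zero exit can be read as "someone else got there first or the key is gone".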
## Public API surface
This crate ships one binary (`cellos-fleet`) and no library API. The
operational interface is the queue layout plus the environment.
### Queue model
An S3 prefix is treated as a four-state work queue, with key renaming
as the state transition:
```
pending/<spec-id>.json → claimed/<spec-id>.json → supervisor runs
    → completed/<spec-id>.json   (exit 0)
    → failed/<spec-id>.json      (exit ≠ 0)
```
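The same state machine, written out as a small Rust sketch. The type and function names here (`QueueState`, `spec_key`, `terminal_state`) are illustrative and not taken from the crate; the key layout assumes no `CELLOS_FLEET_QUEUE_NAME` lane is in use.

```rust
/// The four queue states, each backed by a key segment under the prefix.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum QueueState {
    Pending,
    Claimed,
    Completed,
    Failed,
}

impl QueueState {
    /// Directory-like key segment for this state.
    fn segment(self) -> &'static str {
        match self {
            QueueState::Pending => "pending",
            QueueState::Claimed => "claimed",
            QueueState::Completed => "completed",
            QueueState::Failed => "failed",
        }
    }
}

/// Object key for a spec in a given state.
/// Assumed layout: `<prefix>/<state>/<spec-id>.json`.
fn spec_key(prefix: &str, state: QueueState, spec_id: &str) -> String {
    format!("{}/{}/{}.json", prefix, state.segment(), spec_id)
}

/// Terminal state for a finished cell: exit 0 is completed,
/// any other exit code is failed.
fn terminal_state(exit_code: i32) -> QueueState {
    if exit_code == 0 {
        QueueState::Completed
    } else {
        QueueState::Failed
    }
}
```

Each transition is then a rename from `spec_key(prefix, from, id)` to `spec_key(prefix, to, id)`.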
### Environment variables
| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| `CELLOS_FLEET_BUCKET` | yes | — | S3 bucket name. |
| `CELLOS_FLEET_PREFIX` | no | `fleet` | Key prefix inside the bucket. |
| `CELLOS_FLEET_QUEUE_NAME` | no | (empty) | Optional named lane under the prefix; lets multiple agents service distinct lanes without stepping on each other. |
| `CELLOS_FLEET_POOL_ID` | no | (empty) | Runner pool identifier (T11 placement gate). When set, the dispatcher skips specs whose `spec.placement.poolId` is set AND does not equal this value. Specs with no `poolId` constraint are accepted everywhere. |
| `CELLOS_FLEET_SUPERVISOR` | no | `cellos-supervisor` | Path to the supervisor binary the agent execs. |
| `CELLOS_FLEET_POLL_INTERVAL_MS` | no | `5000` | Queue poll cadence. |
| `CELLOS_FLEET_HEARTBEAT_INTERVAL_MS` | no | `30000` | Heartbeat cadence. |
| `CELLOS_FLEET_NODE_ID` | no | hostname | Unique node identifier for log attribution. |
The agent inherits AWS credentials from the environment (IAM role, env
vars, or instance metadata) — it does not manage its own identity.
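As a rough sketch of how these variables and defaults could be parsed, the following uses only the standard library. The struct, its field names, and the choice to treat an empty value as unset are assumptions for illustration; the crate's actual parsing (covered by `tests/config_from_env.rs`) may differ. The hostname is passed in by the caller because `std` has no portable hostname call.

```rust
use std::time::Duration;

/// Hypothetical config struct; field names are illustrative.
#[derive(Debug)]
struct FleetConfig {
    bucket: String,
    prefix: String,
    queue_name: Option<String>,
    pool_id: Option<String>,
    supervisor: String,
    poll_interval: Duration,
    heartbeat_interval: Duration,
    node_id: String,
}

/// Parse a millisecond variable, using `default_ms` when it is unset.
fn millis<F>(get: &F, name: &str, default_ms: u64) -> Result<Duration, String>
where
    F: Fn(&str) -> Option<String>,
{
    match get(name) {
        None => Ok(Duration::from_millis(default_ms)),
        Some(raw) => raw
            .trim()
            .parse::<u64>()
            .map(Duration::from_millis)
            .map_err(|e| format!("{name}: {e}")),
    }
}

/// Build the config from a lookup function so it can be tested without
/// touching the real process environment.
fn config_from<F>(get: F, hostname: &str) -> Result<FleetConfig, String>
where
    F: Fn(&str) -> Option<String>,
{
    // Assumption: an empty value is treated the same as an unset variable.
    let non_empty = |name: &str| get(name).filter(|v| !v.is_empty());
    Ok(FleetConfig {
        bucket: non_empty("CELLOS_FLEET_BUCKET")
            .ok_or("CELLOS_FLEET_BUCKET is required")?,
        prefix: non_empty("CELLOS_FLEET_PREFIX").unwrap_or_else(|| "fleet".to_string()),
        queue_name: non_empty("CELLOS_FLEET_QUEUE_NAME"),
        pool_id: non_empty("CELLOS_FLEET_POOL_ID"),
        supervisor: non_empty("CELLOS_FLEET_SUPERVISOR")
            .unwrap_or_else(|| "cellos-supervisor".to_string()),
        poll_interval: millis(&get, "CELLOS_FLEET_POLL_INTERVAL_MS", 5000)?,
        heartbeat_interval: millis(&get, "CELLOS_FLEET_HEARTBEAT_INTERVAL_MS", 30000)?,
        node_id: non_empty("CELLOS_FLEET_NODE_ID").unwrap_or_else(|| hostname.to_string()),
    })
}
```

In a real process the lookup would be `|k| std::env::var(k).ok()`.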
## Architecture
The agent runs two tasks on a single Tokio runtime:
1. **Poll loop.** Every `CELLOS_FLEET_POLL_INTERVAL_MS` it lists
`pending/`, filters by `poolId` if `CELLOS_FLEET_POOL_ID` is set,
then attempts to claim one spec by renaming its key into
`claimed/`. On a successful claim it runs the supervisor binary
(`CELLOS_FLEET_SUPERVISOR`) as a child process with the spec on
stdin and waits for it to exit.
2. **Heartbeat.** Every `CELLOS_FLEET_HEARTBEAT_INTERVAL_MS` it emits
a structured log line tagged with the node ID. This is the signal
operators watch to distinguish "agent running with nothing to do"
from "agent stuck or dead".
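The placement filter in the poll loop reduces to a small predicate. The sketch below follows the rule stated for `CELLOS_FLEET_POOL_ID`; the names `PendingSpec`, `pool_accepts`, and `next_candidate` are hypothetical, and picking the first eligible spec in listing order is an assumption rather than documented behavior.

```rust
/// Minimal view of a pending spec: its ID and optional `spec.placement.poolId`.
struct PendingSpec<'a> {
    id: &'a str,
    pool_id: Option<&'a str>,
}

/// A spec is skipped only when the agent has a pool ID, the spec has a
/// pool ID, and the two differ.
fn pool_accepts(agent_pool: Option<&str>, spec_pool: Option<&str>) -> bool {
    match (agent_pool, spec_pool) {
        (Some(agent), Some(spec)) => agent == spec,
        // Agent without a pool, or spec without a constraint: accept.
        _ => true,
    }
}

/// First spec in listing order that passes the placement gate.
fn next_candidate<'a>(
    agent_pool: Option<&str>,
    pending: &'a [PendingSpec<'a>],
) -> Option<&'a str> {
    pending
        .iter()
        .find(|s| pool_accepts(agent_pool, s.pool_id))
        .map(|s| s.id)
}
```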
On `SIGTERM` the poll loop stops accepting new work; any in-flight
cell finishes normally before the process exits. A clean drain log
line is emitted so operators can distinguish graceful shutdown from a
crash.
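The drain behavior can be illustrated with a synchronous sketch. It is deliberately simplified: the real agent is asynchronous, whereas here an `AtomicBool` stands in for whatever the `SIGTERM` handler sets, an in-memory queue stands in for `pending/`, and `run_cell` stands in for the supervisor call. All names are hypothetical.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};

/// Run queued specs until the queue is empty or a drain is requested.
/// Returns the IDs that were executed.
fn poll_until_drained(
    draining: &AtomicBool,
    queue: &mut VecDeque<String>,
    mut run_cell: impl FnMut(&str),
) -> Vec<String> {
    let mut executed = Vec::new();
    loop {
        // The flag is checked only between cells, so a cell that has
        // already started always runs to completion.
        if draining.load(Ordering::SeqCst) {
            eprintln!("drain requested: no new work accepted");
            break;
        }
        let spec_id = match queue.pop_front() {
            Some(id) => id,
            None => break,
        };
        run_cell(&spec_id);
        executed.push(spec_id);
    }
    executed
}
```

If the flag flips while a cell is running, that cell finishes and everything still queued is left untouched for another agent.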
The supervisor binary is invoked, not linked. This keeps the fleet
agent decoupled from the supervisor's transitive deps (Linux-only
host backends, eBPF, jailer) and lets the agent ship as a slim binary
on any host that has AWS CLI access.
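Invoking rather than linking might look like the following sketch, which spawns the binary, writes the spec to its stdin, and waits for the exit status. The function name is hypothetical, and whether the real supervisor expects additional arguments or environment is not covered here.

```rust
use std::io::Write;
use std::process::{Command, Stdio};

/// Spawn `supervisor`, write `spec` to its stdin, and wait for it to exit.
/// Returns the exit code, or `None` if the child was killed by a signal.
fn run_supervisor(supervisor: &str, spec: &[u8]) -> std::io::Result<Option<i32>> {
    let mut child = Command::new(supervisor)
        .stdin(Stdio::piped())
        .spawn()?;
    // Take stdin so the handle is dropped (closed) right after the write;
    // the child then sees EOF and can start processing.
    if let Some(mut stdin) = child.stdin.take() {
        stdin.write_all(spec)?;
    }
    Ok(child.wait()?.code())
}
```

An exit code of `Some(0)` would route the spec to `completed/`; anything else, including `None`, to `failed/`.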
## Configuration
All configuration is via the environment variables above. There is no
config file. The agent is meant to be deployed with a systemd unit (or
equivalent) whose `[Service]` section sets the env vars from your
secret + config manager of choice.
## Examples
Run an agent that services every pool in `s3://cellos-prod/fleet/`:
```
export CELLOS_FLEET_BUCKET=cellos-prod
export AWS_REGION=us-west-2
cellos-fleet
```
Run an agent dedicated to a single pool, with a custom supervisor
path:
```
export CELLOS_FLEET_BUCKET=cellos-prod
export CELLOS_FLEET_PREFIX=fleet
export CELLOS_FLEET_QUEUE_NAME=gpu-pool
export CELLOS_FLEET_POOL_ID=gpu-a100
export CELLOS_FLEET_SUPERVISOR=/opt/cellos/bin/cellos-supervisor
cellos-fleet
```
Submit a spec into the queue (operator side):
```
aws s3 cp ./my-spec.json s3://cellos-prod/fleet/pending/$(uuidgen).json
```
## Testing
Integration tests live in `tests/`:
- `tests/cellos_fleet_happy_path.rs` — full pending → claimed →
completed transition.
- `tests/cellos_fleet_status_filtering.rs` — `poolId` placement gate.
- `tests/cellos_fleet_invariants.rs` — drain semantics, key naming.
- `tests/config_from_env.rs` — environment parsing.
- `tests/smoke.rs` — binary build + `--help` smoke.
Run:
```
cargo test -p cellos-fleet
```
The tests use bench-local stubs and `tempfile`-backed fixtures; no
real S3 or supervisor is required.
## Related crates
- [`cellos-supervisor`](../cellos-supervisor/) — the binary the fleet
agent execs per claimed spec.
- [`cellos-core`](../cellos-core/README.md) — defines the spec types
the supervisor validates after the fleet agent hands off.
- [`cellos-server`](../cellos-server/) — the alternative control
plane (HTTP + JetStream + projections) for deployments that want
typed Formations instead of an S3 prefix.
## ADRs
- [ADR-0001](../../docs/adr/0001-rust-nats-jetstream-proprietary-host.md) —
workspace decision: Rust + JetStream + proprietary host backend.
The fleet agent is the "without-JetStream" entry path; it dispatches
to the same supervisor.
- [ADR-0010](../../docs/adr/0010-formation-authority-invariant.md) —
Formation authority invariant. Explains the control-plane shape
`cellos-fleet` is NOT — useful when deciding which dispatch model
fits a deployment.
- [ADR-0011](../../docs/adr/0011-cellos-server-http-control-plane.md) —
the HTTP control-plane alternative.
See also `docs/runner-first-class.md` and `docs/runner-stack.md` for
the runner-pool placement model (`poolId`).