cellos-fleet
Host-resident agent that polls an S3-backed spec queue and dispatches
execution cells to a local cellos-supervisor.
What it is
cellos-fleet is a single binary that runs on a fleet node and turns
"specs sitting in an S3 prefix" into "cells executed by the local
supervisor". It is the simplest possible work-distribution surface:
one S3 prefix per pool, key renaming as the claim primitive, no
control-plane database, no leader election.
Layer L3 (supervisor / agent). It sits beside cellos-supervisor on
the host — the supervisor knows how to execute one cell; the fleet
agent knows how to fetch the next spec from a shared bucket and hand
it off.
What it isn't:
- Not the Formations control plane. Formations (ADR-0010, ADR-0014) are the typed, multi-cell control-plane resource served by cellos-server over HTTP and projected from JetStream by cellos-projector. cellos-fleet is orthogonal: it predates Formations and remains the right answer when you want pool-routed spec dispatch without standing up the full server + JetStream + projector stack. The two coexist; nothing in cellos-fleet depends on cellos-core's formation types.
- Not a scheduler. It does no placement reasoning beyond a literal poolId equality filter.
- Not a transactional queue. Claiming is aws s3 mv (copy + delete), which is safe enough for low-concurrency single-agent deployments. A future revision can swap this for a DynamoDB conditional write.
Public API surface
This crate ships one binary (cellos-fleet) and no library API. The
operational interface is the queue layout plus the environment.
Queue model
An S3 prefix is treated as a four-state work queue, with key renaming as the state transition:
pending/<spec-id>.json → claimed/<spec-id>.json → supervisor runs
→ completed/<spec-id>.json (exit 0)
→ failed/<spec-id>.json (exit ≠ 0)
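The key layout above can be sketched as a small state type. This is an illustration only: the names QueueState, queue_key, and terminal_state are hypothetical and not part of the crate, and the sketch assumes the state directory sits directly under the prefix, as in the submit example further down.

```rust
/// The four queue states named in the layout above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QueueState {
    Pending,
    Claimed,
    Completed,
    Failed,
}

impl QueueState {
    /// Directory segment used for this state inside the prefix.
    pub fn segment(self) -> &'static str {
        match self {
            QueueState::Pending => "pending",
            QueueState::Claimed => "claimed",
            QueueState::Completed => "completed",
            QueueState::Failed => "failed",
        }
    }
}

/// Build `<prefix>/<state>/<spec-id>.json` (hypothetical helper).
pub fn queue_key(prefix: &str, state: QueueState, spec_id: &str) -> String {
    format!(
        "{}/{}/{}.json",
        prefix.trim_end_matches('/'),
        state.segment(),
        spec_id
    )
}

/// Pick the terminal state for a claimed spec from the supervisor exit code:
/// exit 0 goes to completed/, anything else to failed/.
pub fn terminal_state(exit_code: i32) -> QueueState {
    if exit_code == 0 {
        QueueState::Completed
    } else {
        QueueState::Failed
    }
}
```

A state transition is then a rename from queue_key(prefix, from, id) to queue_key(prefix, to, id).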
Environment variables
| Variable | Required | Default | Description |
|---|---|---|---|
| CELLOS_FLEET_BUCKET | yes | (none) | S3 bucket name. |
| CELLOS_FLEET_PREFIX | no | fleet | Key prefix inside the bucket. |
| CELLOS_FLEET_QUEUE_NAME | no | (empty) | Optional named lane under the prefix; lets multiple agents service distinct lanes without stepping on each other. |
| CELLOS_FLEET_POOL_ID | no | (empty) | Runner pool identifier (T11 placement gate). When set, the dispatcher skips specs whose spec.placement.poolId is set AND does not equal this value. Specs with no poolId constraint are accepted everywhere. |
| CELLOS_FLEET_SUPERVISOR | no | cellos-supervisor | Path to the supervisor binary the agent execs. |
| CELLOS_FLEET_POLL_INTERVAL_MS | no | 5000 | Queue poll cadence, in milliseconds. |
| CELLOS_FLEET_HEARTBEAT_INTERVAL_MS | no | 30000 | Heartbeat cadence, in milliseconds. |
| CELLOS_FLEET_NODE_ID | no | hostname | Unique node identifier for log attribution. |
The agent inherits AWS credentials from the environment (IAM role, env vars, or instance metadata) — it does not manage its own identity.
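As a rough illustration of how the table's defaults could be applied, here is a minimal sketch of environment parsing. The FleetConfig struct and config_from function are hypothetical names, not the crate's actual types. The sketch also assumes that an empty value is treated the same as an unset one, and it takes the hostname as a parameter rather than looking it up.

```rust
use std::time::Duration;

/// Hypothetical in-memory view of the agent's environment configuration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FleetConfig {
    pub bucket: String,
    pub prefix: String,
    pub queue_name: Option<String>,
    pub pool_id: Option<String>,
    pub supervisor: String,
    pub poll_interval: Duration,
    pub heartbeat_interval: Duration,
    pub node_id: String,
}

/// Treat an unset or empty variable as absent (assumption of this sketch).
fn non_empty(value: Option<String>) -> Option<String> {
    value.filter(|s| !s.is_empty())
}

/// Parse a millisecond interval, falling back to the documented default.
fn millis(
    lookup: &dyn Fn(&str) -> Option<String>,
    name: &str,
    default_ms: u64,
) -> Result<Duration, String> {
    match non_empty(lookup(name)) {
        None => Ok(Duration::from_millis(default_ms)),
        Some(raw) => raw
            .parse::<u64>()
            .map(Duration::from_millis)
            .map_err(|e| format!("{name}={raw}: {e}")),
    }
}

/// Build a config from any variable lookup; pass
/// `&|k| std::env::var(k).ok()` to read the real process environment.
pub fn config_from(
    lookup: &dyn Fn(&str) -> Option<String>,
    hostname: &str,
) -> Result<FleetConfig, String> {
    let get = |name: &str| non_empty(lookup(name));
    Ok(FleetConfig {
        bucket: get("CELLOS_FLEET_BUCKET")
            .ok_or_else(|| "CELLOS_FLEET_BUCKET is required".to_string())?,
        prefix: get("CELLOS_FLEET_PREFIX").unwrap_or_else(|| "fleet".to_string()),
        queue_name: get("CELLOS_FLEET_QUEUE_NAME"),
        pool_id: get("CELLOS_FLEET_POOL_ID"),
        supervisor: get("CELLOS_FLEET_SUPERVISOR")
            .unwrap_or_else(|| "cellos-supervisor".to_string()),
        poll_interval: millis(lookup, "CELLOS_FLEET_POLL_INTERVAL_MS", 5000)?,
        heartbeat_interval: millis(lookup, "CELLOS_FLEET_HEARTBEAT_INTERVAL_MS", 30000)?,
        node_id: get("CELLOS_FLEET_NODE_ID").unwrap_or_else(|| hostname.to_string()),
    })
}
```

Taking the lookup as a parameter keeps the parsing testable without mutating the process environment.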
Architecture
The agent is a single tokio runtime running two tasks:
- Poll loop. Every CELLOS_FLEET_POLL_INTERVAL_MS it lists pending/, filters by poolId if CELLOS_FLEET_POOL_ID is set, then attempts to claim one spec by renaming its key into claimed/. On successful claim it execs cellos-supervisor with the spec on stdin and waits for exit.
- Heartbeat. Every CELLOS_FLEET_HEARTBEAT_INTERVAL_MS it emits a structured log line tagged with the node ID. This is the signal operators watch to distinguish "agent running with nothing to do" from "agent stuck or dead".
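The poolId filter in the poll loop follows the rule stated in the environment table: a spec is skipped only when both the agent and the spec name a pool and the two differ. A minimal sketch of that rule follows. PendingSpec, accepts, and next_candidate are hypothetical names, and picking the first match in listing order is an assumption, since the source does not state an ordering.

```rust
/// Hypothetical view of one entry listed under pending/.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PendingSpec {
    pub key: String,
    /// Value of spec.placement.poolId, if the spec sets one.
    pub pool_id: Option<String>,
}

/// Placement gate: reject only when both sides name a pool and they differ.
pub fn accepts(agent_pool: Option<&str>, spec_pool: Option<&str>) -> bool {
    match (agent_pool, spec_pool) {
        (Some(agent), Some(spec)) => agent == spec,
        // No agent pool, or no constraint on the spec: accept.
        _ => true,
    }
}

/// Pick the first listed spec this agent may claim (ordering is an assumption).
pub fn next_candidate<'a>(
    agent_pool: Option<&str>,
    pending: &'a [PendingSpec],
) -> Option<&'a PendingSpec> {
    pending
        .iter()
        .find(|spec| accepts(agent_pool, spec.pool_id.as_deref()))
}
```

An agent with no CELLOS_FLEET_POOL_ID accepts every spec, which matches the "services every pool" example in the Examples section.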
On SIGTERM the poll loop stops accepting new work; any in-flight
cell finishes normally before the process exits. A clean drain log
line is emitted so operators can distinguish graceful shutdown from a
crash.
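The drain behavior can be modeled with a shared flag that is checked only between specs, so a cell that has already started always runs to completion. The sketch below uses only the standard library and a synchronous loop; it is not the agent's actual tokio code, and poll_until_drained is a hypothetical name. In a real agent the flag would be set from a SIGTERM handler.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Run specs one at a time until the list is exhausted or `shutdown` is set.
/// The flag is checked before each new spec, never during one, so in-flight
/// work finishes normally. Returns the specs that were run.
pub fn poll_until_drained<F>(
    shutdown: &AtomicBool,
    pending: &[&str],
    mut run_cell: F,
) -> Vec<String>
where
    F: FnMut(&str),
{
    let mut finished = Vec::new();
    for &spec in pending {
        if shutdown.load(Ordering::SeqCst) {
            // Stop accepting new work; remaining specs stay in pending/.
            break;
        }
        run_cell(spec);
        finished.push(spec.to_string());
    }
    finished
}
```

If the flag flips while the second of three specs is running, the second still completes and the third is never started.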
The supervisor binary is invoked, not linked. This keeps the fleet agent decoupled from the supervisor's transitive deps (Linux-only host backends, eBPF, jailer) and lets the agent ship as a slim binary on any host that has AWS CLI access.
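Invoking rather than linking amounts to spawning a child process, writing the spec to its stdin, and reading the exit status. A minimal sketch with std::process follows. The function name run_with_spec_on_stdin and its args parameter are hypothetical; the source only states that the spec arrives on stdin and that the exit code decides completed versus failed.

```rust
use std::io::{self, Write};
use std::process::{Command, Stdio};

/// Spawn `program`, feed it the spec on stdin, and wait for it to exit.
/// Returns Ok(true) for exit code 0 and Ok(false) for any other exit.
pub fn run_with_spec_on_stdin(
    program: &str,
    args: &[&str],
    spec_json: &[u8],
) -> io::Result<bool> {
    let mut child = Command::new(program)
        .args(args)
        .stdin(Stdio::piped())
        .spawn()?;

    // Write the spec, then drop the handle so the child sees EOF.
    let write_result = {
        let mut stdin = child.stdin.take().expect("stdin was requested as piped");
        stdin.write_all(spec_json)
    };

    // Always reap the child, even if the write failed.
    let status = child.wait()?;
    write_result?;
    Ok(status.success())
}
```

The boolean result maps directly onto the queue model: true moves the key to completed/, false moves it to failed/.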
Configuration
All configuration is via the environment variables above. There is no
config file. The agent is meant to be deployed with a systemd unit (or
equivalent) whose [Service] section sets the env vars from your
secret + config manager of choice.
Examples
Run an agent that services every pool in s3://cellos-prod/fleet/:
export CELLOS_FLEET_BUCKET=cellos-prod
export AWS_REGION=us-west-2
cellos-fleet
Run an agent dedicated to a single pool, with a custom supervisor path:
export CELLOS_FLEET_BUCKET=cellos-prod
export CELLOS_FLEET_PREFIX=fleet
export CELLOS_FLEET_QUEUE_NAME=gpu-pool
export CELLOS_FLEET_POOL_ID=gpu-a100
export CELLOS_FLEET_SUPERVISOR=/opt/cellos/bin/cellos-supervisor
cellos-fleet
Submit a spec into the queue (operator side):
aws s3 cp ./my-spec.json s3://cellos-prod/fleet/pending/$(uuidgen).json
Testing
Integration tests live in tests/:
- tests/cellos_fleet_happy_path.rs: full pending → claimed → completed transition.
- tests/cellos_fleet_status_filtering.rs: poolId placement gate.
- tests/cellos_fleet_invariants.rs: drain semantics, key naming.
- tests/config_from_env.rs: environment parsing.
- tests/smoke.rs: binary build + --help smoke.
Run:
cargo test -p cellos-fleet
The tests use bench-local stubs and tempfile-backed fixtures; no
real S3 or supervisor is required.
Related crates
- cellos-supervisor: the binary the fleet agent execs per claimed spec.
- cellos-core: defines the spec types the supervisor validates after the fleet agent hands off.
- cellos-server: the alternative control plane (HTTP + JetStream + projections) for deployments that want typed Formations instead of an S3 prefix.
ADRs
- ADR-0001 (workspace decision): Rust + JetStream + proprietary host backend. The fleet agent is the "without-JetStream" entry path; it dispatches to the same supervisor.
- ADR-0010: Formation authority invariant. Explains the control-plane shape cellos-fleet is NOT; useful when deciding which dispatch model fits a deployment.
- ADR-0011: the HTTP control-plane alternative.
See also docs/runner-first-class.md and docs/runner-stack.md for
the runner-pool placement model (poolId).