cellos-fleet 0.5.0

S3-queue fleet dispatch agent for CellOS — pulls pending cell specs from S3, claims them, hands off to a local cellos-supervisor.

cellos-fleet

Host-resident agent that polls an S3-backed spec queue and dispatches execution cells to a local cellos-supervisor.

What it is

cellos-fleet is a single binary that runs on a fleet node and turns "specs sitting in an S3 prefix" into "cells executed by the local supervisor". It is the simplest possible work-distribution surface: one S3 prefix per pool, key renaming as the claim primitive, no control-plane database, no leader election.

Layer L3 (supervisor / agent). It sits beside cellos-supervisor on the host — the supervisor knows how to execute one cell; the fleet agent knows how to fetch the next spec from a shared bucket and hand it off.

What it isn't:

  • Not the Formations control plane. Formations (ADR-0010, ADR-0014) are the typed, multi-cell control-plane resource served by cellos-server over HTTP and projected from JetStream by cellos-projector. cellos-fleet is orthogonal — it predates Formations and remains the right answer when you want pool-routed spec dispatch without standing up the full server + JetStream + projector stack. The two coexist; nothing in cellos-fleet depends on cellos-core's formation types.
  • Not a scheduler. It does no placement reasoning beyond a literal poolId equality filter.
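  • Not a transactional queue. Claiming is aws s3 mv (copy + delete), which is not atomic: two agents racing for the same key can each complete the copy before either delete lands. That is safe enough for low-concurrency single-agent deployments. A future revision can swap this for a DynamoDB conditional write.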

Public API surface

This crate ships one binary (cellos-fleet) and no library API. The operational interface is the queue layout plus the environment.

Queue model

An S3 prefix is treated as a four-state work queue, with key renaming as the state transition:

pending/<spec-id>.json   →  claimed/<spec-id>.json  →  supervisor runs
                                                     →  completed/<spec-id>.json   (exit 0)
                                                     →  failed/<spec-id>.json      (exit ≠ 0)
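
The transitions above can be sketched as a small state type plus a key builder. This is an illustrative sketch, not the crate's code: the names QueueState, object_key, and terminal_state are hypothetical, the key layout assumes the prefix is simply prepended (as in the s3://cellos-prod/fleet/pending/ example), and treating a supervisor killed by a signal as failed is an assumption.

```rust
/// The four queue states; each maps to one key segment under the prefix.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QueueState {
    Pending,
    Claimed,
    Completed,
    Failed,
}

impl QueueState {
    /// Key segment used for this state.
    pub fn segment(self) -> &'static str {
        match self {
            QueueState::Pending => "pending",
            QueueState::Claimed => "claimed",
            QueueState::Completed => "completed",
            QueueState::Failed => "failed",
        }
    }
}

/// Build the object key for a spec in a given state.
/// ASSUMPTION: layout is `<prefix>/<state>/<spec-id>.json`; how an optional
/// queue name fits into the key is not modelled here.
pub fn object_key(prefix: &str, state: QueueState, spec_id: &str) -> String {
    format!(
        "{}/{}/{}.json",
        prefix.trim_end_matches('/'),
        state.segment(),
        spec_id
    )
}

/// Terminal state for a claimed spec, given the supervisor's exit code.
/// `None` models a supervisor terminated by a signal.
/// ASSUMPTION: anything other than exit code 0 is treated as failed.
pub fn terminal_state(exit_code: Option<i32>) -> QueueState {
    match exit_code {
        Some(0) => QueueState::Completed,
        _ => QueueState::Failed,
    }
}
```

A claim is then a rename from object_key(prefix, QueueState::Pending, id) to object_key(prefix, QueueState::Claimed, id), and the final rename targets whatever terminal_state returns.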

Environment variables

| Variable | Required | Default | Description |
|---|---|---|---|
| CELLOS_FLEET_BUCKET | yes | (none) | S3 bucket name. |
| CELLOS_FLEET_PREFIX | no | fleet | Key prefix inside the bucket. |
| CELLOS_FLEET_QUEUE_NAME | no | (empty) | Optional named lane under the prefix; lets multiple agents service distinct lanes without stepping on each other. |
| CELLOS_FLEET_POOL_ID | no | (empty) | Runner pool identifier (T11 placement gate). When set, the dispatcher skips specs whose spec.placement.poolId is set AND does not equal this value. Specs with no poolId constraint are accepted everywhere. |
| CELLOS_FLEET_SUPERVISOR | no | cellos-supervisor | Path to the supervisor binary the agent execs. |
| CELLOS_FLEET_POLL_INTERVAL_MS | no | 5000 | Queue poll cadence, in milliseconds. |
| CELLOS_FLEET_HEARTBEAT_INTERVAL_MS | no | 30000 | Heartbeat cadence, in milliseconds. |
| CELLOS_FLEET_NODE_ID | no | hostname | Unique node identifier for log attribution. |
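
As a rough illustration of how these variables and defaults fit together, the sketch below parses them into a struct. It is not the crate's actual parser: FleetConfig, config_from, and the error strings are hypothetical, and treating an empty value the same as an unset one is an assumption. The lookup is passed in as a closure so the logic can be exercised without touching the process environment.

```rust
use std::time::Duration;

/// Hypothetical in-memory form of the environment variables listed above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FleetConfig {
    pub bucket: String,
    pub prefix: String,
    pub queue_name: Option<String>,
    pub pool_id: Option<String>,
    pub supervisor: String,
    pub poll_interval: Duration,
    pub heartbeat_interval: Duration,
    /// `None` means the caller should fall back to the hostname.
    pub node_id: Option<String>,
}

/// ASSUMPTION: an empty value is treated the same as an unset variable.
fn non_empty(value: Option<String>) -> Option<String> {
    value.filter(|s| !s.is_empty())
}

/// Parse a millisecond interval, falling back to the documented default.
fn millis(name: &str, raw: Option<String>, default_ms: u64) -> Result<Duration, String> {
    match non_empty(raw) {
        None => Ok(Duration::from_millis(default_ms)),
        Some(s) => s
            .parse::<u64>()
            .map(Duration::from_millis)
            .map_err(|e| format!("{} must be a whole number of milliseconds: {}", name, e)),
    }
}

/// Build a config from any variable lookup function.
pub fn config_from<F>(get: F) -> Result<FleetConfig, String>
where
    F: Fn(&str) -> Option<String>,
{
    let bucket = non_empty(get("CELLOS_FLEET_BUCKET"))
        .ok_or_else(|| "CELLOS_FLEET_BUCKET is required".to_string())?;
    Ok(FleetConfig {
        bucket,
        prefix: non_empty(get("CELLOS_FLEET_PREFIX")).unwrap_or_else(|| "fleet".to_string()),
        queue_name: non_empty(get("CELLOS_FLEET_QUEUE_NAME")),
        pool_id: non_empty(get("CELLOS_FLEET_POOL_ID")),
        supervisor: non_empty(get("CELLOS_FLEET_SUPERVISOR"))
            .unwrap_or_else(|| "cellos-supervisor".to_string()),
        poll_interval: millis(
            "CELLOS_FLEET_POLL_INTERVAL_MS",
            get("CELLOS_FLEET_POLL_INTERVAL_MS"),
            5000,
        )?,
        heartbeat_interval: millis(
            "CELLOS_FLEET_HEARTBEAT_INTERVAL_MS",
            get("CELLOS_FLEET_HEARTBEAT_INTERVAL_MS"),
            30000,
        )?,
        node_id: non_empty(get("CELLOS_FLEET_NODE_ID")),
    })
}
```

Against the real environment this would be called as config_from(|k| std::env::var(k).ok()).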

The agent inherits AWS credentials from the environment (IAM role, env vars, or instance metadata) — it does not manage its own identity.

Architecture

The agent is a single tokio runtime running two tasks:

  1. Poll loop. Every CELLOS_FLEET_POLL_INTERVAL_MS it lists pending/, filters by poolId if CELLOS_FLEET_POOL_ID is set, then attempts to claim one spec by renaming its key into claimed/. On successful claim it execs cellos-supervisor with the spec on stdin and waits for exit.
  2. Heartbeat. Every CELLOS_FLEET_HEARTBEAT_INTERVAL_MS it emits a structured log line tagged with the node ID. This is the signal operators watch to distinguish "agent running with nothing to do" from "agent stuck or dead".
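
The placement gate applied in the poll loop is small enough to sketch directly. The code below is illustrative only: PendingSpec, accepts, and next_candidate are hypothetical names, and picking the first acceptable spec in listing order is an assumption, since the selection order is not specified here.

```rust
/// A pending spec as the poll loop might see it (hypothetical shape).
pub struct PendingSpec {
    pub spec_id: String,
    /// Value of spec.placement.poolId, if the spec sets one.
    pub pool_id: Option<String>,
}

/// Literal poolId equality filter.
/// A spec is skipped only when BOTH the agent and the spec name a pool
/// and the two values differ; every other combination is accepted.
pub fn accepts(agent_pool: Option<&str>, spec_pool: Option<&str>) -> bool {
    match (agent_pool, spec_pool) {
        (Some(agent), Some(spec)) => agent == spec,
        _ => true,
    }
}

/// Pick the next spec this agent may try to claim.
/// ASSUMPTION: the first acceptable spec in listing order is chosen.
pub fn next_candidate<'a>(
    agent_pool: Option<&str>,
    pending: &'a [PendingSpec],
) -> Option<&'a PendingSpec> {
    pending
        .iter()
        .find(|spec| accepts(agent_pool, spec.pool_id.as_deref()))
}
```

With CELLOS_FLEET_POOL_ID=gpu-a100, a spec pinned to another pool is skipped, while an unpinned spec is still eligible.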

On SIGTERM the poll loop stops accepting new work; any in-flight cell finishes normally before the process exits. A clean drain log line is emitted so operators can distinguish graceful shutdown from a crash.
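
The drain behaviour can be modelled with a shared shutdown flag that is checked only between cells. This is a simplified, synchronous sketch under stated assumptions: run_until_drained is a hypothetical name, the flag stands in for whatever the SIGTERM handler sets, and the real agent runs on tokio rather than a plain loop.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Execute specs one at a time until the list is exhausted or shutdown is
/// requested. Returns the number of specs that were executed.
///
/// The flag is read only before accepting new work, so a cell that has
/// already started always runs to completion.
pub fn run_until_drained<I, F>(specs: I, shutdown: &AtomicBool, mut execute: F) -> usize
where
    I: IntoIterator<Item = String>,
    F: FnMut(&str),
{
    let mut executed = 0;
    for spec_id in specs {
        if shutdown.load(Ordering::SeqCst) {
            // Stop accepting new work; remaining specs stay in the queue.
            break;
        }
        execute(&spec_id);
        executed += 1;
    }
    executed
}
```

If the flag flips while a cell is running, that cell still finishes and only the next one is refused, which matches the drain semantics described above.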

The supervisor binary is invoked, not linked. This keeps the fleet agent decoupled from the supervisor's transitive deps (Linux-only host backends, eBPF, jailer) and lets the agent ship as a slim binary on any host that has AWS CLI access.
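
Invoking the supervisor as a child process with the spec on stdin might look like the sketch below, which uses only the standard library. It is an assumption-laden illustration rather than the agent's code: run_supervisor is a hypothetical name, and the args parameter exists only so a stand-in program can be used; nothing here implies the real supervisor takes arguments.

```rust
use std::io::Write;
use std::process::{Command, ExitStatus, Stdio};

/// Spawn `program`, feed it the spec on stdin, and wait for it to exit.
pub fn run_supervisor(
    program: &str,
    args: &[&str],
    spec_json: &[u8],
) -> std::io::Result<ExitStatus> {
    let mut child = Command::new(program)
        .args(args)
        .stdin(Stdio::piped())
        .spawn()?;

    // Write the spec, then drop the handle at the end of the statement so
    // the child sees EOF on stdin.
    let write_result = child
        .stdin
        .take()
        .expect("stdin was configured as piped")
        .write_all(spec_json);

    // Always reap the child, even if the write failed.
    let status = child.wait()?;
    write_result?;
    Ok(status)
}
```

The returned ExitStatus is what decides between completed/ and failed/: status.success() corresponds to exit 0.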

Configuration

All configuration is via the environment variables above. There is no config file. The agent is meant to be deployed with a systemd unit (or equivalent) whose [Service] section sets the env vars from your secret + config manager of choice.

Examples

Run an agent that services every pool in s3://cellos-prod/fleet/:

export CELLOS_FLEET_BUCKET=cellos-prod
export AWS_REGION=us-west-2
cellos-fleet

Run an agent dedicated to a single pool, with a custom supervisor path:

export CELLOS_FLEET_BUCKET=cellos-prod
export CELLOS_FLEET_PREFIX=fleet
export CELLOS_FLEET_QUEUE_NAME=gpu-pool
export CELLOS_FLEET_POOL_ID=gpu-a100
export CELLOS_FLEET_SUPERVISOR=/opt/cellos/bin/cellos-supervisor
cellos-fleet

Submit a spec into the queue (operator side):

aws s3 cp ./my-spec.json s3://cellos-prod/fleet/pending/$(uuidgen).json

Testing

Integration tests live in tests/:

  • tests/cellos_fleet_happy_path.rs — full pending → claimed → completed transition.
  • tests/cellos_fleet_status_filtering.rs — poolId placement gate.
  • tests/cellos_fleet_invariants.rs — drain semantics, key naming.
  • tests/config_from_env.rs — environment parsing.
  • tests/smoke.rs — binary build + --help smoke.

Run:

cargo test -p cellos-fleet

The tests use bench-local stubs and tempfile-backed fixtures; no real S3 or supervisor is required.

Related crates

  • cellos-supervisor — the binary the fleet agent execs per claimed spec.
  • cellos-core — defines the spec types the supervisor validates after the fleet agent hands off.
  • cellos-server — the alternative control plane (HTTP + JetStream + projections) for deployments that want typed Formations instead of an S3 prefix.

ADRs

  • ADR-0001 — workspace decision: Rust + JetStream + proprietary host backend. The fleet agent is the "without-JetStream" entry path; it dispatches to the same supervisor.
  • ADR-0010 — Formation authority invariant. Explains the control-plane shape cellos-fleet is NOT — useful when deciding which dispatch model fits a deployment.
  • ADR-0011 — the HTTP control-plane alternative.

See also docs/runner-first-class.md and docs/runner-stack.md for the runner-pool placement model (poolId).