# unit: Work Execution Model — Design Decision Record
*Recorded 2026-06-06. This captures decisions about what `unit` is *for* and how work flows through it. It is a north-star record, not an implementation spec — it states the destination, not the next commit.*
## What a unit is
A unit is the smallest self-contained code that can self-replicate, network, and perform functions. On native, a unit is a real OS process: its own network identity, its own colony, able to fork a full copy of itself. The in-process `--multi-unit` colony (many units sharing one runtime) is retained as a scale/stress *testing* tool, honestly labeled — it is not the canonical form of a unit.
## What `unit` is for (the decision)
`unit` is a zero-dependency, decentralized, self-organizing distributed computation fabric. The unit of work is a program. Scheduling, load-balancing, fault-tolerance, and scaling are *emergent from each node's local resource decisions and gossip* — there is no central coordinator, scheduler, master, or control plane. This is the contribution: most distributed compute has a coordinator; `unit` has none.
The purpose is extrinsic: a unit is a substrate for doing work given from outside, not solely an ALife system that maintains itself. The self-replication, placement, and mesh machinery are the infrastructure that makes externally-supplied work resilient and distributable.
## Self-replication boundary (safety + honesty)
- **Within a host:** a unit self-replicates for real — local process fork via the already-present binary (`std::env::current_exe()` + `std::process::Command`, zero-dep). Fully autonomous, fully safe because the binary is already local. Recursive (copies can replicate) and self-limiting under two independent brakes — the resource ceiling and the energy economy (see below).
- **Across hosts:** adding a host is a *human* act — copy the `unit` binary over and run it; it then does everything a `cargo install unit` unit would. The mesh NEVER ships binaries. A unit cannot acquire new ground on its own.
- **State across the mesh:** unit state (pattern) transports freely between running units — that's data, governed by existing trust/admission logic.
This boundary is deliberate: the dangerous capability was never "fork a process," it was "autonomously place my binary on a machine I chose." That one combination is made structurally impossible. Nothing scientifically interesting is lost; the worm hazard is removed by construction. (Consequence: no on-the-wire binary-hash-verification or code-trust tiers are needed, because code never crosses the wire.)
## Two independent replication brakes (hardware-confirmed 2026-06-08)
Replication self-limits under two brakes that measure different things and fail independently:
- **The resource ceiling** is *extrinsic and shared*: a fresh host measurement (memory + load) taken at spawn/admission time, refusing at/over 80% and failing closed when the host can't be measured. It protects the host from all units collectively, no matter how the pressure arose.
- **The energy economy** is *intrinsic and per-unit*: every spawn costs `SPAWN_COST` from the parent's ledger, the child starts with at most a third of what remains, and a unit that cannot afford the cost is refused with "insufficient energy to spawn." It bounds any single lineage's growth to its energy income, even on a host with abundant headroom.
An accidental `8 SPAWN-N` during the 2026-06-08 hardware session confirmed the second brake live: parent energy depleted per spawn (267 → 201 → 134 → 1 → −65 observed across the batch) and the batch halted at "insufficient energy to spawn" — 8 requested, 6 real children produced. The ceiling never had to fire.
How they interact: the energy check runs first (a unit that can't pay doesn't get as far as measuring the host), the ceiling second; either refuses alone. They also cover each other's blind spots in time — the ceiling reads a measurement that can lag a rapid burst (the admission margin exists to absorb that gap), while energy is debited synchronously on every spawn, so a tight replication loop is stopped by its own ledger before the host measurement would even move. Conversely, a unit with a full ledger on a saturated host is still refused by the ceiling. Both are refusal walls, not targets; both fail closed.
## The layering
- **Fabric** — units, mesh, self-replication, resource-aware placement. Speaks Forth internally.
- **Protocol** — s-expressions over the wire: the stable, polyglot, homoiconic contract for "here is work" / "here is a result." Trivial to parse on both ends, code-and-data same shape (fits a program-shipping system and Forth's character), decouples the controller from each unit's accumulated Forth vocabulary.
- **Controller** — any client in any language that speaks the s-expr protocol; submits jobs, receives results. This is the ingress/egress edge that makes the fabric usable.
- **REPL** — a human dropping into one unit to watch and poke. The microscope, NOT the controller. (The native TUI idea — chatter above, REPL below, zero-dep via std + one termios ioctl + ANSI scroll region, raw-mode restore guard built first — wraps this.)
## Work execution model (the core decision)
Every unit has the same uniform capability — no coordinator role exists:
1. A unit receives an s-expression instruction. The source is indistinguishable and irrelevant: it may come from the controller or from a peer that recruited it — *same interface*.
2. It works on what it can. **If the problem exceeds its own hands, it locally recruits peers with headroom** — using the existing placement logic — handing them sub-instructions in the same s-expr protocol. Recruitment is a local decision about the unit's own work, exactly as placement and replication are local decisions about its own resources.
3. Distribution is therefore recursive and emergent: a recruit whose share is still too big recruits further. The work fans out organically; no one planned it. It self-limits against the resource ceiling because a unit recruits only if it can find a peer with headroom.
4. **Results flow back up the recruitment tree** — the distribution structure IS the return routing. Each unit aggregates its recruits' results with its own and returns upward, until the answer reaches the original asker. Egress falls out of ingress structure for free.
## Fault model: let it crash (Erlang's lead)
Workers are dumb and crashable; resilience lives in the supervision relationship, not in worker error-handling. **The recruiter is the supervisor of its recruits** — it already knows what it asked for and is already waiting.
- A recruit's correctness is binary: it either returns a result or it is dead (silent). No partial/limping states to handle.
- Death detection reuses the existing mesh gossip/heartbeat timeout (the same `PEER_TIMEOUT` prune already in the codebase). On a recruit's death, the recruiter re-recruits that sub-problem to another peer with headroom.
- Supervision nests along the recruitment tree: if a recruiter itself dies, whoever recruited *it* re-recruits its whole subtree, recursively, up to the controller. An Erlang supervision tree that emerged from the recruitment structure rather than being designed separately.
- **Minimum required state:** a recruiting unit holds the sub-instruction it handed out until it gets a result back, so it can re-recruit if the recruit vanishes. It holds the *what*, never the *how-far*. This is the same shape as confirm-before-release in the transport layer (don't discard work until landing is confirmed) — applied to computation.
### Wedged peers: the job timeout (decided 2026-06-09)
Gossip-death covers crash; a peer that still heartbeats but never replies needs a second, slower clock. Decisions:
- **Two timescales, one mechanism.** Gossip-death (`PEER_TIMEOUT`, 15s) re-recruits when the holder vanishes; the job timeout (`RECRUIT_TIMEOUT`, 60s per slot, restarted on each assignment) re-recruits when the holder is live but silent. Both flow through the same re-recruit path; the wedged holder is excluded from candidates (unlike a dead one, it could be re-chosen). No candidate → reset the deadline and keep waiting, fail-closed.
- **Fixed, not scaled to the job.** A healthy worker's flat evaluation is already wall-clock bounded *on the worker* (`execution_timeout`), so the recruiter needs no cost model for arbitrary s-exprs. The exception is deliberate: a nested `(parallel ...)` defers its reply until its own recruits finish — *outside* the worker's eval bound — so a deep-but-healthy subtree can exceed 60s of silence and be redundantly re-recruited. That is correct (first-write-wins, below) but not free: re-recruiting a `(parallel ...)` duplicates the **whole subtree** — recruit messages, energy spend, and nested evals at every level, not one eval. Each recruiter in the tree supervises its own edge with the same constant, so recovery stays local to the wedged edge rather than restarting from the root.
- **At-least-once execution, first-write-wins collection.** Never cancel a wedged peer (a cancel message to a wedged peer is exactly the message that won't be processed). The same `(goal_id, seq)` may execute twice; replies are judged by identity, not sender — whichever lands first settles the slot and its job-slot, later duplicates are dropped silently. If the "wedged" peer was merely slow and finishes first, its result is used.
- **Orphaned sub-recruits are accepted waste.** When a wedged mid-tree peer's own recruits complete, they reply to a parent that never reports up — their results die there. This is named here so it doesn't read as a leak in logs: it is the bounded cost of refusing distributed cancellation.
- **Observability over caps.** Every slot counts its reassignments (gossip-death or timeout), surfaced by `RECRUITS`, so re-recruit cycling between wedged peers is visible on hardware. No reassignment cap in v1 — each cycle costs one message per timeout period, and a cap would be a guess; add one only if cycling is actually observed. (Extended after the 2026-06-10 hardware session: fail-closed deadline resets are counted and surfaced too — without that, "timeout pass failing closed every 60s" and "timeout pass not firing at all" rendered identically as a bare pending slot.)
### Hardware-validated (2026-06-10, three 512MB+2GB-swap droplets, binary e61165d)
The full wedged-peer path above was observed end to end on real hardware, with zero operator input during recovery. The wedge was honest — issuer A full (no local headroom), peer C killed, holder B carrying a recruited `(parallel (200 200 200))` subtree via the Deferred path with no placement candidates anywhere. Observed on A, in order:
1. Initial supervision re-recruit (`re-recruited 1x`).
2. The fail-closed loop, visible exactly as designed: `timeout expired, no candidate with headroom — deadline reset (1x/2x/3x/4x), still held by <B>` printing asynchronously on the ~60s cadence.
3. C rejoined with fresh headroom; the next expiry found a candidate and re-recruited the subtree to C (`re-recruited 2x`).
4. C evaluated and replied nested; `settle_nested` closed the slot with the `<nested parallel-result>` summary; the root job completed with correctly nested results.
5. Final slot line: `ok ... (re-recruited 2x) (deadline reset 4x)`.
The deadline-reset observability also retroactively resolved the prior day's ambiguity: the silent multi-minute hold on the input-gated binary had been the pass *not firing*; the timer-driven pass speaks every cycle. With this run, v0.32's recruit / supervision / timeout machinery — placement by honest headroom, gossip-death and job-timeout re-recruits, wedged-holder exclusion, fail-closed holds, nested-subtree settlement, first-write-wins, and both observability counters — is hardware-validated end to end.
## Open consideration: placement proportionality (recorded 2026-06-10, no action yet)
Observed on hardware: first-sufficient placement keeps the largest part local whenever it fits the local tally, exporting only the trivial parts — a node at 92% headroom kept a 600MB subtree on a 456MB box while shipping its 50MB leaf to a peer. Tier-2 placement (most-headroom among ABUNDANT peers) is never consulted when tier-1 says "I fit", so part size and destination capacity are never matched against each other.
The candidate refinement is size-aware ordering: place the largest parts where headroom is greatest, **including remote** — i.e. rank parts by estimated size descending and assign against peers (and self) ranked by headroom descending, rather than greedily keeping whatever fits locally. Costs to weigh before acting: it needs a per-part size estimate (the same easy-to-get-wrong footprint problem the admission TODO defers), it trades locality (no network hop) for balance, and the observed behavior is *correct* under the resource model — just lopsided. Recorded so the next placement pass starts here; not acted on because no failure has yet been attributed to it.
## Build order (smallest first; each a reviewable milestone)
1. **s-expr ingress on a single unit** — receive an s-expression instruction, evaluate in the Forth VM, return an s-expr result. No distribution, no mesh. The seam everything hangs off.
2. **Recruitment primitive** — a unit hands a sub-instruction to one peer and gets a result back.
3. **Let-it-crash supervision** — re-recruit on peer timeout.
## Principles held throughout
Zero dependencies. No central coordinator (every behavior is a local decision from local pressure + gossip). Honesty selected, not policed. Confirm before release. Fail closed. The resource ceiling is a refusal wall, not a target — and it governs replication and work-recruitment alike, not just placement.
## Explicitly rejected during design (do not revisit without reason)
- **Thousands of small units as the goal** — replaced by depth over breadth; the size/count tradeoff is a continuous knob the workload turns, not a fixed target.
- **Binary self-replication across the wire** — worm hazard; the binary boundary (local-only forking) is the containment.
- **Each-unit-is-each-host / triad-per-host** — considered and dropped; units are processes, many can share a host, replication is local and ceiling-governed.
- **A per-job coordinator role** — replaced by uniform recruit-when-needed, which needs no special unit.
- **Shared ephemeral-unit pool across co-located units** — would require coordination/shared state, violating self-containment; each unit owns its own colony.