ilo 26.5.0

ilo - the token-minimal programming language AI agents write
# The ilo Manifesto

## The Audience Is Not Human

Every programming language in use today was designed for people. The syntax, the error messages, the tooling - all optimised for a brain that reads left-to-right, tracks visual indentation, and cares about aesthetics.

AI agents are not that brain. They produce tokens sequentially. They consume tokens from a finite context window. Every token they spend - generating, reading, retrying - costs real time and real money.

ilo is designed for them.

## The Only Metric

**Total tokens from intent to working code.**

```
Total cost = spec loading + generation + context loading + error feedback + retries
```

Every design decision is evaluated against this number. If a feature reduces it, it's in. If it increases it, it's out. No exceptions for elegance, readability, or convention. Note: until agents are trained on ilo, spec clarity is itself a token cost - a confusing spec means more retries. Some decisions that look like "readability" concessions are actually optimising the spec-loading term.
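The formula can be made concrete with a small Python sketch. The class name `LoopCost` and every token count below are hypothetical, chosen only to illustrate how a terser design can still lose on total cost once retries are counted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoopCost:
    """Token counts for one intent-to-working-code loop (all values hypothetical)."""
    spec_loading: int
    generation: int
    context_loading: int
    error_feedback: int
    retries: int

    def total(self) -> int:
        # Direct transcription of the formula: the five terms are simply summed.
        return (self.spec_loading + self.generation + self.context_loading
                + self.error_feedback + self.retries)


# Design A: terser output, but a vaguer spec leads to more retries (assumed numbers).
design_a = LoopCost(spec_loading=800, generation=120, context_loading=300,
                    error_feedback=150, retries=400)
# Design B: slightly longer output, clearer spec, fewer retries (assumed numbers).
design_b = LoopCost(spec_loading=900, generation=150, context_loading=300,
                    error_feedback=50, retries=100)

cheaper = min((design_a, design_b), key=LoopCost.total)
print(design_a.total(), design_b.total())  # 1770 1500
```

Under these assumed numbers, design B wins despite costing more to generate and to load, which is the sense in which a "readability" concession can reduce the total.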

## The Six Principles

### 1. Token-Conservative

The north star. Every choice evaluated against total token cost across the full loop - not just "short syntax," but including retries, error feedback, and context loading.

A named argument like `amount: 42` costs more tokens than positional `42`. We initially worried positional args would cause parameter-swap errors - but across 10 syntax variants and 4 task types, positional args scored 10/10 generation accuracy. The swap concern was unfounded. Positional args are the single biggest token saver.

**What the agent cares about:** "How many tokens will this cost me end-to-end?"
**How this helps:** The language is as terse as possible *without increasing retry rate*. Where there's a tradeoff between generation cost and error rate, we optimise for total cost.

**Prefix notation** eliminates parentheses and saves tokens at every nesting level. `(a * b) + c` becomes `+*a b c` - 4 fewer characters, 1 fewer token. Deeper nesting saves more: `((a + b) * c) >= 100` becomes `>=*+a b c 100` - 7 fewer characters, 3 fewer tokens. Across 25 expression patterns, prefix notation saves 22% of tokens and 42% of characters vs infix. See the [prefix-vs-infix benchmark](research/explorations/prefix-vs-infix/) for the full analysis.
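The character savings quoted above can be reproduced with a short Python sketch. The tuple representation and the function names `to_infix` and `to_prefix` are assumptions made for illustration, not part of ilo's tooling. Token counts depend on the tokeniser, so only characters are measured here.

```python
from typing import Union

Expr = Union[str, tuple]  # a leaf, or (operator, left, right)


def to_infix(e: Expr) -> str:
    """Render with spaces around operators and parentheses around every nested operand."""
    if isinstance(e, str):
        return e
    op, left, right = e

    def wrap(child: Expr) -> str:
        text = to_infix(child)
        return text if isinstance(child, str) else f"({text})"

    return f"{wrap(left)} {op} {wrap(right)}"


def to_prefix(e: Expr) -> str:
    """Render the operator first; operands are separated by one space, no parentheses."""
    if isinstance(e, str):
        return e
    op, left, right = e
    return f"{op}{to_prefix(left)} {to_prefix(right)}"


examples = [
    ("+", ("*", "a", "b"), "c"),
    (">=", ("*", ("+", "a", "b"), "c"), "100"),
]
for expr in examples:
    infix, prefix = to_infix(expr), to_prefix(expr)
    print(f"{infix!r} -> {prefix!r}: {len(infix) - len(prefix)} fewer characters")
```

The two examples print savings of 4 and 7 characters, matching the figures in the paragraph above.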

**Guards instead of if/else** eliminate nesting depth. In a traditional language, conditional logic stacks:

```python
if a:
    if b:
        if c:
            return x
```

Each level adds indentation (plus a closing brace in brace-delimited languages) and more state for the agent to track. In ilo, guards are flat statements that return early and chain vertically:

```
>=a 0 x   -- if a >= 0, return x; otherwise continue
>=b 0 y   -- if b >= 0, return y; otherwise continue
z         -- default return
```

No nesting. No closing braces to match. Each guard is a single statement the agent can emit and forget. Depth stays constant regardless of how many conditions there are.
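For readers who think in Python, the guard chain above corresponds roughly to a sequence of early returns. This is a sketch of the semantics as described in the comments of the ilo example. The function name `pick` and the string return values are hypothetical.

```python
def pick(a: int, b: int, x: str, y: str, z: str) -> str:
    """Flat guard chain: each test either returns or continues to the next line."""
    if a >= 0:
        return x   # first guard:  >=a 0 x
    if b >= 0:
        return y   # second guard: >=b 0 y
    return z       # default return


print(pick(1, -1, "x", "y", "z"))   # x
print(pick(-1, 1, "x", "y", "z"))   # y
print(pick(-1, -1, "x", "y", "z"))  # z
```

Adding a fourth or fifth condition adds one more line at the same depth, which is the property the guard form is meant to preserve.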

**Match instead of switch** eliminates fall-through - a common source of bugs in C-style languages where missing a `break` causes execution to bleed into the next case. In ilo, match arms are independent of one another and the match as a whole is exhaustive. There is no fall-through because there is no execution path between arms.

**Naming rule:** prefer single-word identifiers. Across all major LLM tokenisers (OpenAI, Anthropic), common English words are 1 token. A hyphenated compound is at least 2 tokens, because the hyphen forces a token split, so a single hyphen at least doubles the cost of a one-word name. Abbreviations (`uid` vs `user`) save characters but not tokens - tokenisers encode common words as single tokens either way. Both styles score 10/10 in generation accuracy.

**On hyphenated stdlib names — by design, not contradiction.** Stdlib ships a small number of hyphenated builtins: `default-on-err`, `now-ms`, `get-many`, `dur-parse`, `dur-fmt`, `env-all`, `dtparse-rel`, `rgxall-multi`, `b64-dec`, `b64u-dec`, `hmac-sha256`, `ct-eq`, `last-dom`, `next-business-day`, `day-of-week`, `get-to`, `mget-or`, `lget-or`, `rand-bytes`, and a handful of others. This appears to contradict the naming rule, but the contradiction dissolves under the same token-cost lens applied to the residual English keywords: stdlib builtins are a **closed, memorised vocabulary** learned from the spec, not generated by pattern. An agent does not hallucinate `default-on-err`; it either knows it from the spec or does not. The semantic load of `default-on-err` over `doe`, `dur-parse` over `durp`, or `hmac-sha256` over `hs256` is real: shorter opaque aliases scored lower on first-use generation accuracy in internal testing because agents conflated them with other single-token identifiers. For user-written identifiers the rule holds — agents generate names freely, and every hyphen is an extra token they pay every time. For the stdlib the spec-loading cost of an ambiguous short alias outweighs the per-call token saving. **The set of hyphenated stdlib names is frozen: no new hyphenated builtins will be added.** Existing names will not be renamed without a deprecation window; the status quo is intentional.
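The split between the frozen stdlib names and user-written identifiers could be checked mechanically. The following Python sketch is a hypothetical lint helper, not part of ilo's tooling. The set contains only the builtins named above, whereas the text notes there are a handful of others.

```python
# Hyphenated builtins named in the paragraph above (the real list is slightly longer).
HYPHENATED_STDLIB = frozenset({
    "default-on-err", "now-ms", "get-many", "dur-parse", "dur-fmt", "env-all",
    "dtparse-rel", "rgxall-multi", "b64-dec", "b64u-dec", "hmac-sha256", "ct-eq",
    "last-dom", "next-business-day", "day-of-week", "get-to", "mget-or", "lget-or",
    "rand-bytes",
})


def flag_hyphenated(identifiers: list[str]) -> list[str]:
    """Return identifiers that contain a hyphen and are not known stdlib names."""
    return [name for name in identifiers
            if "-" in name and name not in HYPHENATED_STDLIB]


print(flag_hyphenated(["user", "now-ms", "create-user", "total"]))  # ['create-user']
```

Because the stdlib set is frozen, a check like this would never need updating for new hyphenated builtins, only for user code.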

### 2. Constrained

Small vocabulary. Closed world. Few ways to do things.

When an agent generates the next token, how many valid options are there? Fewer valid next-tokens means fewer wrong choices means fewer retries. This isn't about limiting expressiveness - it's about making the right token obvious.

- **Closed world.** Every callable function is known ahead of time. The agent cannot hallucinate an API that doesn't exist.
- **Small vocabulary.** Fewer keywords, fewer constructs, one way to define a function, one way to call it, one way to handle errors. Where multiple forms exist (braceless guards vs braced guards, prefix ternary vs braced ternary, short aliases vs long aliases), each serves a different context - inline vs file, simple vs complex - rather than offering interchangeable alternatives.
- **Verification before execution.** All calls resolve, all types align, all dependencies exist - checked before running anything.

**What the agent cares about:** "At each generation step, how many valid tokens are there?"
**How this helps:** The language becomes a set of rails. Constrained generation can feed valid next-token sets back to the agent, making it *impossible* to generate invalid code.
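A closed world makes "all calls resolve" a simple table lookup. The Python sketch below illustrates the idea on nested tuples. The representation, the function name `verify`, the error strings, and the arities in `KNOWN` are all assumptions for illustration and do not describe ilo's actual verifier.

```python
from typing import Union

Call = Union[str, tuple]  # a variable or literal, or (function_name, arg, arg, ...)

# Hypothetical closed world: function name -> number of parameters (arities assumed).
KNOWN = {"len": 1, "cat": 2, "hd": 1}


def verify(expr: Call, known: dict[str, int]) -> list[str]:
    """Collect every unresolved call or arity mismatch without executing anything."""
    if isinstance(expr, str):
        return []
    name, *args = expr
    errors = []
    if name not in known:
        errors.append(f"unknown function: {name}")
    elif known[name] != len(args):
        errors.append(f"{name}: expected {known[name]} args, got {len(args)}")
    for arg in args:
        errors.extend(verify(arg, known))
    return errors


print(verify(("len", ("cat", "a", "b")), KNOWN))     # []
print(verify(("len", ("concat", "a", "b")), KNOWN))  # ['unknown function: concat']
```

The same table that drives verification could, in principle, drive constrained generation: at a call position, the valid next tokens are exactly the keys of the table.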

**On stdlib depth — acknowledged tension, not a contradiction.** "Small vocabulary" and "closed world" are real constraints, but they trade against a competing pressure: *stdlib depth*. Agents working on non-trivial tasks reach for common operations — byte-reversal, pairwise iteration, quantile aggregation, multi-key lookups — and when the builtin doesn't exist they hand-roll a workaround. That workaround costs tokens (typically 40–100 extra) and introduces surface area for errors, exactly what the Constrained principle is meant to prevent. The honest read: a vocabulary that is *too* small forces agents to reinvent primitives, which is its own form of unconstrained generation.

The resolution is a decision criterion rather than a blanket policy. A new builtin is warranted when: **(a)** it appears as a hand-rolled workaround in **≥ 3 independent persona transcripts**, **(b)** each workaround costs **> 40 tokens** to express, and **(c)** there is no composition of existing builtins that reduces the workaround below that threshold. Below that bar, user code is the right home. Above it, adding the builtin reduces total token cost — consistent with the Constrained principle's own logic. Gaps that cross the threshold are tracked in [`persona-runs/ab-shared-issues.md`](persona-runs/ab-shared-issues.md); new proposals should be triaged against these criteria rather than argued on intuition. **The principle remains: the builtin set is closed and small. What changes is the explicit acknowledgement that "small" is a measurement outcome, not a fixed number, and that the measurement is grounded in dogfood data.**
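The three criteria translate directly into a predicate. This Python sketch assumes a proposal is summarised by the token cost of the workaround in each independent transcript plus the cheapest known composition of existing builtins. The function name and that input shape are hypothetical, and treating a composition of exactly 40 tokens as "below the bar" is one reading of the threshold.

```python
from typing import Optional

MIN_TRANSCRIPTS = 3   # criterion (a): at least 3 independent persona transcripts
TOKEN_THRESHOLD = 40  # criteria (b) and (c): the workaround must cost more than 40 tokens


def warrants_builtin(workaround_costs: list[int],
                     best_composition_cost: Optional[int]) -> bool:
    """workaround_costs: token cost of the hand-rolled workaround in each transcript.
    best_composition_cost: cheapest composition of existing builtins, or None if none exists.
    """
    seen_often_enough = len(workaround_costs) >= MIN_TRANSCRIPTS
    each_is_expensive = all(cost > TOKEN_THRESHOLD for cost in workaround_costs)
    composition_is_cheap = (best_composition_cost is not None
                            and best_composition_cost <= TOKEN_THRESHOLD)
    return seen_often_enough and each_is_expensive and not composition_is_cheap


print(warrants_builtin([55, 62, 48], None))  # True
print(warrants_builtin([55, 62, 48], 30))    # False: an existing composition is cheap enough
```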

### 3. Self-Contained

Each unit carries its own context: deps, types, rules.

An agent working on function A shouldn't need to load functions B through Z to understand what A does. The less context required per step, the fewer tokens consumed, the more of the context window is available for the actual task.

- **Explicit dependencies.** Each function declares exactly what it needs - by name, with types. No globals, no ambient state, no implicit imports. I/O boundaries (`env`, `rd`, `now`) are the deliberate exception - they access the outside world, but the program itself has no mutable shared state.
- **Small units.** A function that fits in a few dozen tokens can be loaded, understood, and modified cheaply.
- **Spec as context.** Until foundation models are trained on ilo, agents need the spec somewhere they can access it - bundled with the program, fetched on demand, or installed locally.

**What the agent cares about:** "How much context do I need to load to work on this unit?"
**How this helps:** Minimal context loading per task. Each unit is self-describing. The agent never needs to hunt for definitions elsewhere.

### 4. Language-Agnostic

Minimise dependency on English or any natural language.

Early variants used short English-derived keywords (`fn`, `let`, `match`, `for`, `if`). Experiments showed that structural tokens outperform keywords for control flow, and the winning syntax (idea8/idea9) replaced most control-flow keywords with single-character sigils:

- `?` conditional/match, `!` auto-unwrap, `~` ok-wrap, `^` err/throw, `@` iterate, `>` pipe/return
- The control-flow keywords that survive are reduced to abbreviations: `wh`, `ret`, `brk`, `cnt`
- Builtins are short abbreviated names: `len`, `hd`, `tl`, `map`, `flt`, `fld`, `srt`, `cat`, `spl`, `rd`, `wr`, `env`, `prnt`, `fmt`, `str`, `num`, etc. These read as domain vocabulary, not English prose - an agent learns them as fixed tokens from the spec, not from natural-language understanding.
- Remaining keywords: `type`, `tool`, `with`, `use`, `nil`, `true`, `false`, plus type sigils `L`, `R`, `O`, `M`, `S`, `F`
- Agents learned the full vocabulary from spec + examples with 10/10 accuracy

Sigils won for control flow because they are unambiguous single tokens that cannot be confused with variable names or hallucinated into natural-language variations. Builtins use short names rather than sigils because there are too many to encode as single characters - but they are a closed, memorisable set, not open-ended English.

**On the residual English keywords - by design, not by oversight.** The remaining keywords (`type`, `tool`, `with`, `use`, `nil`, `true`, `false`, `wh`, `ret`, `brk`, `cnt`) are English-origin. This appears to contradict "language-agnostic," but the contradiction dissolves under the token-cost lens: replacing `true` with `+` or `nil` with `_` would save zero tokens (each is already a single token in every major tokeniser) while destroying recognisability in the spec, increasing spec-loading cost. Every alternative representation tested (sigils for booleans and nil, arbitrary single characters for declaration keywords) scored lower on generation accuracy because agents confused them with operators and identifiers. The kept keywords are a fixed, closed set; an agent learns them from the spec as opaque tokens, not from English semantics. The principle is "minimise *dependency* on natural language": these 11 keywords are not natural language in use; they are a memorised vocabulary. Replacing them would trade a theoretical purity gain for a real accuracy cost. **The set is frozen: no new English keywords will be added.**

**What the agent cares about:** "Can I learn this language from its spec and examples, regardless of my training?"
**How this helps:** The spec is small enough to bundle with any program. Keywords are learned from structure, not from natural language understanding.

### 5. Graph-Reducible

When analysing code, reduce context size by loading only the relevant subgraph.

Writing code costs the same tokens regardless of program size. But *reading* code - understanding what exists, what calls what, what breaks if you change something - scales with program size. In a traditional language, the agent loads entire files or entire repos to answer simple questions. ilo makes the dependency graph explicit and queryable, so the agent loads only what it needs.

- **Typed signatures as contracts.** Every function declares its params and return type. To understand what `create-user` does, an agent only needs its signature - not the 20 other functions in the file. Signatures are the graph's edges.
- **Explicit dependencies.** `use "auth.ilo" [vld-email vld-plan]` declares exactly what's imported. No scanning, no guessing. The call graph is derivable from the AST without execution.
- **Queryable structure.** `ilo graph` extracts call graphs, type dependencies, reverse callers, and transitive subgraphs as JSON. An agent modifying one function in a 30-function program loads 6-10% of the code instead of 100%.

**What the agent cares about:** "How much do I need to read before I can write?"
**How this helps:** The agent loads the target function's source, its dependencies' signatures, and the types it references - nothing more. As programs grow, the savings compound: the subgraph stays small even as the program gets large.
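The subgraph idea can be sketched in a few lines of Python. The call graph below is hypothetical (only `vld-email` and `vld-plan` appear in the text) and the functions are illustrative stand-ins. They do not reproduce the output format of `ilo graph`.

```python
# Hypothetical call graph: function -> functions it calls directly.
CALLS = {
    "signup": ["vld-email", "vld-plan", "save"],
    "vld-email": ["rgx"],
    "vld-plan": [],
    "save": ["wr"],
    "report": ["fmt"],
    "rgx": [], "wr": [], "fmt": [],
}


def transitive_deps(target: str, calls: dict[str, list[str]]) -> set[str]:
    """All functions reachable from target by following call edges."""
    seen: set[str] = set()
    stack = list(calls.get(target, []))
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(calls.get(name, []))
    return seen


def reverse_callers(target: str, calls: dict[str, list[str]]) -> set[str]:
    """Functions that call target directly: what might break if target changes."""
    return {caller for caller, callees in calls.items() if target in callees}


print(sorted(transitive_deps("signup", CALLS)))
print(reverse_callers("save", CALLS))
```

In this toy graph, an agent editing `signup` never needs to look at `report` or `fmt`, and the set it does need stays the same size no matter how many unrelated functions are added.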

### 6. Structured Compiler-to-Agent Surface

Every path from the compiler to an agent is machine-readable by default.

Diagnostics, AST output, call graphs, fix plans, skill content, size reports — all of it ships as structured JSON. Not as an optional flag you might forget to pass, but as the default contract. Prose output exists for human TTYs; JSON is the agent path.

The concrete commitments:

- Every CLI subcommand has a `--json` mode. An agent driving `ilo check`, `ilo graph`, `ilo bench`, or any other subcommand gets a typed, parseable response with no screen-scraping.
- Every emitted artifact carries a stable `schemaVersion` field. Schemas evolve; the version field lets agents detect and adapt to changes rather than silently misparse them.
- Every diagnostic carries machine-readable fields: error code, source span, candidate fixes, and related locations. An agent reading a diagnostic knows exactly what went wrong, where it went wrong, and what to try next — without parsing human prose.
- Fix plans are typed ([ILO-360](https://linear.app/ilo-lang/issue/ILO-360/typed-fix-plans-fixsafety-taxonomy-phase-2)): each suggested repair carries a `FixSafety` classification so the agent can decide autonomously whether to apply it.
- Golden diagnostics and provenance matrices are structured ([ILO-363](https://linear.app/ilo-lang/issue/ILO-363/provenance-matrix-golden-file-diagnostics-phase-4)): regression tests compare JSON, not text, so the error contract is explicit and auditable.
- Closed-loop benchmarks emit structured cost tables ([ILO-364](https://linear.app/ilo-lang/issue/ILO-364/closed-loop-benchmark-ilo-vs-zero-per-task-economics-phase-5)): token cost per task, per phase, per engine — queryable, not just printable.

This principle is downstream of Constrained — a closed-world verifier is what makes deterministic, schema-stable output possible — but it earns its own line because the structured-output discipline is the single biggest driver of per-task cost reduction in cached steady-state. When an agent can parse one JSON response instead of retrying after a misread prose error, that saves more tokens than any syntax decision. Per the economics analysis in `zero-gap-specs/lessons-from-zero.md`, tooling structure dominates the steady-state cost table.

The corollary: future CLI surface additions ship `--json` from day one, not as a follow-up. Structured output is not polish — it is load-bearing.

**What the agent cares about:** "Can I parse the compiler's response without writing a regex?"
**How this helps:** Zero screen-scraping, zero retry cycles caused by format ambiguity. The agent reads typed JSON, acts on it, and moves on.
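As an illustration of the agent side of this contract, the Python sketch below consumes a diagnostic payload. The payload shape is hypothetical: only the `schemaVersion` field and the idea of a per-fix safety classification come from the text, and every other field name and value is an assumption rather than ilo's actual schema.

```python
import json

# Hypothetical payload. Field names other than `schemaVersion` are assumed.
raw = """
{
  "schemaVersion": 1,
  "diagnostics": [
    {
      "code": "E001",
      "span": {"line": 3, "start": 4, "end": 9},
      "fixes": [
        {"replacement": "len", "safety": "safe"},
        {"replacement": "hd", "safety": "needs-review"}
      ]
    }
  ]
}
"""

SUPPORTED_SCHEMA = 1


def auto_applicable(payload: str) -> list[dict]:
    """Return fixes marked safe, refusing payloads with an unexpected schema version."""
    doc = json.loads(payload)
    if doc["schemaVersion"] != SUPPORTED_SCHEMA:
        raise ValueError(f"unsupported schemaVersion: {doc['schemaVersion']}")
    return [
        {"code": d["code"], "span": d["span"], "replacement": f["replacement"]}
        for d in doc["diagnostics"]
        for f in d["fixes"]
        if f["safety"] == "safe"
    ]


print(auto_applicable(raw))
```

Checking the version before reading any other field is what lets the agent fail loudly on a schema change instead of silently misparsing it.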

## Principles We Considered and Dropped

**Deterministic** - falls out naturally from constrained + self-contained. An agent doesn't think about determinism; it thinks "did this work?" If the language is constrained and self-contained, determinism follows.

**Append-only** - solved by small self-contained units. If units are small enough, regenerating them is cheap and safe. No need for a structural constraint.

**Immediate feedback** - a property of the runtime/tooling, not the language itself. Important for the ecosystem, but not a language principle.

## The Name

*ilo* is Toki Pona for "tool" ([sona.pona.la/wiki/ilo](https://sona.pona.la/wiki/ilo)).

Toki Pona is a constructed language built around radical minimalism. ~120 words. 14 phonemes. Complex ideas expressed by combining simple terms. It constrains human expression to force clarity of thought.

ilo does the same for machine programmers. A minimal, verified vocabulary. Complex programs built by composing small, self-contained units. The constraint is the feature.

## What ilo Is Not

**Not a framework for building AI agents.** There are plenty of those. ilo is a language for agents to write programs *in*.

**Not optimised for human readability.** Humans can read it - it's not obfuscated - but no decision is made because it "looks cleaner" or "reads more naturally." If a design is uglier but costs fewer total tokens, it wins. Newlines, indentation, and multi-line comments are human concerns - agents don't need them. An entire ilo program can be one line. The formatter provides expanded output (`--expanded` / `-e`) when humans need to review.

**Not theoretical.** Every principle here addresses measured failure modes in AI-generated code: hallucinated APIs, context window exhaustion, wasted retry cycles from vague errors.

## What ilo Is

A **minimal, verified action space** - the smallest set of constructs an agent needs to express computational intent, with relationships made explicit and everything else stripped away.
