# Architecture
The *why* of forge. For *what* lives where, see [WORKFLOW.md](WORKFLOW.md) and the source. For specific past decisions, see [decisions/](decisions/).
## What forge is
A reusable Python library for self-hosted LLM tool-calling and multi-step agentic workflows. Forge owns the tool-calling loop — retry logic, context management, step enforcement, and client adapters. It does not own intent routing, model selection, or domain logic; downstream projects build those on top.
**Target hardware:** 12–32GB VRAM (consumer GPUs).
**Backends:** llama-server (llama.cpp), Llamafile, Ollama, Anthropic.
**Surfaced three ways:** `WorkflowRunner` (own the loop), OpenAI-compatible proxy (drop-in for existing harnesses), middleware (`Guardrails` facade for foreign loops).
---
## Design Principles
These are the load-bearing commitments. Most of forge's specific decisions fall out of these five.
### 1. Fail Fast, Fail Loud
No defensive coding. No silent `try/except`, no fallback defaults, no swallowed errors. If the model returns garbage, the retry loop handles it explicitly. If retries are exhausted, forge raises a typed exception with full context (attempt count, last error, last raw response). Silent failures in agentic loops are devastating — a swallowed error at step 3 corrupts every subsequent step.
```python
# BAD — defensive
try:
tool_call = parse_tool_call(response)
except Exception:
tool_call = ToolCall(tool="fallback", args={}) # Silent corruption
# GOOD — fail fast with context
try:
tool_call = parse_tool_call(response)
except ParseError as e:
raise ToolCallError(
f"Failed to parse tool call on attempt {attempt}/{max_retries}",
raw_response=response,
cause=e,
)
```
### 2. Explicit Over Implicit
All schemas defined with Pydantic. All LLM outputs validated before execution. Configuration is explicit: when forge auto-detects hardware, it logs what it detected and what budget it chose. The user can always override.
Cloud APIs absorb ambiguity gracefully. A 14B model at Q4 does not. Every implicit assumption is a failure mode.
Concrete consequence: `recommended_sampling=True` is opt-in, never default. With it on, an unknown model raises `UnsupportedModelError` rather than silently inheriting backend defaults. See [ADR-014](decisions/014-recommended-sampling-opt-in.md).
### 3. Control Flow Is Not Memory
Forge separates *what the model remembers* (message history, subject to compaction) from *what the runner enforces* (step completion, iteration count, terminal conditions). The model's context is a resource to be managed. Control-flow state is authoritative and lives outside the message history.
Concrete consequence: step completion is tracked in a `StepTracker` on the runner. Compaction may aggressively drop a tool result, but `StepEnforcer` checks `completed_steps` from the tracker, not from what the model "remembers."
**Tradeoff:** The model may redundantly re-call a tool whose result was compacted. This wastes an iteration but doesn't corrupt the workflow. Tools that are expensive to re-run should be idempotent — that's a downstream contract forge documents but doesn't enforce.
### 4. The Client Adapter Is the Abstraction Boundary
Forge doesn't know whether the LLM supports native function calling, prompt-injected tool calling, or some future protocol. The `LLMClient` adapter translates between forge's internal `ToolCall` representation and whatever the backend expects. The tool-calling loop receives validated `ToolCall` objects and never parses raw text.
This means rescue parsing (Mistral `[TOOL_CALLS]`, Qwen `<tool_call>` XML, fenced JSON, etc.) lives in the *client* — `runner.py` doesn't grow special cases per model family. See [ADR-005](decisions/005-parallel-tool-calls.md) for the batched-tool-call shape and [ADR-013](decisions/013-text-response-intent.md) for the synthetic respond tool's role in this boundary.
### 5. Context Is a First-Class Resource
On consumer hardware, KV cache competes with model weights for VRAM. A 15-step workflow can easily hit 10–20K tokens, pushing a 14B model at Q4 off GPU and into RAM (5–20× slower). Context management is not optional — it's load-bearing infrastructure.
Forge budgets context proactively. The compaction strategy is owned by the strategy object (not the manager), so swapping `TieredCompact` for `SlidingWindowCompact` or a custom strategy is a constructor change.
---
## Surface Modes
Three integration modes, three control / convenience tradeoffs. All three share the same underlying guardrail logic via the middleware layer.
```
forge.guardrails/ <-- extracted guardrail logic (shared)
^ ^
forge.proxy forge.core.runner
(proxy mode) (WorkflowRunner)
```
- **`WorkflowRunner`** (forge owns the loop) — full feature set: step enforcement, prerequisites, context compaction, threshold callbacks, cancellation, streaming, on_message observability. Best when building on forge directly.
- **Proxy server** (drop-in) — OpenAI-compatible `/v1/chat/completions` endpoint. Applies validation, rescue parsing, retry loop, and synthetic `respond` injection per request. Single-shot — workflow-spanning features (step enforcement, prerequisites, session memory) are out by design because the OpenAI chat-completions schema doesn't carry that state. See [ADR-012](decisions/012-openai-proxy.md).
- **Middleware** (`Guardrails` facade) — for callers running their own loop. Two-method API (`check()` / `record()`) wrapping `ResponseValidator`, `StepEnforcer`, `ErrorTracker`. Returns `Nudge` objects the caller routes however its framework expects. See [ADR-011](decisions/011-guardrail-middleware.md).
The middleware is the foundation; proxy and runner compose the same components.
---
## Guardrails: What They Are and Why
| **Rescue parsing** | Model emits a tool call in the wrong format (fenced JSON, Mistral `[TOOL_CALLS]`, Qwen XML) | Modern API expects structured `tool_calls`; older models still emit inline JSON. Without rescue, the call dies before reaching the tool. |
| **Response validation** | Tool name unknown, tool args malformed | Validates the model's intent before tool execution; routes the corrective message back on the canonical wire shape. |
| **Retry nudges** | Bare text instead of a tool call | Surfaced on the `user` channel — the model needs a positive instruction ("try a tool call") rather than a tool-error reply pattern. |
| **Step enforcement** | Premature `terminal_tool` call | Surfaced as a tool-error reply (`role="tool"`, `[StepEnforcementError]` prefix). Tool-error shape is what OpenAI-tool-trained models pattern-match on for "your call failed, try again" — outperforms trailing user nudges in the wire. |
| **Prerequisites** | Tool A called before prerequisite tool B | Same wire shape as step enforcement (`[PrereqError]`). Constraint is enforced via tool-error reply, not via the tool schema — see [ADR-006](decisions/006-tool-prerequisites.md). |
| **Compaction** | Conversation approaching context budget | Tiered, deterministic; no LLM call. Strategies own their own threshold logic. |
Each guardrail can be independently disabled via ablation presets in `tests/eval/ablation.py` — that's what produces the per-guardrail contribution numbers in the eval reports.
---
## Compaction Strategy Choice
Three built-in strategies; downstream consumers can supply their own by implementing the `CompactStrategy` interface.
- **`NoCompact`** — passthrough. Use when VRAM is abundant or workflows are short.
- **`SlidingWindowCompact`** — keeps system prompt, original user input, and the last N iterations. Simple, predictable. Good baseline.
- **`TieredCompact`** (default) — three-phase escalating compaction with an explicit priority order:
| Cut first | `step_nudge`, `prerequisite_nudge`, `retry_nudge` | Drop | Drop | Drop |
| Cut second | Older `tool_result` | Truncate ~200 chars | Drop | Drop |
| Cut third | `text_response` | Preserved | Preserved | Drop |
| Cut fourth | `reasoning` | Preserved | Preserved | Drop |
| Preserved | Older `tool_call` | Preserved | Preserved | Preserved (full) |
| Never cut | `system_prompt`, `user_input` | Preserved | Preserved | Preserved |
| Never cut | Recent iterations (`keep_recent`) | Preserved | Preserved | Preserved |
**Key design choice — reasoning survives through Phase 2.** The model's chain-of-thought from step 3 ("price below web but above historical") is what informs decisions at step 5+. Losing raw tool results is recoverable; losing the model's interpretation of those results is not. `text_response` (a failed tool call attempt) is expendable after the retry nudge corrects the model.
**Phase 3 is the emergency cutoff** — should only fire under extreme VRAM pressure.
All three phases are deterministic text manipulation — no LLM calls, sub-millisecond.
---
## The Synthetic `respond` Tool
Why it exists: when tools are present but the user sends a conversational message, small models must choose between calling a tool and responding with text. They frequently choose wrong. Eval testing showed that trusting the model's finish reason dropped workflow completion from 100% to as low as 4%.
The respond tool eliminates the open-ended choice. The model calls `respond(message="...")` instead of producing bare text. From forge's perspective, every response is a valid tool call — no retries wasted on conversational turns, no accuracy loss on tool-calling turns.
**Why this works for small models:** small models struggle with open-ended decisions ("tools or chat?") but are good at structured choices ("which tool?"). The respond tool converts an open-ended decision into a structured one. The model stays in tool-calling grammar at all times, which is where it performs best.
Full rationale and the bare-text eval data: [ADR-013](decisions/013-text-response-intent.md).
---
## Sampling Defaults
Each model family has its own card-recommended `temperature` / `top_p` / `top_k`. Running everything at a single default (the usual `0.7`) is a measurable handicap for most models. Forge ships a per-model recommendations map keyed on three identity forms: Ollama-style strings, GGUF stems, llamafile stems. Same value, three keys — vendor-specific guidance can diverge per backend without forcing alignment.
The flag is opt-in (`recommended_sampling=True`):
- **Off** (default) — forge stays out of the way; backend defaults apply. If forge has opinions about this model, it logs a one-shot INFO message pointing at the flag.
- **On, model known** — values applied; caller's explicit non-None kwargs win field-by-field.
- **On, model unknown** — raises `UnsupportedModelError`. Falling through to backend defaults silently would defeat the explicit opt-in.
Proxy mode doesn't consult the map — it plumbs whatever sampling params arrive in the request body. The calling client is expected to look up `get_sampling_defaults(model)` and include them in the body.
Full rationale: [ADR-014](decisions/014-recommended-sampling-opt-in.md). For supported models, citation links, and override patterns: [MODEL_GUIDE.md § Sampling Parameters](MODEL_GUIDE.md#sampling-parameters).
---
## Where to find things
- **What lives where in the code** — [WORKFLOW.md § Quick Reference](WORKFLOW.md#quick-reference)
- **Loop shape, message lifecycle, compaction flow** — diagrams in [WORKFLOW.md](WORKFLOW.md)
- **How to use forge** — [USER_GUIDE.md](USER_GUIDE.md)
- **Backends and boot commands** — [BACKEND_SETUP.md](BACKEND_SETUP.md)
- **Past decisions and rationale** — [decisions/](decisions/) (ADRs)
- **Class signatures and exact APIs** — source (`src/forge/`) is authoritative