BattleCommand Forge
Status: stable port at v0.1.0 (April 2026). BattleCommand Forge is the quality-first code-generation branch of an AI-agent project family I've been building since January 2026. This release is a public port of internal pipeline work — the code is field-tested (86/86 unit tests, plus the stress suite documented below) but active feature development has slowed while I focus on sibling projects in the same family. Issues and PRs remain welcome. Related: claudette (local-first personal assistant, on crates.io) · ABCC (the godfather — RTS-style TUI for agent orchestration).
Quality-first AI coding army. A pure Rust single binary (~3.7 MB) that generates production-grade code using a 9-stage quality pipeline with TDD enforcement, multi-file extraction, surgical fix-pass retry, and a complexity-scaled quality gate.
Give it a mission. Get back a tested, reviewed, production-ready project.
$ battlecommand-forge mission "Build a FastAPI CRUD API for managing books with SQLite" --auto
[ROUTER] Complexity: C4 (moderate) — CRUD + SQLite + FastAPI
[ARCHITECT] Spec: 6 files, 77 lines, TDD plan with 12 test cases
[TESTER] 24 tests written (pytest-asyncio, Pydantic v2 fixtures)
[CODER] Generating 6 files (single-shot, 80B model)...
[VERIFIER] venv created, deps installed, ruff clean, 22/24 tests pass
[SECURITY] OWASP review: no critical issues, score 8.5/10
[CRITIQUE] DEV=9.0 ARCH=8.5 TEST=8.0 SEC=8.5 DOCS=7.5 → avg 8.3
[CTO] Approved — coherent, well-structured, ships as-is
[GATE] Score: 9.1/10 (threshold: 9.2) — FIX ROUND 1
[FIX] Surgical fix: 1 file (models.py — dual Base bug)
[VERIFIER] 24/24 tests pass
[GATE] Score: 9.4/10 — SHIPPED ✅
Output: output/fastapi_books_crud/
Table of Contents
- 30-Second Quick Start
- Installation
- Your First Mission
- CLI Commands
- Interactive TUI
- Presets
- Configuration
- Environment Variables
- Common Workflows
- How the Pipeline Works
- Troubleshooting
- Architecture
30-Second Quick Start
# 1. Build
cargo build --release

# 2. Start Ollama (if not running)
ollama serve &

# 3. Pull a model (7B for fast start — upgrade later)
ollama pull qwen2.5-coder:7b

# 4. Run your first mission
./target/release/battlecommand-forge mission "Build a CSV to JSON converter CLI" --preset fast
That's it. Your generated project is in output/.
Installation
Prerequisites
| Requirement | Version | Why |
|---|---|---|
| Rust | 1.91+ | Building the binary |
| Ollama | Latest | Running local models |
| Python | 3.10+ | Generated code + verifier (creates venvs) |
Build from Source
git clone <repo-url>
cd battle-command-forge
cargo build --release
The binary is at ./target/release/battlecommand-forge (~3.7 MB with LTO + strip).
Optional: Copy it to your PATH:
cp target/release/battlecommand-forge /usr/local/bin/bcf
Now you can use bcf from anywhere.
Pull Models
The preset you choose determines which models you need:
# Fast preset ($0, needs 8GB RAM)
ollama pull qwen2.5-coder:7b

# Balanced preset ($0, needs 20GB RAM)
ollama pull qwen2.5-coder:32b

# Premium preset (best quality, needs 48GB+ VRAM for local models)
ollama pull qwen2.5-coder:32b
ollama pull qwen3-coder-next:q8_0
ollama pull qwen3-coder:30b-a3b-q8_0
Premium also uses cloud APIs — set these if you want the best results:
# Add to your ~/.zshrc or ~/.bashrc
export ANTHROPIC_API_KEY="sk-ant-..."   # Claude Opus/Sonnet (~$0.20-0.30/mission)
export XAI_API_KEY="xai-..."            # Grok architect (optional, ~$0.10/mission)
Verify Installation
$ battlecommand-forge status
BattleCommand Forge v0.1.0
Modules: 30 | Pipeline: 9-stage | Gate: 8.0-9.2/10 (scaled) | Fix rounds: 5
Ollama: connected (12 models)
Claude: configured
GitHub: gh authenticated
Workspaces: 47
Total missions: 47 | Avg score: 7.8/10
Total cost: $4.2100
Your First Mission
Example 1: Simple CLI Tool (C3, ~$0)
$ battlecommand-forge mission "Build a CSV to JSON converter CLI" --preset fast --auto
What happens:
- Router scores this as C3 (low complexity)
- Architect writes a concise spec with 4 files
- Tester writes 8 test cases
- Coder generates all files in one shot
- Verifier creates a venv, runs ruff + pytest
- If tests fail, surgical fix targets the exact broken file
- Output lands in output/csv_to_json_cli/
Your output directory:
output/csv_to_json_cli/
├── main.py                 # CLI entry point (click/argparse)
├── converter.py            # Core conversion logic
├── tests/
│   └── test_converter.py   # 8 test cases
├── requirements.txt
└── pyproject.toml
Example 2: REST API (C5, ~$0.30)
$ battlecommand-forge mission "<your REST API prompt>" --auto
Example 3: Complex Auth Service (C8, ~$0.50)
$ battlecommand-forge mission "<your auth service prompt>" --auto
Example 4: Edit Existing Code
# Point at an existing project and describe what to change
$ battlecommand-forge edit --path ./my-project "<what to change>"
Example 5: Custom Output Directory
$ battlecommand-forge mission "<prompt>" -o ./my-output
Example 6: Use a GitHub Repo as Context
$ battlecommand-forge mission "<prompt>" --repo <github-repo-url>
Example 7: Voice Announcements (macOS)
$ battlecommand-forge mission "<prompt>" --voice
# Announces: "Mission started..." "Quality gate passed..." "Mission complete."
CLI Commands
mission — Run the full pipeline
| Flag | Default | Description |
|---|---|---|
| --preset | premium | Model preset: fast, balanced, premium |
| --auto | off | Skip human approval, auto-continue fix rounds |
| -o, --output | auto | Custom output directory |
| --repo | — | Clone a GitHub repo as context |
| --path | — | Use a local directory as context |
| --voice | off | macOS TTS announcements |
| --architect-model | — | Override architect model |
| --tester-model | — | Override tester model |
| --coder-model | — | Override coder model |
| --reviewer-model | — | Override security + critique + CTO |
Examples:
# Fully automatic, premium quality
$ battlecommand-forge mission "<prompt>" --preset premium --auto

# Fast iteration, local only, no API costs
$ battlecommand-forge mission "<prompt>" --preset fast

# Manual mode — approve each stage
$ battlecommand-forge mission "<prompt>"

# Override just the coder model
$ battlecommand-forge mission "<prompt>" --coder-model qwen3-coder-next:q8_0
chat — CTO Research Chat (CLI)
Interactive REPL where you chat with the CTO agent to plan missions before launching. Supports tool calling (web search, file reading, directory listing), conversation history, and launching missions directly from chat.
$ battlecommand-forge chat --preset premium
BattleCommand Forge — CTO Chat (claude-sonnet-4-6)
Plan your mission, research architecture, or ask anything.
Type /mission <prompt> to launch. /clear to reset. /quit to exit.
> What's the best database for a real-time chat app?
[web_search: "best database real-time chat 2026"]
[web_search → Redis for pub/sub + message queue, PostgreSQL for persistence...]
For a real-time chat app, I recommend a dual-database approach:
- **Redis** for pub/sub messaging, presence tracking, and room state
- **PostgreSQL** (async via SQLAlchemy) for message history and user accounts
...
> /mission Build a WebSocket chat server with Redis pub/sub, PostgreSQL message history, FastAPI, rooms, and user presence
Launching mission: Build a WebSocket chat server...
[ROUTER] Complexity: C7 (high)
...
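Under the hood, each chat turn is a standard tool-use cycle: the model either answers or requests a tool, and the tool result is appended to the history before the model is called again. A minimal sketch of that loop (hypothetical types and function names, not the project's actual API):

```rust
// Minimal tool-use chat cycle. `llm` and `run_tool` stand in for the real
// model client and tool dispatcher (web_search, read_file, ...).
enum LlmTurn {
    Answer(String),
    ToolCall { name: String, args: String },
}

fn chat_turn(
    mut history: Vec<String>,
    llm: impl Fn(&[String]) -> LlmTurn,
    run_tool: impl Fn(&str, &str) -> String,
) -> String {
    loop {
        match llm(&history) {
            LlmTurn::Answer(text) => return text,
            LlmTurn::ToolCall { name, args } => {
                let result = run_tool(&name, &args);
                // Feed the tool result back so the model can use it next turn
                history.push(format!("[{name} → {result}]"));
            }
        }
    }
}
```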
Chat commands:
| Command | Action |
|---|---|
| /mission <prompt> | Launch a mission from chat |
| /clear | Clear conversation history |
| /compress | Compact long history |
| /help | List commands |
| /quit | Exit chat |
tui — Interactive Terminal UI
Full-featured 6-tab interface with CTO research chat, model picker, hardware monitoring, and 15 slash commands. See Interactive TUI for details.
edit — Modify Existing Code
# Add tests to an existing project
$ battlecommand-forge edit --path ./my-project "<add tests for module X>"

# Refactor a module
$ battlecommand-forge edit --path ./my-project "<refactor module Y>"
verify — Run Quality Checks
Creates a venv, installs deps, runs ruff (linting) and pytest. Shows score, test results, lint issues, and secret detection.
# Verify a generated project
$ battlecommand-forge verify output/fastapi_books_crud

# Verify your own project
$ battlecommand-forge verify ./my-project
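Conceptually, the verifier is a short sequence of subprocess calls. A rough sketch of the steps (illustrative only — paths and flags here are assumptions, not the project's actual code):

```rust
// Illustrative verifier flow: venv → pip install → ruff → pytest.
use std::process::Command;

fn verify(project: &str) -> std::io::Result<bool> {
    let venv = format!("{project}/.venv");
    // 1. Create an isolated virtualenv for the project
    Command::new("python3").arg("-m").arg("venv").arg(&venv).status()?;
    // 2. Install the project's dependencies (incl. ruff/pytest if listed)
    Command::new(format!("{venv}/bin/pip"))
        .arg("install").arg("-r").arg(format!("{project}/requirements.txt"))
        .status()?;
    // 3. Lint, then run the test suite
    let ruff = Command::new(format!("{venv}/bin/ruff")).arg("check").arg(project).status()?;
    let pytest = Command::new(format!("{venv}/bin/pytest")).arg(project).status()?;
    Ok(ruff.success() && pytest.success())
}
```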
models — List, Benchmark, and Compare
# List all Ollama models
# Benchmark a specific model (speed + quality test)
# Show preset configurations
stress — Run the Stress Test Suite
# Run all 21 graded tasks (C4-C9)
# Quick smoke test (5 tasks)
status — System Health Check
Shows Ollama connection, API key status, workspace count, mission history, and total cost.
report — View Pipeline Reports
# List all reports
# Show the latest report (detailed breakdown)
audit — View Audit Log
# Last 20 entries
settings — Model Configuration
# Show resolved config for a preset
# Generate default .battlecommand/models.toml
github — Push and Create PRs
# Check if gh CLI is available
# Push a workspace
# Create a PR
hw — Hardware Metrics
Shows CPU, RAM, Ollama status, and VRAM usage.
Interactive TUI
Launch with:
$ battlecommand-forge tui
6 Tabs
| Tab | Key | What's There |
|---|---|---|
| Chat | 1 | CTO research agent with 10 tools |
| Queue | 2 | Mission queue and status |
| Models | 3 | Available Ollama models |
| Code | 4 | Generated code viewer |
| HW | 5 | Live hardware metrics |
| Log | 6 | Pipeline activity log |
Chat with the CTO Agent
The Chat tab connects you to an AI CTO agent with tool access. It can search the web, read files, list directories, and launch missions — all from the chat interface.
> What's the best way to structure a FastAPI app with SQLAlchemy async?
CTO: For a production FastAPI + async SQLAlchemy setup, I recommend...
[web_search: "FastAPI async SQLAlchemy best practices 2026"]
...here's the recommended structure:
app/
  main.py           # FastAPI app + middleware
  database.py       # Single AsyncEngine + AsyncSession factory
  models.py         # SQLAlchemy ORM models (import Base from database)
  schemas.py        # Pydantic v2 models (UserRead, UserCreate)
  routes/           # One file per resource
  dependencies.py   # get_db session dependency
...
> /mission Build that FastAPI app with user CRUD and SQLAlchemy async
[Mission launched: Build that FastAPI app...]
CTO Tools (10):
| Tool | What It Does |
|---|---|
| web_search | Search the web (Brave Search or DuckDuckGo fallback) |
| web_fetch | Fetch and read a web page |
| read_file | Read any local file |
| list_files | List directory contents |
| run_mission | Launch a mission from chat |
| refine_prompt | Improve a mission prompt before running |
| verify_project | Run verifier on a project directory |
| list_reports | List pipeline reports |
| open_browser | Open a URL in the default browser |
| status | Show system status |
Slash Commands
Type these in the Chat input:
| Command | What It Does |
|---|---|
| /mission <prompt> | Launch a mission |
| /verify [path] | Run verifier (default: latest output) |
| /report [list\|show] | View pipeline reports |
| /audit [n] | Show audit log (default: 10) |
| /preset <name> | Switch preset (fast/balanced/premium) |
| /cost | Show total API cost |
| /settings | Open model picker overlay |
| /clear | Clear chat + CTO history |
| /compress | Compact CTO conversation history |
| /models | Switch to Models tab |
| /hw | Switch to Hardware tab |
| /status | Show workspace/system info |
| /snake | Play snake! |
| /space | Play Space Invaders! |
| /help | List all commands |
Keyboard Shortcuts
| Key | Action |
|---|---|
| 1-6 | Switch tabs (when input is empty) |
| Tab | Cycle through tabs |
| PgUp / PgDn | Scroll chat ±20 lines |
| Up / Down | Scroll chat ±3 lines (when input is empty) |
| Home / End | Scroll to top/bottom or cursor start/end |
| Esc | Clear input or quit |
| Ctrl+C | Quit |
Presets
Presets control which models are used for each pipeline role.
Fast — $0, Runs Anywhere
All roles use qwen2.5-coder:7b. Needs ~8GB RAM. Best for quick iteration and testing.
Balanced — $0, Better Quality
Uses qwen2.5-coder:32b for architect/coder. Needs ~20GB RAM. Good balance of speed and quality.
Premium — ~$0.30/mission, Best Quality
Dream Team v3: local 32B architect + Opus tester + local 80B coder + Sonnet fix/CTO + local 30B critics. Needs 48GB+ VRAM and ANTHROPIC_API_KEY.
| Role | Model | Provider | Cost |
|---|---|---|---|
| Architect | qwen2.5-coder:32b | Local | $0 |
| Tester | claude-opus-4-6 | Anthropic | ~$0.20 |
| Coder | qwen3-coder-next:q8_0 (80B) | Local | $0 |
| Fix Coder | claude-sonnet-4-6 | Anthropic | ~$0.05 |
| Security | qwen3-coder:30b-a3b-q8_0 | Local | $0 |
| Critique | qwen3-coder:30b-a3b-q8_0 | Local | $0 |
| CTO | claude-sonnet-4-6 | Anthropic | ~$0.05 |
Upgrade to full Dream Team v3 — add a Grok reasoning architect for complex missions (C7+):
$ battlecommand-forge mission "<prompt>" --architect-model grok-4.20-reasoning
Comparing Presets
| | Fast | Balanced | Premium |
|---|---|---|---|
| Avg score | 6.5/10 | 7.2/10 | 8.5/10 |
| Tests passing | ~40% | ~60% | ~85% |
| Cost per mission | $0 | $0 | ~$0.30 |
| Min RAM/VRAM | 8 GB | 20 GB | 48 GB |
| Time per mission | 3-8 min | 8-15 min | 8-12 min |
Configuration
Priority Order
Configuration resolves in this order (last wins):
Preset defaults → Environment variables → .battlecommand/models.toml → CLI flags
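The resolution itself is plain last-writer-wins layering per field. A minimal sketch of how that can look (hypothetical type and field names, not the project's actual code):

```rust
// Last-wins layering: later layers override earlier ones, field by field.
#[derive(Clone, Default)]
struct RoleConfig {
    model: Option<String>,
    context_size: Option<u32>,
    max_predict: Option<u32>,
}

fn resolve(layers: &[RoleConfig]) -> RoleConfig {
    // layers ordered: preset defaults, env vars, models.toml, CLI flags
    let mut out = RoleConfig::default();
    for l in layers {
        if l.model.is_some() { out.model = l.model.clone(); }
        if l.context_size.is_some() { out.context_size = l.context_size; }
        if l.max_predict.is_some() { out.max_predict = l.max_predict; }
    }
    out
}
```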
Config File
Generate the default config via the settings command; this creates .battlecommand/models.toml:
# BattleCommand Forge — Model Configuration
= "premium"
# Uncomment to customize any role:
# [architect]
# model = "qwen2.5-coder:32b"
# context_size = 32768
# max_predict = 4096
# [tester]
# model = "claude-opus-4-6"
# context_size = 200000
# max_predict = 8192
# [coder]
# model = "qwen3-coder-next:q8_0"
# context_size = 65536
# max_predict = 32768
# [fix_coder]
# model = "claude-sonnet-4-6"
# [security]
# model = "qwen3-coder:30b-a3b-q8_0"
# [critique]
# model = "qwen3-coder:30b-a3b-q8_0"
# [cto]
# model = "claude-sonnet-4-6"
Per-Role Options
Each role section supports:
| Field | Default | Description |
|---|---|---|
| model | from preset | Model name (e.g. claude-sonnet-4-6, qwen2.5-coder:32b) |
| context_size | 32768 | Context window in tokens |
| max_predict | 8192 | Max output tokens |
Provider is auto-detected: claude-* and grok-* models → cloud, everything else → local (Ollama).
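A sketch of that detection rule (illustrative; the project's real types will differ):

```rust
// Route claude-* and grok-* models to cloud APIs; everything else to Ollama.
#[derive(Debug, PartialEq)]
enum Provider { Cloud, Local }

fn detect_provider(model: &str) -> Provider {
    if model.starts_with("claude-") || model.starts_with("grok-") {
        Provider::Cloud
    } else {
        Provider::Local
    }
}

// detect_provider("claude-sonnet-4-6")  == Provider::Cloud
// detect_provider("qwen2.5-coder:32b")  == Provider::Local
```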
CLI Overrides (Highest Priority)
# Override a single role
$ battlecommand-forge mission "<prompt>" --coder-model qwen2.5-coder:32b

# Override all reviewers at once
$ battlecommand-forge mission "<prompt>" --reviewer-model claude-sonnet-4-6

# Mix and match
$ battlecommand-forge mission "<prompt>" --architect-model grok-4.20-reasoning --tester-model claude-opus-4-6
Environment Variables
Required (for cloud models)
# Claude Opus/Sonnet
export ANTHROPIC_API_KEY="sk-ant-..."

# Grok models
export XAI_API_KEY="xai-..."
Optional
| Variable | Example | Description |
|---|---|---|
| OLLAMA_HOST | 192.168.1.100:11434 | Remote Ollama URL |
| BRAVE_API_KEY | BSA... | CTO web search (falls back to DuckDuckGo) |
| ARCHITECT_MODEL | grok-4.20-reasoning | Override architect |
| TESTER_MODEL | claude-opus-4-6 | Override tester |
| CODER_MODEL | qwen3-coder-next:q8_0 | Override coder |
| FIX_CODER_MODEL | claude-sonnet-4-6 | Override fix coder |
| SECURITY_MODEL | qwen3-coder:30b-a3b-q8_0 | Override security reviewer |
| CRITIQUE_MODEL | qwen3-coder:30b-a3b-q8_0 | Override critique panel |
| CTO_MODEL | claude-sonnet-4-6 | Override CTO |
| REVIEWER_MODEL | claude-sonnet-4-6 | Override security + critique + CTO together |
You can also use a .env file in the project root — it's loaded automatically.
Remote Ollama (Cloud GPU)
Run models on a remote machine with a beefy GPU:
# On the remote machine (H200, A100, etc.):
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# From your Mac:
export OLLAMA_HOST=192.168.1.100:11434
Common Workflows
Workflow 1: Quick Prototype → Polish
# 1. Fast draft to see the shape
$ battlecommand-forge mission "<prompt>" --preset fast --auto

# 2. Check what needs fixing
$ battlecommand-forge verify output/<project>

# 3. Re-run with premium for production quality
$ battlecommand-forge mission "<prompt>" --preset premium --auto
Workflow 2: Research → Build
# 1. Launch TUI
$ battlecommand-forge tui

# 2. Chat with CTO to research architecture
> <ask about databases, frameworks, project structure>

# 3. Launch mission from chat when ready
> /mission <your refined prompt>
Workflow 3: Iterate on Existing Code
# 1. Generate initial project
$ battlecommand-forge mission "<prompt>" --auto

# 2. Add features with edit mode
$ battlecommand-forge edit --path output/<project> "<describe the feature>"

# 3. Verify after each edit
$ battlecommand-forge verify output/<project>
Workflow 4: Stress Test Your Pipeline
# Run graded tasks to benchmark your model setup
# Check reports
Workflow 5: Ship to GitHub
# 1. Generate project
# 2. Push to GitHub
# 3. Create a PR
How the Pipeline Works
┌─────────┐   ┌───────────┐   ┌────────┐   ┌───────┐   ┌──────────┐
│ ROUTER  │──▶│ ARCHITECT │──▶│ TESTER │──▶│ CODER │──▶│ VERIFIER │
│ C1-C10  │   │ ADR+spec  │   │  TDD   │   │ code  │   │ venv+test│
└─────────┘   └───────────┘   └────────┘   └───────┘   └──────────┘
                                                            │
     ┌──────────────────────────────────────────────────────┘
     ▼
┌──────────┐   ┌──────────┐   ┌─────┐   ┌──────────────┐
│ SECURITY │──▶│ CRITIQUE │──▶│ CTO │──▶│ QUALITY GATE │
│  OWASP   │   │ 5 scores │   │ ok? │   │  ship/fix?   │
└──────────┘   └──────────┘   └─────┘   └──────────────┘
                                               │
                                         ┌─────┴─────┐
                                         │           │
                                       PASS        FAIL
                                         │           │
                                       SHIP    ┌─────▼─────┐
                                        ✅     │  SURGICAL │
                                               │  FIX LOOP │──▶ back to CODER
                                               │(≤5 rounds)│
                                               └───────────┘
Stage Details
| # | Stage | What It Does | Model |
|---|---|---|---|
| 1 | Router | Scores complexity C1-C10 (rules + AI) | Local small |
| 2 | Architect | Writes ADR, file manifest, TDD test plan | Local 32B |
| 3 | Tester | Writes complete test suite BEFORE code | Opus ($0.20) |
| 4 | Coder | Generates all files in single shot | Local 80B |
| 5 | Verifier | Creates venv, pip install, ruff, pytest | — |
| 6 | Security | OWASP Top 10 review | Local/Sonnet |
| 7 | Critique | 5-in-1 scoring: DEV/ARCH/TEST/SEC/DOCS | Local/Sonnet |
| 8 | CTO | Mission-level coherence check | Sonnet ($0.05) |
| 9 | Gate | Ships if critique*0.4 + verifier*0.6 ≥ threshold | — |
Surgical Fix Loop
When the quality gate fails, the pipeline doesn't regenerate everything. Instead (see the sketch after this list), it:
- Traces imports — finds exactly which files are broken (NameError, ImportError, etc.)
- Fixes surgically — each broken file gets its own LLM call with specific error context
- Preserves clean code — files that pass stay untouched
- Tracks progress — if score declines 2+ rounds in a row, restores the best round's files
- Limits scope — fix rounds fix bugs only, never add features (prevents regression)
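A hypothetical skeleton of that loop — `score`, `trace_broken_files`, and `llm_fix` stand in for the real verifier, import tracer, and per-file LLM call:

```rust
use std::collections::BTreeMap;

type Files = BTreeMap<String, String>; // path -> contents

fn surgical_fix_loop(
    mut files: Files,
    max_rounds: usize,
    score: impl Fn(&Files) -> f64,
    trace_broken_files: impl Fn(&Files) -> Vec<String>,
    llm_fix: impl Fn(&str, &str) -> String, // (path, error context) -> fixed file
) -> Files {
    let mut best = (score(&files), files.clone());
    let mut declines = 0;
    for _ in 0..max_rounds {
        let broken = trace_broken_files(&files);
        if broken.is_empty() { break; }
        for path in broken {
            let old = files[&path].clone();
            files.insert(path.clone(), llm_fix(&path, &old)); // one targeted call per file
        }
        let s = score(&files);
        if s >= best.0 {
            best = (s, files.clone());
            declines = 0;
        } else {
            declines += 1;
            if declines >= 2 { break; } // stop and fall back to the best round
        }
    }
    best.1 // files from the best-scoring round; clean files were never touched
}
```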
Quality Gate Thresholds
| Complexity | Threshold | Why |
|---|---|---|
| C1-C6 | 9.2/10 | Simple tasks should be near-perfect |
| C7-C8 | 8.5/10 | Complex but achievable with fix rounds |
| C9-C10 | 8.0/10 | Very complex — reward functional code |
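Putting the gate together — a blended score checked against the complexity-scaled threshold. A minimal sketch of the rule exactly as documented above (not the project's actual code):

```rust
// Gate rule: critique*0.4 + verifier*0.6 must meet the scaled threshold.
fn gate_threshold(complexity: u8) -> f64 {
    match complexity {
        1..=6 => 9.2, // simple tasks should be near-perfect
        7..=8 => 8.5, // complex but achievable with fix rounds
        _ => 8.0,     // C9-C10: reward functional code
    }
}

fn gate_ships(critique: f64, verifier: f64, complexity: u8) -> bool {
    critique * 0.4 + verifier * 0.6 >= gate_threshold(complexity)
}

// From the demo run (C4): 9.1 < 9.2 → fix round; 9.4 ≥ 9.2 → shipped.
```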
Troubleshooting
"Ollama: not running"
# Start Ollama
ollama serve

# Or check if it's running
curl http://localhost:11434
"No models available"
# Pull the model you need
ollama pull qwen2.5-coder:7b

# List what's installed
ollama list
Tests always fail / score stuck below 7
This usually means the tester model is too weak. The #1 improvement is using Opus as the tester:
$ battlecommand-forge mission "<prompt>" --tester-model claude-opus-4-6
Opus writes correct test fixtures on the first try (~$0.20). Local 32B testers write tests that never run.
"Connection refused" with remote Ollama
# Make sure Ollama is listening on all interfaces (not just localhost)
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# Test from your Mac
curl http://192.168.1.100:11434
Build fails with old Rust
# Needs Rust 1.91+
rustup update
High API costs
Switch to a cheaper preset or use all-local models:
# $0 — all local
$ battlecommand-forge mission "<prompt>" --preset fast

# Or just make the coder local, keep cloud reviews
$ battlecommand-forge mission "<prompt>" --coder-model qwen2.5-coder:32b
Check your spend anytime:
$ battlecommand-forge status

# Or in TUI: /cost
Generated code has import errors
This is the most common issue. The pipeline handles it automatically via surgical fix rounds, but if you see it in the final output:
# Re-verify to see exact errors
$ battlecommand-forge verify output/<project>

# The error patterns are saved for future runs
Architecture
30 Modules, 14,000+ Lines of Rust
| Module | Purpose |
|---|---|
| mission.rs | 9-stage pipeline orchestration + surgical fix loop |
| tui.rs | 6-tab interactive TUI with CTO chat + 15 slash commands |
| llm.rs | Claude API + Ollama + Grok client + streaming + tool calling |
| cto.rs | CTO agent with 10 tools (web search, file read, verify, etc.) |
| verifier.rs | Venv creation + pip install + ruff + pytest |
| codegen.rs | Multi-file extraction from LLM output |
| model_config.rs | Per-role model config (preset → env → TOML → CLI) |
| model_picker.rs | Interactive model selection UI overlay |
| router.rs | Dual complexity scoring (rules + AI) |
| editor.rs | Edit existing codebases via LLM |
| sandbox.rs | Sandboxed execution, timeouts, env stripping |
| memory.rs | Learnings + few-shot examples + context injection |
| enterprise.rs | Audit logging, cost tracking, RBAC |
| report.rs | Pipeline report generation + viewer |
| hardware.rs | CPU/RAM/VRAM/Ollama monitoring |
| models.rs | Model listing, benchmarking, VRAM estimation |
| workspace.rs | Isolated git workspaces per mission |
| swebench.rs | SWE-bench evaluation: ReAct agent loop, dataset handling |
| swebench_tools.rs | 7 ReAct tools: read_file, grep, list_dir, run_command, write, edit, submit |
| swebench_eval.rs | SWE-bench report generation with per-repo breakdown |
| benchmark.rs | Multi-model benchmark framework (5 graded missions) |
| swarm.rs | Swarm mode: planner→coder→QA iteration with best-version selection |
| custom_commands.rs | User-defined commands from .battlecommand/commands/*.md |
| stress.rs | 21-task stress test suite (C4-C9) |
| snake.rs | Easter egg snake game |
| space.rs | Easter egg Space Invaders game |
| db.rs | Mission history (JSON file-based) |
| context.rs | Context compaction at 95% capacity |
| github.rs | GitHub push/PR via gh CLI |
| voice.rs | macOS TTS announcements |
Key Design Decisions
- Pure Rust, no Python bridge — single binary, no runtime deps
- TDD-first pipeline — tests are written BEFORE code, not after
- Surgical fixes over regeneration — fix rounds target only broken files, preserving working code
- Mix-and-match models — each pipeline role can use a different model (local or cloud)
- Venv per project — generated code runs in isolated environments
- Streaming everything — architect/tester/coder output streams live to terminal
- Quality gate is intentionally high — pushes models to produce production-grade output
License
Apache-2.0. See LICENSE.