llm-rs
A Unix-philosophy agentic CLI for Large Language Models. Inspired by simonw/llm, built for composability --- stdin/stdout pipelines, subprocess-based tool and provider extensibility (llm-tool-*, llm-provider-*), and multi-target output (native CLI, WASM, Python).
Scope. llm-rs is a library (with a CLI), not an orchestration framework. Hierarchical workflows compose via specialist tools — small llm-tool-* executables that may internally invoke llm prompt with a narrow agent. See doc/spec/external-tools.md for the protocol, doc/cookbook/specialist-tools.md for a worked example, and doc/research/specialist-tools-vs-sub-agents.md for why llm-rs is not building recursive sub-agent delegation.
Usage
# Send a prompt (streams to stdout)
# Positional text works too
# Specify model and system prompt
# Use Anthropic models
# Disable streaming
# Show token usage on stderr
# Skip logging this prompt
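A sketch of the invocations behind the comments above. `llm prompt` and `-u` appear elsewhere in this README; the other flags (`-m`, `-s`, `--no-stream`, `--no-log`) are illustrative names, not confirmed syntax.

```bash
echo "Summarize the Unix philosophy" | llm prompt              # prompt text from stdin
llm prompt "Summarize the Unix philosophy"                     # positional text
llm prompt -m gpt-4o -s "Answer in one sentence" "Why pipes?"  # illustrative -m/-s flags
llm prompt --no-stream "Why pipes?"                            # illustrative flag
llm prompt -u "Why pipes?"                                     # token usage on stderr (-u, see Budget tracking)
llm prompt --no-log "Why pipes?"                               # illustrative flag
```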
Tool calling
Built-in tools let the model call functions during a conversation. The CLI manages the chain loop automatically --- it sends tool calls to the executor, feeds results back, and repeats until the model responds with text.
# Enable a built-in tool
# Multiple tools
# Limit chain iterations (default: 5 for prompt/chat, 10 for agents)
# Debug mode: show tool calls/results on stderr
# Verbose mode: see chain loop iterations (-v summary, -vv full messages)
# List available built-in tools
Available built-in tools:
- `llm_version` --- returns the CLI version
- `llm_time` --- returns current UTC and local time with timezone
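A sketch of the tool-calling invocations described by the comments above. `-T`, `--tools-debug`, and `-v`/`-vv` are documented in this README; the chain-limit flag name is illustrative.

```bash
llm prompt -T llm_time "What time is it?"                                # enable a built-in tool
llm prompt -T llm_time -T llm_version "Report the time and CLI version" # multiple tools
llm prompt -T llm_time --chain-limit 3 "What time is it?"                # illustrative flag for the iteration cap
llm prompt -T llm_time --tools-debug "What time is it?"                  # tool calls/results on stderr
llm prompt -T llm_time -v "What time is it?"                             # chain loop iteration summaries
```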
Verbose chain observability
When using tools, the -v/--verbose flag reveals what happens inside the chain loop --- which iteration you're on, what messages are being sent, per-iteration token usage, and tool call/result details.
# Level 1 (-v): iteration summary + tool debug
# stderr output:
# [chain] Iteration 1/5 | 1 message [user]
# [chain] Iteration 1 complete | usage: 10 input, 5 output | 1 tool call(s)
# Tool call: llm_time (id: call_1)
# Arguments: {}
# Tool result: {"utc_time":"...","local_time":"...","timezone":"..."}
# [chain] Iteration 2/5 | 3 messages [user, assistant+tools(1), tool(1)]
# [chain] Iteration 2 complete | usage: 20 input, 10 output | 0 tool call(s)
# Level 2 (-vv): also dumps full message JSON per iteration
# stderr additionally includes:
# [chain] Messages:
# [
# {"role": "user", "content": "What time is it?"}
# ]
--verbose implies --tools-debug --- no need for both flags. Works on both prompt and chat commands.
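For example, the sample output above corresponds to an invocation along these lines:

```bash
llm prompt -v -T llm_time "What time is it?"    # level 1: iteration summary + tool debug
llm prompt -vv -T llm_time "What time is it?"   # level 2: also dumps full message JSON
```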
External tools
Any executable on $PATH named llm-tool-* is automatically discovered and usable with -T. External tools can be written in any language.
# List all tools (built-in + external)
# Use an external tool
# Mix built-in and external tools
Writing an external tool requires two things:

- Schema: respond to `--schema` with JSON describing the tool.
- Execution: read the arguments JSON from stdin and write the result to stdout.
Exit 0 means success (stdout = output). Non-zero means error (stderr = error message). Default timeout: 30 seconds.
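A minimal external tool sketched as a shell script, matching the `upper` tool shown in the Plugins output below. The schema JSON shape (name/description/parameters) is an assumption; see doc/spec/external-tools.md for the authoritative format.

```bash
#!/usr/bin/env bash
# llm-tool-upper: uppercase text. Schema shape below is an assumption.
set -euo pipefail

if [ "${1:-}" = "--schema" ]; then
  cat <<'JSON'
{"name":"upper","description":"Uppercase text","parameters":{"type":"object","properties":{"text":{"type":"string"}},"required":["text"]}}
JSON
  exit 0
fi

# Execution: arguments JSON on stdin, result on stdout, non-zero exit + stderr on error.
text=$(jq -r '.text' 2>/dev/null) || { echo "invalid arguments JSON" >&2; exit 1; }
printf '%s' "$text" | tr '[:lower:]' '[:upper:]'
```

Dropped on `$PATH` as `llm-tool-upper` and marked executable, it is discovered automatically and usable with `-T`.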
Specialist tools. An external tool can itself call llm prompt internally with a cheaper model and a narrow system prompt — an opaque "specialist" function from the parent LLM's perspective. This is llm-rs's answer to hierarchical workflows, in place of a recursive sub-agent runtime. Worked example: doc/cookbook/specialist-tools.md. Rationale: doc/research/specialist-tools-vs-sub-agents.md.
External providers
Any executable on $PATH named llm-provider-* extends llm-rs with new model providers. External providers can serve models from Ollama, llama.cpp, or any custom backend.
# Models from external providers appear alongside built-in ones
# Use a model from an external provider
# See all providers and tools
Writing an external provider requires metadata flags and a JSON stdin/stdout protocol:
- `--id` --- print the provider name (e.g. `ollama`)
- `--models` --- print JSON array of model metadata
- `--needs-key` --- print `{"needed":false}` or `{"needed":true,"env_var":"MY_KEY"}`
On invocation, the provider reads a JSON request from stdin and writes either streaming JSONL lines or a single JSON response to stdout. See doc/implementation.md for the full protocol specification.
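For instance, probing an Ollama provider executable directly (the `--models` output shape is illustrative; see doc/implementation.md for the real schema):

```bash
llm-provider-ollama --id          # ollama
llm-provider-ollama --needs-key   # {"needed":false}
llm-provider-ollama --models      # JSON array of model metadata, e.g. [{"id":"llama3"}]
```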
Conversations
Continue previous conversations, use multi-turn message input, and chat interactively.
# Continue the most recent conversation
# Continue a specific conversation by ID
# Load messages from a JSON file
# Load messages from stdin
# Get JSON output instead of streaming text
# Combine: messages input with JSON output
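A sketch of the conversation flags. `-c`/`--cid` are listed in the Status section; the flags for message-file input and JSON output are not spelled out above, so they are omitted here.

```bash
llm prompt "Name three Unix pipe idioms"
llm prompt -c "Shorten that to one line"                 # continue the most recent conversation
llm prompt --cid <conversation-id> "Now as a haiku"      # continue a specific conversation by ID
```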
Interactive chat
# Start an interactive chat session
# Chat with a specific model and system prompt
# Chat with tools enabled
# Chat with verbose tool chain output
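A sketch of chat invocations; `llm chat`, `-T`, and `-v` are documented here, while `-m`/`-s` are illustrative flag names.

```bash
llm chat                                      # interactive session
llm chat -m claude-sonnet-4-6 -s "Be terse"   # illustrative -m/-s flags
llm chat -T llm_time -v                       # tools + verbose chain output
```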
Parallel tool execution
When the model requests multiple tool calls in a single turn, llm-rs dispatches them concurrently by default. Results are returned in the same order the model asked for them.
# Default: parallel dispatch, unlimited concurrency within a single iteration
# Cap concurrency
# Force sequential dispatch (e.g. to inspect tools one at a time)
--tools-approve forces sequential dispatch automatically so approval prompts don't interleave on stdin. Flags apply to prompt, chat, and agent run. Agents can set parallel_tools / max_parallel_tools in TOML; CLI flags override.
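For example (the concurrency-cap flag is not named above, so only the documented `--sequential-tools` opt-out is shown):

```bash
llm prompt -T llm_time -T llm_version "Report the time and CLI version"                      # parallel dispatch (default)
llm prompt -T llm_time -T llm_version --sequential-tools "Report the time and CLI version"   # force sequential
```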
Agents
Agents are TOML files that bundle a system prompt, model, tools, chain limit, options, budget, retry, and parallel-tool config. Global agents live in ~/.config/llm/agents/; project-local agents in ./.llm/agents/ (local shadows global).
# Run an agent
# CLI flags override agent TOML
# Dry-run: resolve model, provider, tools, options, budget, retry, and parallel config without calling the LLM
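A sketch using the subcommands listed in the Status section (`llm agent run/list/show/init/path`, `--dry-run`, `--json`); passing the prompt as a positional argument is an assumption.

```bash
llm agent list                                               # discovered agents (global + project-local)
llm agent run researcher "Summarize the history of JSONL"    # assumed positional prompt
llm agent run researcher --dry-run                           # resolve config without calling the LLM
llm agent run researcher --dry-run --json                    # machine-readable dry-run
```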
Example ~/.config/llm/agents/researcher.toml:
= "claude-sonnet-4-6"
= "You are a careful research assistant."
= ["llm_time", "llm_version"]
= 10
= true
= 4
[]
= 0.2
[]
= 50000
[]
= 3
= 1000
Budget tracking
Token usage accumulates across chain iterations. Pass -u to print cumulative totals; set [budget] max_tokens in an agent file to stop the chain when the total exceeds the cap. The chain finishes the current turn, emits a [budget] warning, and returns the partial result.
# Show cumulative usage across all chain iterations
# llm chat prints a session-wide usage summary on exit
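For example, combining `-u` with a tool-calling prompt:

```bash
llm prompt -T llm_time -u "What time is it?"   # cumulative usage across all chain iterations on stderr
```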
Retry and backoff
Transient HTTP errors (429, 5xx) are retried with exponential backoff and jitter before any response bytes are streamed. Configure per-invocation with --retries or per-agent via [retry].
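For example (the argument form of `--retries` is assumed):

```bash
llm prompt --retries 3 "Hello"   # retry 429/5xx up to 3 times with exponential backoff + jitter
```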
Options and aliases
Set persistent per-model options and model-name aliases in config.toml. CLI -o flags override config defaults per invocation.
# Options
# Aliases
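A sketch assuming the `-o key value` form used by the Python llm CLI that inspired this project; the exact `-o` syntax and the alias names are illustrative.

```bash
llm prompt -o temperature 0.2 "Hello"   # per-invocation option override (assumed syntax)
llm prompt -m sonnet "Hello"            # 'sonnet' resolving via a config.toml alias (illustrative)
```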
Structured output
Force the model to return JSON conforming to a schema. Works with both OpenAI (native response_format) and Anthropic (transparent tool wrapping).
# Schema DSL: simple field definitions
# With field descriptions
# JSON Schema literal
# Schema from a file
# Multiple items: wrap in array
# Preview DSL output
Schema DSL types: str (default), int, float, bool.
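A sketch of the DSL in use; the `--schema` flag name on `prompt` is an assumption (this README documents `--schema` only for external tools), and the DSL string merely illustrates the field/type pattern.

```bash
llm prompt --schema 'name, age int, active bool' "Invent a user profile"   # assumed flag name
```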
Key management
Keys are resolved in order: --key flag, keys.toml, environment variable (e.g. OPENAI_API_KEY, ANTHROPIC_API_KEY).
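For example, relying on the environment-variable fallback or the explicit flag (key values are placeholders):

```bash
OPENAI_API_KEY="sk-your-key" llm prompt "Hello"   # environment fallback
llm prompt --key "sk-your-key" "Hello"            # --key wins over keys.toml and env vars
```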
Model management
Available models:
- OpenAI: `gpt-4o`, `gpt-4o-mini`
- Anthropic: `claude-opus-4-6`, `claude-sonnet-4-6`, `claude-haiku-4-5`
Conversation logs
Every prompt is logged to a JSONL file (one per conversation). Logs are plain text --- inspect them with cat, grep, jq.
Log files live at ~/.local/share/llm/logs/. Each file is a JSONL conversation:
{"type":"conversation","v":1,"id":"01j5a...","model":"gpt-4o","name":"Hello","created":"2026-04-03T12:00:00Z"}
{"type":"response","id":"01j5b...","model":"gpt-4o","prompt":"Hello","response":"Hi!","usage":{"input":5,"output":3},"duration_ms":230,...}
Schema management
Plugins
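The `llm plugins` command (listed in the Status section) reports everything discovered:

```bash
llm plugins
```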
Example output:
Compiled providers:
openai (2 models: gpt-4o, gpt-4o-mini)
anthropic (3 models: claude-opus-4-6, claude-sonnet-4-6, claude-haiku-4-5)
External providers:
ollama (/usr/local/bin/llm-provider-ollama) (3 models: llama3, mistral, phi3)
External tools:
web_search (/usr/local/bin/llm-tool-web-search) — Search the web
upper (/usr/local/bin/llm-tool-upper) — Uppercase text
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Runtime error (I/O failure, storage error) |
| 2 | Configuration error (missing key, unknown model, bad config) |
| 3 | Provider error (API failure, network timeout) |
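The distinct codes make shell-level handling straightforward; for example:

```bash
llm prompt "Hello" || {
  code=$?
  if [ "$code" -eq 3 ]; then echo "provider/API failure" >&2; fi
  exit "$code"
}
```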
Library usage
In addition to the CLI, llm-rs can be used as a library from JavaScript/TypeScript (via WASM) or Python (via native module). Both support OpenAI and Anthropic.
WASM (browser / Obsidian plugin)
import init, { LlmClient } from '@llm-rs/wasm';
await init();
// Auto-detects provider from model name
const openai = new LlmClient('sk-...', 'gpt-4o');
const claude = new LlmClient('sk-ant-...', 'claude-sonnet-4-6');
// Or use explicit constructors
const client = LlmClient.newAnthropic('sk-ant-...', 'claude-sonnet-4-6');
const custom = LlmClient.newAnthropicWithBaseUrl('sk-ant-...', 'claude-sonnet-4-6', 'https://my-proxy.example.com');
// Non-streaming
const response = await client.prompt('Hello');
// With system prompt
const answer = await client.promptWithSystem('What is 2+2?', 'Answer only with the number');
// Streaming (callback per chunk)
await client.promptStreaming('Tell me a story', (chunk) => {
process.stdout.write(chunk);
});
// With options
const result = await client.promptWithOptions(
'Hello',
null, // system prompt (optional)
'{"temperature": 0.7, "max_tokens": 1000}'
);
Build from source:
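Presumably with wasm-pack (named in the Testing section); the target flag and output location are assumptions:

```bash
wasm-pack build crates/llm-wasm --target web --release   # emits a pkg/ directory with the JS bindings
```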
The WASM module is stateless --- no config files, no log storage. HTTP goes through the browser's fetch() API. The host application manages API keys and persistence.
Python
# Auto-detects provider from model name
# Or specify provider explicitly
# Non-streaming
# With system prompt
# Streaming (Python iterator)
Build from source (requires uv):
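A plausible sequence with uv and maturin (the Testing section mentions `maturin develop`); the exact commands are assumptions:

```bash
cd crates/llm-python
uv venv && uv pip install maturin   # local virtualenv with maturin
uv run maturin develop              # build the native module into the venv
```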
Optional parameters: provider ("openai" or "anthropic"), base_url for custom API endpoints, log_dir to enable JSONL logging.
Installation
Requires Rust 1.85+ (2024 edition).
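One plausible install route from a local checkout (a published crate or installer isn't documented here):

```bash
cargo install --path crates/llm-cli
```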
Or build from the workspace:
# Binary is at target/release/llm
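That is, a standard release build:

```bash
cargo build --release   # binary lands at target/release/llm
```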
Configuration
Config files live in XDG-standard directories:
~/.config/llm/config.toml # Main configuration
~/.config/llm/keys.toml # API keys (0600 permissions)
~/.local/share/llm/logs/ # Conversation logs (JSONL)
Set LLM_USER_PATH to put everything in one directory (useful for testing or migrating from Python llm).
config.toml:
= "gpt-4o-mini"
= true
[]
= "claude-sonnet-4-6"
= "gpt-4o-mini"
keys.toml:
= "sk-..."
= "sk-ant-..."
Environment variables
| Variable | Purpose |
|---|---|
| `OPENAI_API_KEY` | OpenAI API key (fallback if not in keys.toml) |
| `ANTHROPIC_API_KEY` | Anthropic API key (fallback if not in keys.toml) |
| `OPENAI_BASE_URL` | Override OpenAI API endpoint (for compatible APIs) |
| `ANTHROPIC_BASE_URL` | Override Anthropic API endpoint |
| `LLM_DEFAULT_MODEL` | Override default model |
| `LLM_USER_PATH` | Override config/data directory (flat layout) |
Architecture
Seven Rust crates in a Cargo workspace:
crates/
llm-core/ Traits, types, streaming, errors, config, keys, schema DSL, chain loop
llm-openai/ OpenAI Chat API provider (streaming + tools + structured output)
llm-anthropic/ Anthropic Messages API provider (streaming + tools + structured output)
llm-store/ JSONL conversation log storage and queries
llm-cli/ CLI binary (the `llm` command)
llm-wasm/ WASM library for browser/Obsidian (excluded from workspace)
llm-python/ Python native module via PyO3 (excluded from workspace)
Dependency flow: llm-cli, llm-wasm, and llm-python are top-level entry points -> llm-openai + llm-anthropic + llm-store -> llm-core. No cycles.
Key design choices vs the Python original:
- Subprocess extensibility, not in-process plugins. Instead of Python's pluggy-based plugin system, external tools (`llm-tool-*`) and providers (`llm-provider-*`) are standalone executables discovered on `$PATH`. Any language can implement the JSON stdin/stdout protocol. Compiled-in providers (OpenAI, Anthropic) are feature-gated for a minimal core binary.
- JSONL storage. One file per conversation instead of SQLite. Append-only, human-readable, no migrations.
- Async-first. A single `Provider` trait using futures streams, no sync/async class duplication.
- TOML config. Two files (`config.toml` + `keys.toml`) instead of six scattered JSON/YAML/text files.
- Feature-gated providers. Compile only the providers you need: `--features openai,anthropic` (both default), or `--no-default-features` for a minimal binary (see the build sketch after this list).
- Multi-target. Core crates compile for both native and `wasm32-unknown-unknown`. The same provider code runs in the CLI, in a browser, and in a Python module.
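For example, an OpenAI-only build of the CLI crate (passing the feature flags via `-p llm-cli` is an assumption about where the features live):

```bash
cargo build --release -p llm-cli --no-default-features --features openai
```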
See doc/design/architecture.md for design rationale, doc/roadmap.md for the phased roadmap.
Testing
| Crate | Tests | What's covered |
|---|---|---|
| `llm-core` | 198 | Types, config, keys, streams, schema DSL, chain loop, `ChainEvent`, `ParallelConfig` dispatch, messages, agent config, retry, budget (mock provider) |
| `llm-openai` | 44 | HTTP mocking (wiremock), SSE parsing, tool calls, structured output, multi-turn, `HttpError` mapping |
| `llm-anthropic` | 50 | HTTP mocking (wiremock), typed SSE, `tool_use` blocks, transparent schema wrapping, multi-turn, `HttpError` mapping |
| `llm-store` | 49 | JSONL round-trips, unicode, malformed recovery, listing/queries, message reconstruction |
| `llm-cli` | 189 | Subprocess protocol/discovery/execution, retry wrapper, dry-run rendering (62 unit), CLI integration (127 e2e with assert_cmd) |
Library targets are verified by their build toolchains: wasm-pack build for WASM, maturin develop for Python.
Status
Current version: v0.9. Phases 1–9 complete. See doc/roadmap.md for the full status table and remaining work.
- v0.1 --- CLI, WASM library, Python module; OpenAI + Anthropic providers end-to-end.
- v0.2 --- Tool calling, chain loop, built-in tools, structured output, schema DSL.
- v0.3 --- Multi-turn conversations, `-c`/`--cid`, `llm chat` REPL, full `llm logs`.
- v0.4 --- Subprocess extensibility (`llm-tool-*`, `llm-provider-*`), `llm plugins`, `-v`/`--verbose`, `-o`/`--option`, aliases.
- v0.5 --- Agent config & discovery (`llm agent run/list/show/init/path`).
- v0.6 --- Budget tracking with cumulative usage and per-chain enforcement.
- v0.7 --- Retry/backoff with exponential delay and jitter for transient HTTP errors.
- v0.8 --- `--dry-run` for `llm agent run` (plain or `--json`).
- v0.9 --- Parallel tool execution within a chain iteration, order-preserving, opt-out with `--sequential-tools`.
Next up: Ollama provider, attachments, extract flags. See the Future Work section of the roadmap. Sub-agent delegation and an agent memory system are explicitly parked — llm-rs delegates hierarchical workflows to specialist tools. See the design note for the rationale.
Out of scope: token budget enforcement across nested invocations. Each llm call tracks its own budget; users who need hierarchical budget caps should do shell-level accounting (e.g., sum usage from --json output across a wrapping script).
License
GPLv3