# brainwires-providers

AI provider implementations for the Brainwires Agent Framework.
## Overview
brainwires-providers provides concrete implementations of the Provider trait for multiple AI services: Anthropic (Claude), OpenAI (GPT), Google (Gemini), Ollama, and local LLM inference via llama.cpp. Every provider converts to and from the unified brainwires-core message types, so callers can swap backends without changing application code.
Design principles:

- **Unified interface** — all providers implement the same `Provider` trait from `brainwires-core` (chat, streaming, tool calling)
- **Feature-gated backends** — cloud providers compile under `native` (default); local LLM always compiles; llama.cpp is behind `llama-cpp-2`
- **Rate limiting built in** — token-bucket `RateLimiter` and `RateLimitedClient` available to any provider
- **Streaming-first** — every provider returns `BoxStream<Result<StreamChunk>>` via `async_stream`
- **Tool calling** — Anthropic, OpenAI, Google, and Ollama all support function calling mapped to/from `brainwires_core::Tool`
- **Local inference** — CPU-optimized GGUF model support with model registry, preset configs, and inference parameter tuning
```
                          brainwires-providers

  ┌─ Provider trait (brainwires-core) ────────────────────────────────
  │   chat()        ──► ChatResponse
  │   stream_chat() ──► BoxStream<StreamChunk>
  │   name()        ──► &str
  └───────────────────────────────────────────────────────────────────
                                  │
                                  ▼
  ┌─ Cloud Providers (feature = "native") ────────────────────────────
  │
  │   AnthropicProvider ──► SSE streaming   ──► api.anthropic.com
  │   OpenAIProvider    ──► JSON Lines      ──► api.openai.com
  │   GoogleProvider    ──► event-stream    ──► generativelanguage.…
  │   OllamaProvider    ──► line-delim JSON ──► localhost:11434
  │            │
  │            ▼
  │   RateLimitedClient ──► RateLimiter (token-bucket)
  └───────────────────────────────────────────────────────────────────

  ┌─ Local LLM (always compiled, llama.cpp optional) ─────────────────
  │
  │   LocalLlmProvider ──► generate() / route() / process()
  │            │
  │            ▼
  │   LocalLlmConfig       ◄── LocalModelRegistry ◄── scan_models_dir()
  │   LocalModelType       ◄── chat_template() / stop_tokens()
  │   LocalInferenceParams ◄── factual() / creative() / routing()
  │   LocalLlmPool         ──► round-robin multi-instance inference
  └───────────────────────────────────────────────────────────────────

  ┌─ Shared Types ────────────────────────────────────────────────────
  │   ProviderType   (Anthropic | OpenAI | Google | Ollama | Custom)
  │   ProviderConfig (provider, model, api_key, base_url, options)
  └───────────────────────────────────────────────────────────────────
```
## Quick Start

Add to your Cargo.toml:

```toml
[dependencies]
brainwires-providers = "0.6"
```

Send a chat request with the Anthropic provider:

```rust
use brainwires_core::{ChatOptions, Message, Provider};
use brainwires_providers::AnthropicProvider;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let provider = AnthropicProvider::new("sk-ant-...", "claude-sonnet-4-20250514");

    let messages = vec![Message::user("Hello, Claude!")];
    let response = provider.chat(&messages, &[], &ChatOptions::default()).await?;

    println!("{}", response.message);
    Ok(())
}
```
## Features

| Feature | Default | Description |
|---|---|---|
| `native` | Yes | Enables cloud providers (Anthropic, OpenAI, Google, Ollama), `RateLimiter`, `RateLimitedClient`, and their dependencies (`reqwest`, `tokio`, `async-stream`, `tracing`, `uuid`) |
| `llama-cpp-2` | No | Enables local LLM inference via llama.cpp bindings. Heavy dependency (long compile times). Adds `tracing` and `tokio` even without `native` |

```toml
# Default (cloud providers only)
brainwires-providers = "0.6"

# With local LLM support
brainwires-providers = { version = "0.6", features = ["llama-cpp-2"] }

# Local LLM only (no cloud providers)
brainwires-providers = { version = "0.6", default-features = false, features = ["llama-cpp-2"] }
```
## Architecture

### Provider Trait

All providers implement the `Provider` trait from `brainwires-core`. This is the unified interface that callers program against.

| Method | Description |
|---|---|
| `name()` | Provider identifier string (e.g., `"anthropic"`, `"lfm2-350m"`) |
| `max_output_tokens()` | Optional maximum output token limit for the provider |
| `chat(messages, tools, options)` | Non-streaming chat completion returning `ChatResponse` |
| `stream_chat(messages, tools, options)` | Streaming chat returning `BoxStream<Result<StreamChunk>>` |
`ChatOptions` controls per-request behavior:

| Field | Type | Description |
|---|---|---|
| `system` | `Option<String>` | System prompt |
| `temperature` | `Option<f32>` | Sampling temperature (0.0–2.0) |
| `max_tokens` | `Option<u32>` | Maximum tokens to generate |
| `stop` | `Option<Vec<String>>` | Stop sequences |
`StreamChunk` variants:

| Variant | Description |
|---|---|
| `Text(String)` | Generated text token |
| `Usage(Usage)` | Token usage counts (input + output) |
| `Done` | Stream completion marker |
### ProviderType

Enum identifying the AI provider backend.

| Variant | `as_str()` | `default_model()` |
|---|---|---|
| `Anthropic` | `"anthropic"` | `claude-sonnet-4-20250514` |
| `OpenAI` | `"openai"` | `gpt-5-mini` |
| `Google` | `"google"` | `gemini-2.5-flash` |
| `Ollama` | `"ollama"` | `llama3.3` |
| `Custom` | `"custom"` | `claude-sonnet-4-20250514` |

`FromStr` also accepts aliases: `"gemini"` maps to `Google`, and `"brainwires"` maps to `Custom`.
### ProviderConfig

Configuration struct for initializing a provider.

| Field | Type | Default | Description |
|---|---|---|---|
| `provider` | `ProviderType` | — | Provider backend |
| `model` | `String` | — | Model name |
| `api_key` | `Option<String>` | `None` | API key (skipped in serialization if absent) |
| `base_url` | `Option<String>` | `None` | Custom endpoint URL |
| `options` | `HashMap<String, Value>` | `{}` | Additional provider-specific options (flattened in JSON) |

Builder methods: `new(provider, model)`, `with_api_key(key)`, `with_base_url(url)`.
### RateLimiter

Token-bucket rate limiter using atomic operations for lock-free reads.

| Field | Type | Description |
|---|---|---|
| `tokens` | `AtomicU32` | Current available tokens |
| `max_tokens` | `u32` | Configured requests-per-minute limit |
| `refill_interval` | `Duration` | Time between token refills (60s / rpm) |
| `last_refill` | `Mutex<Instant>` | Timestamp of last refill |

| Method | Description |
|---|---|
| `new(requests_per_minute)` | Create a limiter with the given RPM cap |
| `acquire()` | Async — consume one token, waiting if the bucket is depleted |
| `available_tokens()` | Current token count (diagnostic) |
| `max_requests_per_minute()` | Configured limit |
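To make the refill arithmetic concrete, here is a minimal single-threaded token-bucket sketch. It is illustrative only: the crate's `RateLimiter` keeps tokens in an `AtomicU32` behind an async `acquire()`, while the `TokenBucket` type and `try_acquire()` method below are hypothetical names.

```rust
use std::time::{Duration, Instant};

/// Hypothetical sketch of a token bucket: up to `max_tokens` tokens,
/// one token refilled every `refill_interval` = 60s / rpm.
struct TokenBucket {
    tokens: u32,
    max_tokens: u32,
    refill_interval: Duration,
    last_refill: Instant,
}

impl TokenBucket {
    fn new(requests_per_minute: u32) -> Self {
        Self {
            tokens: requests_per_minute,
            max_tokens: requests_per_minute,
            refill_interval: Duration::from_secs(60) / requests_per_minute,
            last_refill: Instant::now(),
        }
    }

    /// Credit any tokens earned since the last refill, then try to take one.
    fn try_acquire(&mut self) -> bool {
        let elapsed = self.last_refill.elapsed();
        let earned = (elapsed.as_nanos() / self.refill_interval.as_nanos()) as u32;
        if earned > 0 {
            self.tokens = (self.tokens + earned).min(self.max_tokens);
            // Advance by whole intervals (not to `now`) so partial credit
            // toward the next token is not lost.
            self.last_refill += self.refill_interval * earned;
        }
        if self.tokens > 0 {
            self.tokens -= 1;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut bucket = TokenBucket::new(60); // 60 RPM = one token per second
    assert!(bucket.try_acquire());
    println!("tokens left: {}", bucket.tokens);
}
```

The real implementation replaces the `&mut self` bookkeeping with atomics so concurrent readers never block.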
### RateLimitedClient

Wraps `reqwest::Client` with an optional `RateLimiter`. Every outgoing request waits for a token before sending.

| Method | Description |
|---|---|
| `new()` | Create a client with no rate limiting |
| `with_rate_limit(rpm)` | Create a client with the given RPM limit |
| `from_client(client, rpm)` | Wrap an existing `reqwest::Client` |
| `get(url)` | Build a GET request (waits for a token first) |
| `post(url)` | Build a POST request (waits for a token first) |
| `inner()` | Access the underlying `reqwest::Client` |
| `available_tokens()` | Returns `Option<u32>`, `None` if no limiter is configured |
### AnthropicProvider

Implements the `Provider` trait for the Anthropic Messages API (`https://api.anthropic.com/v1/messages`, version `2023-06-01`).

| Constructor | Description |
|---|---|
| `new(api_key, model)` | Create without rate limiting |
| `with_rate_limit(api_key, model, rpm)` | Create with rate limiting |

Streaming: Parses Server-Sent Events (SSE) with the `data:` prefix. Events include `message_start`, `content_block_delta`, `message_delta`, and `message_stop`.

Internal types: `AnthropicMessage`, `AnthropicContentBlock` (`Text`, `ToolUse`, `ToolResult`), `AnthropicTool`, `AnthropicResponse`, `AnthropicStreamEvent`, `AnthropicDelta`.

Message conversion: System messages are extracted from the message list and sent as a top-level `system` field. All other messages are converted to Anthropic's role/content-block format.
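The SSE framing described above can be illustrated with a small helper (a sketch, not the crate's actual parser; `sse_data_payload` is a hypothetical name). Each stream line is either an `event:` header or a `data:` line carrying a JSON payload:

```rust
/// Return the JSON payload of an SSE `data:` line, or None for other lines
/// (event headers, comments, blank keep-alives).
fn sse_data_payload(line: &str) -> Option<&str> {
    line.strip_prefix("data:").map(str::trim)
}

fn main() {
    // A text delta arrives as a `content_block_delta` event.
    let line = r#"data: {"type":"content_block_delta","delta":{"text":"Hi"}}"#;
    assert_eq!(
        sse_data_payload(line),
        Some(r#"{"type":"content_block_delta","delta":{"text":"Hi"}}"#)
    );
    // Event headers carry no payload.
    assert_eq!(sse_data_payload("event: message_stop"), None);
    println!("ok");
}
```

In the real provider the extracted payload is deserialized into `AnthropicStreamEvent` and mapped to `StreamChunk` values.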
### OpenAIProvider

Implements the `Provider` trait for the OpenAI Chat Completions API (`https://api.openai.com/v1/chat/completions`).

| Constructor | Description |
|---|---|
| `new(api_key, model)` | Create without rate limiting |
| `with_rate_limit(api_key, model, rpm)` | Create with rate limiting |
| `with_organization(org_id)` | Set the `OpenAI-Organization` header |

Streaming: Parses newline-delimited JSON (JSON Lines); each line is a `data: {json}` SSE chunk carrying `choices[0].delta`.

O1/O3 model detection: `is_o1_model()` detects reasoning models (`o1`, `o3` prefixes), which do not support `temperature`, `max_tokens`, or system messages.

Image support: Converts `ContentBlock::Image` to base64-encoded `image_url` content parts.

Internal types: `OpenAIMessage`, `OpenAIContent` (`Text` or `Array`), `OpenAIContentPart` (`Text`, `ImageUrl`, `ToolCall`), `OpenAITool`, `OpenAIResponse`.
### GoogleProvider

Implements the `Provider` trait for the Gemini API (`https://generativelanguage.googleapis.com/v1beta`).

| Constructor | Description |
|---|---|
| `new(api_key, model)` | Create without rate limiting |
| `with_rate_limit(api_key, model, rpm)` | Create with rate limiting |

Streaming: Uses `text/event-stream` with Gemini's custom event format.

Message conversion: System messages are filtered out and sent via `systemInstruction`. The assistant role maps to `"model"` in Gemini's API.

Image support: Converts images to `inlineData` parts with MIME type and base64 data.

Internal types: `GeminiMessage`, `GeminiPart` (`Text`, `InlineData`, `FunctionCall`, `FunctionResult`), `GeminiTool`, `GeminiResponse`.
### OllamaProvider

Implements the `Provider` trait for the Ollama REST API (default: `http://localhost:11434`).

| Constructor | Description |
|---|---|
| `new(model, base_url)` | Create with a model name and optional custom URL |
| `with_rate_limit(model, base_url, rpm)` | Create with rate limiting |

Streaming: Line-delimited JSON where each line contains a `message` field and a `done` boolean.

Content handling: Multiple content blocks are flattened into a single concatenated text string, since Ollama's API expects plain text.

Internal types: `OllamaMessage`, `OllamaTool`, `OllamaResponse`.
## Local LLM Subsystem

Always compiled (no feature gate). Actual llama.cpp inference additionally requires the `llama-cpp-2` feature.
### LocalLlmConfig

Configuration for a local GGUF model.

| Field | Type | Default | Description |
|---|---|---|---|
| `id` | `String` | `"local-model"` | Unique model identifier |
| `name` | `String` | `"Local Model"` | Human-readable name |
| `model_path` | `PathBuf` | — | Path to the `.gguf` file |
| `context_size` | `u32` | `4096` | Context window size in tokens |
| `num_threads` | `Option<u32>` | `None` (auto) | CPU threads for inference |
| `batch_size` | `u32` | `512` | Prompt processing batch size |
| `gpu_layers` | `u32` | `0` | GPU layers to offload (0 = CPU only) |
| `use_mmap` | `bool` | `true` | Memory-map the model file for faster loading |
| `use_mlock` | `bool` | `false` | Lock the model in RAM to prevent swapping |
| `max_tokens` | `u32` | `2048` | Maximum tokens per response |
| `model_type` | `LocalModelType` | `Lfm2` | Model family for prompt formatting |
| `system_template` | `Option<String>` | `None` | Custom system prompt template |
| `supports_tools` | `bool` | `false` | Whether the model handles tool/function calling |
| `estimated_ram_mb` | `Option<u32>` | `None` | Estimated RAM usage (display only) |

Preset constructors:

| Preset | Context | RAM | Tools | Description |
|---|---|---|---|---|
| `lfm2_350m(path)` | 32K | 220 MB | No | Fastest; routing and binary decisions |
| `lfm2_1_2b(path)` | 32K | 700 MB | No | Sweet spot for agentic logic |
| `lfm2_2_6b_exp(path)` | 32K | 1.5 GB | Yes | Complex reasoning and tool calling |
| `granite_nano_350m(path)` | 8K | 250 MB | No | Sub-second CPU responses |
| `granite_nano_1_5b(path)` | 8K | 900 MB | No | Balanced performance |

Validation: `validate()` checks that the model path exists and that context size and batch size are greater than 0.
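These validation rules can be sketched directly. The `Config` and `ConfigError` names below are hypothetical stand-ins that mirror the documented fields and error variants:

```rust
use std::path::PathBuf;

/// Hypothetical subset of the config fields that validation touches.
struct Config {
    model_path: PathBuf,
    context_size: u32,
    batch_size: u32,
}

#[derive(Debug, PartialEq)]
enum ConfigError {
    ModelNotFound(PathBuf),
    InvalidContextSize,
    InvalidBatchSize,
}

/// Check the documented invariants: file exists, sizes are non-zero.
fn validate(cfg: &Config) -> Result<(), ConfigError> {
    if !cfg.model_path.exists() {
        return Err(ConfigError::ModelNotFound(cfg.model_path.clone()));
    }
    if cfg.context_size == 0 {
        return Err(ConfigError::InvalidContextSize);
    }
    if cfg.batch_size == 0 {
        return Err(ConfigError::InvalidBatchSize);
    }
    Ok(())
}

fn main() {
    let cfg = Config {
        model_path: PathBuf::from("/nonexistent/model.gguf"),
        context_size: 4096,
        batch_size: 512,
    };
    // A missing file is reported before the size checks in this sketch.
    assert_eq!(
        validate(&cfg),
        Err(ConfigError::ModelNotFound(PathBuf::from("/nonexistent/model.gguf")))
    );
}
```

Failing fast at construction time keeps `load()` errors limited to genuine llama.cpp problems.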
### LocalModelType

Model family enum that determines chat template formatting and stop tokens.

| Variant | Chat Template Style | Stop Tokens |
|---|---|---|
| `Lfm2` | `<\|system\|>...<\|end\|>` | `<\|end\|>`, `<\|user\|>` |
| `Lfm2Agentic` | Same as `Lfm2` | Same as `Lfm2` |
| `Granite` | `<\|system\|>...\n` | `<\|user\|>`, `<\|system\|>` |
| `Qwen` | `<\|im_start\|>...<\|im_end\|>` | `<\|im_end\|>`, `<\|im_start\|>` |
| `Llama` | `<\|begin_of_text\|>...<\|eot_id\|>` | `<\|eot_id\|>`, `<\|start_header_id\|>` |
| `Phi` | Same as `Lfm2` | Same as `Lfm2` |
| `Generic` | `### System:...\n### User:...` | `### User:`, `### System:` |

Methods: `chat_template()`, `stop_tokens()`.
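The table reads as a mapping from model family to a template and a stop-token list. A toy two-family version (names and exact template strings are illustrative; the trailing assistant tag is an assumption, not taken from the table):

```rust
/// Hypothetical two-variant stand-in for LocalModelType.
#[derive(Clone, Copy)]
enum ModelFamily {
    Lfm2,
    Qwen,
}

/// Wrap system/user turns in the family's template markers.
fn format_prompt(family: ModelFamily, system: &str, user: &str) -> String {
    match family {
        // `<|assistant|>` suffix is an assumed generation cue.
        ModelFamily::Lfm2 => format!(
            "<|system|>{system}<|end|><|user|>{user}<|end|><|assistant|>"
        ),
        ModelFamily::Qwen => format!(
            "<|im_start|>system\n{system}<|im_end|>\n<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant\n"
        ),
    }
}

/// Tokens that should terminate generation for each family.
fn stop_tokens(family: ModelFamily) -> &'static [&'static str] {
    match family {
        ModelFamily::Lfm2 => &["<|end|>", "<|user|>"],
        ModelFamily::Qwen => &["<|im_end|>", "<|im_start|>"],
    }
}

fn main() {
    let prompt = format_prompt(ModelFamily::Lfm2, "Be terse.", "Hi");
    assert!(prompt.starts_with("<|system|>"));
    assert_eq!(stop_tokens(ModelFamily::Qwen)[0], "<|im_end|>");
    println!("{prompt}");
}
```

Keeping the template and stop tokens on the same enum guarantees they can never disagree for a given model family.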
### LocalInferenceParams

Per-request sampling parameters.

| Field | Type | Default | Description |
|---|---|---|---|
| `temperature` | `f32` | `0.7` | Sampling temperature (0.0 = deterministic) |
| `top_p` | `f32` | `0.9` | Nucleus sampling threshold |
| `top_k` | `u32` | `40` | Top-k sampling parameter |
| `repeat_penalty` | `f32` | `1.1` | Repetition penalty (1.0 = none) |
| `max_tokens` | `u32` | `2048` | Maximum tokens to generate |
| `stop_sequences` | `Vec<String>` | `[]` | Custom stop sequences |

Presets:

| Preset | Temperature | Top-k | Max Tokens | Use Case |
|---|---|---|---|---|
| `factual()` | 0.1 | 20 | 1024 | Deterministic, factual responses |
| `creative()` | 0.9 | 50 | 2048 | Varied, creative output |
| `routing()` | 0.0 | 1 | 50 | Classification and routing |
### LocalModelRegistry

Manages registered local models with persistence to `~/.config/brainwires/local_models.json`.

| Method | Description |
|---|---|
| `new()` | Create an empty registry |
| `with_default_dir()` | Create with the default models directory (`~/.local/share/brainwires/models/`) |
| `register(config)` | Add a model configuration |
| `get(id)` | Get a model by ID |
| `get_default()` | Get the default model |
| `set_default(id)` | Set the default model (returns `false` if the ID is not found) |
| `remove(id)` | Remove a model (clears the default if it was the removed model) |
| `list()` | List all registered models |
| `scan_models_dir()` | Auto-discover `.gguf` files and register them with detected model types |
| `load()` | Load the registry from the config file |
| `save()` | Save the registry to the config file |

Auto-detection: `scan_models_dir()` reads the models directory, infers `LocalModelType` from filenames (e.g., `lfm2` → `Lfm2`, `granite` → `Granite`), and estimates context size and RAM from model size indicators in the filename.
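Filename-based detection is straightforward to sketch. The substring heuristics below are illustrative; the crate's actual rules (and its full variant set) may differ:

```rust
/// Hypothetical stand-in for the detected model family.
#[derive(Debug, PartialEq)]
enum DetectedType {
    Lfm2,
    Granite,
    Qwen,
    Llama,
    Generic,
}

/// Infer the model family from a GGUF filename by matching known
/// family substrings, falling back to Generic.
fn detect_model_type(filename: &str) -> DetectedType {
    let name = filename.to_lowercase();
    if name.contains("lfm2") {
        DetectedType::Lfm2
    } else if name.contains("granite") {
        DetectedType::Granite
    } else if name.contains("qwen") {
        DetectedType::Qwen
    } else if name.contains("llama") {
        DetectedType::Llama
    } else {
        DetectedType::Generic
    }
}

fn main() {
    assert_eq!(detect_model_type("LFM2-1.2B-Q4_K_M.gguf"), DetectedType::Lfm2);
    assert_eq!(detect_model_type("granite-nano-350m.gguf"), DetectedType::Granite);
    assert_eq!(detect_model_type("mystery-model.gguf"), DetectedType::Generic);
    println!("ok");
}
```

The `Generic` fallback matters: an unrecognized model still gets a usable `### System:` style template rather than failing to register.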
### KnownModel

Pre-configured model definitions for easy discovery and downloading.

| Field | Description |
|---|---|
| `id` | Model identifier (e.g., `"lfm2-1.2b"`) |
| `name` | Human-readable name |
| `huggingface_repo` | Hugging Face repository path |
| `filename` | Expected GGUF filename |
| `model_type` | `LocalModelType` variant |
| `context_size` | Context window size |
| `estimated_ram_mb` | RAM requirement |
| `supports_tools` | Tool-calling support |
| `description` | Short description |

Access via `known_models()` (full list) or `get_known_model(id)` (by ID).
### LocalLlmProvider

Implements the `Provider` trait for local GGUF model inference. Lazy-loads the model on first use.

| Method | Description |
|---|---|
| `new(config)` | Create the provider (validates the config, does not load the model) |
| `lfm2_350m(path)` | Shorthand for the LFM2 350M preset |
| `lfm2_1_2b(path)` | Shorthand for the LFM2 1.2B preset |
| `config()` | Get the model configuration |
| `is_loaded()` | Check whether the model is in memory |
| `load()` | Load the model into memory (initializes the llama.cpp backend) |
| `unload()` | Release the model from memory |
| `generate(prompt, params)` | Generate text with custom parameters |
| `route(prompt)` | Quick routing/classification (deterministic params) |
| `process(prompt)` | Summarization/processing (factual params) |

Without the `llama-cpp-2` feature, `load()` and `generate()` return an error directing the user to enable the feature.
### LocalLlmPool

Round-robin pool of `LocalLlmProvider` instances for parallel inference.

| Method | Description |
|---|---|
| `new(config, instances)` | Create a pool with N identical provider instances |
| `next()` | Get the next provider (round-robin via `AtomicUsize`) |
| `load_all()` | Load all models in the pool |
| `unload_all()` | Unload all models |
| `size()` | Number of instances |
| `estimated_ram_mb()` | Total estimated RAM for the pool |
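The round-robin selection that `next()` performs can be sketched with a plain `AtomicUsize` counter (a standalone illustration; `RoundRobin` is a hypothetical type, not the pool itself):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical round-robin index: a shared counter incremented on each
/// call and reduced modulo the pool size, so concurrent callers spread
/// across instances without taking a lock.
struct RoundRobin {
    next: AtomicUsize,
    size: usize,
}

impl RoundRobin {
    fn new(size: usize) -> Self {
        Self { next: AtomicUsize::new(0), size }
    }

    fn next_index(&self) -> usize {
        // fetch_add returns the previous value, so indices start at 0.
        self.next.fetch_add(1, Ordering::Relaxed) % self.size
    }
}

fn main() {
    let rr = RoundRobin::new(4);
    let picks: Vec<usize> = (0..6).map(|_| rr.next_index()).collect();
    assert_eq!(picks, vec![0, 1, 2, 3, 0, 1]);
    println!("{picks:?}");
}
```

`Ordering::Relaxed` suffices here because only the counter itself must be consistent; no other memory is synchronized through it.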
### LocalLlmConfigError

| Variant | Description |
|---|---|
| `MissingModelPath` | Model path is empty |
| `ModelNotFound(PathBuf)` | No file exists at the given path |
| `InvalidContextSize` | Context size is 0 |
| `InvalidBatchSize` | Batch size is 0 |
| `ModelLoadError(String)` | llama.cpp failed to load the model |
| `InferenceError(String)` | Error during token generation |
## Usage Examples

### Stream a response from OpenAI

```rust
use brainwires_core::{ChatOptions, Message, Provider, StreamChunk};
use brainwires_providers::OpenAIProvider;
use futures::StreamExt;

let provider = OpenAIProvider::new("sk-...", "gpt-5-mini");
let messages = vec![Message::user("Write a haiku about Rust.")];
let options = ChatOptions::default();

let mut stream = provider.stream_chat(&messages, &[], &options).await?;
while let Some(chunk) = stream.next().await {
    match chunk? {
        StreamChunk::Text(text) => print!("{text}"),
        StreamChunk::Usage(_) => {}
        StreamChunk::Done => break,
    }
}
```
### Use tools with the Anthropic provider

```rust
use brainwires_core::{ChatOptions, Message, Tool};
use brainwires_core::Provider;
use brainwires_providers::AnthropicProvider;

let provider = AnthropicProvider::new("sk-ant-...", "claude-sonnet-4-20250514");

// Define one or more brainwires_core::Tool values describing your functions.
let tools: Vec<Tool> = vec![/* Tool definition(s) */];

let messages = vec![Message::user("What's the weather in Paris?")];
let options = ChatOptions::default();

let response = provider.chat(&messages, &tools, &options).await?;
// response.message may contain tool_use content blocks
```
### Rate-limited HTTP requests

```rust
use brainwires_providers::{RateLimitedClient, RateLimiter};

// Standalone rate limiter
let limiter = RateLimiter::new(60); // 60 RPM
limiter.acquire().await; // waits if the bucket is depleted

// Rate-limited HTTP client
let client = RateLimitedClient::with_rate_limit(120); // 120 RPM
let response = client.post("https://api.example.com/v1/data")
    .await
    .json(&serde_json::json!({ "hello": "world" }))
    .send()
    .await?;
println!("{}", response.status());
```
### Provider with rate limiting

```rust
use brainwires_core::{ChatOptions, Message, Provider};
use brainwires_providers::AnthropicProvider;

// Create a provider with a 60 requests-per-minute limit
let provider = AnthropicProvider::with_rate_limit("sk-ant-...", "claude-sonnet-4-20250514", 60);

let messages = vec![Message::user("Hello!")];
let response = provider.chat(&messages, &[], &ChatOptions::default()).await?;
```
### Configure a provider with ProviderConfig

```rust
use brainwires_providers::{ProviderConfig, ProviderType};

let config = ProviderConfig::new(ProviderType::Anthropic, "claude-sonnet-4-20250514")
    .with_api_key("sk-ant-...")
    .with_base_url("https://proxy.example.com");

assert_eq!(config.provider, ProviderType::Anthropic);
assert_eq!(config.base_url.as_deref(), Some("https://proxy.example.com"));

// Parse a provider from a string
let provider_type: ProviderType = "gemini".parse()?; // → ProviderType::Google
```
### Local LLM inference

```rust
use brainwires_providers::{LocalInferenceParams, LocalLlmProvider};
use std::path::PathBuf;

// Create a provider from a preset
let provider = LocalLlmProvider::lfm2_1_2b(PathBuf::from("models/lfm2-1.2b.gguf"))?;

// Load the model into memory
provider.load().await?;

// Quick routing (deterministic, max 50 tokens)
let route = provider.route("Is this a question about code? Answer yes or no.").await?;

// Full inference with custom params
let result = provider
    .generate("Summarize the following text: ...", LocalInferenceParams::factual())
    .await?;

// Or use it via the Provider trait
use brainwires_core::{ChatOptions, Message, Provider};
let messages = vec![Message::user("Hello!")];
let response = provider.chat(&messages, &[], &ChatOptions::default()).await?;

// Unload when done
provider.unload().await;
```
### Model registry and auto-discovery

```rust
use brainwires_providers::{known_models, LocalLlmConfig, LocalModelRegistry};
use std::path::PathBuf;

// Load the registry (or create a new one)
let mut registry = LocalModelRegistry::load()?;

// Register a model manually
registry.register(LocalLlmConfig::lfm2_350m(PathBuf::from("models/lfm2-350m.gguf")));
registry.set_default("lfm2-350m");

// Auto-discover GGUF files in the models directory
let discovered = registry.scan_models_dir()?;
for id in &discovered {
    println!("discovered: {id}");
}

// Browse known/recommended models
for model in known_models() {
    println!("{}: {}", model.id, model.description);
}

// Save the registry
registry.save()?;
```
### Local LLM pool for parallel inference

```rust
use brainwires_providers::{LocalLlmConfig, LocalLlmPool};
use std::path::PathBuf;

let config = LocalLlmConfig::lfm2_350m(PathBuf::from("models/lfm2-350m.gguf"));
let pool = LocalLlmPool::new(config, 4)?; // 4 instances

pool.load_all().await?;
println!("pool RAM estimate: {:?} MB", pool.estimated_ram_mb());

// Round-robin across instances
let provider = pool.next();
let result = provider.route("classify this request").await?;

pool.unload_all().await;
```
## Integration

Use via the `brainwires` facade crate with the `providers` feature, or depend on `brainwires-providers` directly:

```toml
# Via facade
[dependencies]
brainwires = { version = "0.6", features = ["providers"] }

# Direct
[dependencies]
brainwires-providers = "0.6"
```

Re-exports at the crate root for convenience:

```rust
use brainwires_providers::{
    AnthropicProvider, GoogleProvider, OllamaProvider, OpenAIProvider,
    ProviderConfig, ProviderType, RateLimitedClient, RateLimiter,
};
// Local LLM types (LocalLlmProvider, LocalLlmConfig, ...) are exported as well.
```
## License

Licensed under the MIT License. See LICENSE for details.