> ⚠️ **Early Stage / Alpha Software**
>
> This project is in active early development. The API surface may change, and not all edge cases are handled yet. Contributions are welcome — see Contributing below. Use at your own risk.
# llm-cascade

Resilient, cascading LLM inference across multiple providers — failover, circuit breaking, and retry cooldowns built in.
llm-cascade is a Rust library and CLI that sends prompts to an ordered list of LLM providers (OpenAI, Anthropic, Google Gemini, Ollama, and any OpenAI-compatible endpoint like Groq or Together). If one provider is rate-limited or down, it automatically falls through to the next, tracks per-entry cooldowns in SQLite, and persists failed prompts as JSON files.
## Features

- Cascading failover — define ordered provider/model lists; the first successful response wins
- OpenAI-compatible providers — point any `openai`-type provider at a custom `base_url` (Groq, Together, Z.AI, vLLM, etc.)
- Per-entry circuit breaker — cooldowns tracked per `provider/model` pair in SQLite
- 429-aware backoff — parses `retry-after` headers; falls back to exponential backoff (30 s base, 1 h cap)
- Cross-process state — cooldown state persists across CLI invocations via SQLite
- Secret management — OS keyring (via `keyring`) with environment variable fallback
- Failure persistence — total cascade failures saved as timestamped `.json` files
- Full audit log — every attempt logged with timestamp, status, latency, and token counts
- Dual interface — use as a CLI tool or as an async library in your own Rust projects
## How It Works

```
┌──────────┐       ┌─────────────────────────────────────────────┐
│  CLI /   │       │               Cascade Engine                │
│ Library  │──────▶│                                             │
│  Caller  │       │  ┌───────────┐  ┌───────────┐  ┌─────────┐  │
└──────────┘       │  │ openai/   │─▶│ anthropic/│─▶│ ollama/ │  │
                   │  │ gpt-4o    │  │ claude…   │  │ llama3  │  │
                   │  └────┬──────┘  └────┬──────┘  └────┬────┘  │
                   │       │              │              │       │
                   │  ┌────▼──────────────▼──────────────▼────┐  │
                   │  │            SQLite Database            │  │
                   │  │  • attempt_log (audit trail)          │  │
                   │  │  • cooldown (circuit breaker state)   │  │
                   │  └───────────────────────────────────────┘  │
                   │                                             │
                   │  On total failure:                          │
                   │  ┌───────────────────────────────────────┐  │
                   │  │ failed_prompts/cascade_20260414.json  │  │
                   │  └───────────────────────────────────────┘  │
                   └─────────────────────────────────────────────┘
```
1. Load config from `~/.config/llm-cascade/config.toml` (or a custom path).
2. Initialize SQLite — creates `attempt_log` and `cooldown` tables if missing.
3. Iterate cascade entries — for each `provider/model` in the named cascade:
   - Check if the entry is on cooldown in the DB → skip if so.
   - Resolve the API key (keyring → env var).
   - Send the `Conversation` to the provider's API.
   - Log the attempt (status, latency, tokens).
   - On success → return `LlmResponse` immediately.
   - On failure → set cooldown (from `retry-after` header or exponential backoff) and continue.
4. Total failure → persist the `Conversation` as a `.json` file and return a `CascadeError` with the file path.
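Stripped to its essentials, the control flow above can be sketched as follows. This is an illustrative sketch, not the crate's actual code: the real engine uses SQLite-backed cooldown state and async HTTP calls, which are stubbed out here as plain struct fields.

```rust
// Sketch of the cascade loop: skip entries on cooldown, take the
// first successful response, fall through on failure.
struct Entry {
    provider: &'static str,
    model: &'static str,
    on_cooldown: bool, // stands in for the SQLite cooldown check
    healthy: bool,     // stands in for the provider call succeeding
}

fn run_cascade_sketch(entries: &[Entry]) -> Result<String, String> {
    for e in entries {
        if e.on_cooldown {
            continue; // circuit breaker: skip entries still cooling down
        }
        if e.healthy {
            // first success wins
            return Ok(format!("response from {}/{}", e.provider, e.model));
        }
        // on failure the real engine would set a cooldown here,
        // then fall through to the next entry
    }
    Err("total cascade failure".into())
}

fn main() {
    let entries = [
        Entry { provider: "openai", model: "gpt-4o", on_cooldown: true, healthy: true },
        Entry { provider: "anthropic", model: "claude", on_cooldown: false, healthy: false },
        Entry { provider: "ollama", model: "llama3", on_cooldown: false, healthy: true },
    ];
    // openai is skipped (cooldown), anthropic fails, ollama answers
    assert_eq!(run_cascade_sketch(&entries).unwrap(), "response from ollama/llama3");
}
```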
## Installation
### From source
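Assuming a local Rust toolchain (1.85+, per below), a typical from-source install looks like this — the clone/install commands are a standard Cargo workflow, not project-specific tooling:

```shell
git clone https://github.com/paluigi/llm-cascade.git
cd llm-cascade
cargo install --path .
```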
### As a library dependency
```toml
# Cargo.toml
[dependencies]
llm-cascade = "0.1"
```
Requires Rust 1.85+ (edition 2024).
## Configuration
Copy the example config to the default location:
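For example (the `config.example.toml` filename is an assumption — adjust the source path to wherever the example file lives in the repository):

```shell
mkdir -p ~/.config/llm-cascade
cp config.example.toml ~/.config/llm-cascade/config.toml
```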
Edit ~/.config/llm-cascade/config.toml:
```toml
# ── Provider Definitions ────────────────────────────────────
# Each block defines an endpoint (type, base_url, auth).
# Providers are referenced by name in cascades and can be
# reused with different models — no need to duplicate config.

[providers.openai]
type = "openai"
api_key_service = "openai"     # keyring entry name
api_key_env = "OPENAI_API_KEY" # env var fallback
# base_url defaults to https://api.openai.com/v1

[providers.anthropic]
type = "anthropic"
api_key_service = "anthropic"
api_key_env = "ANTHROPIC_API_KEY"
# base_url defaults to https://api.anthropic.com

[providers.gemini]
type = "gemini"
api_key_service = "gemini"
api_key_env = "GOOGLE_API_KEY"
# base_url defaults to https://generativelanguage.googleapis.com

[providers.groq]
type = "openai" # reuse OpenAI-compatible protocol
base_url = "https://api.groq.com/openai/v1"
api_key_service = "groq"
api_key_env = "GROQ_API_KEY"

[providers.ollama]
type = "ollama"
base_url = "http://localhost:11434"
# No API key needed

# ── Cascades ───────────────────────────────────────────────
# Each entry references a provider by name and specifies a model.
# The same provider can appear multiple times with different models.

[cascades.default]
entries = [
  { provider = "openai", model = "gpt-4o" },
  { provider = "anthropic", model = "claude-sonnet-4-20250514" },
  { provider = "gemini", model = "gemini-2.0-flash" },
]

[cascades.local-first]
entries = [
  { provider = "ollama", model = "llama3" },
  { provider = "groq", model = "llama-3.3-70b-versatile" },
  { provider = "openai", model = "gpt-4o-mini" },
]

[cascades.full]
entries = [
  { provider = "openai", model = "gpt-4o" },
  { provider = "openai", model = "gpt-4o-mini" }, # same provider, different model
  { provider = "groq", model = "llama-3.3-70b-versatile" },
  { provider = "anthropic", model = "claude-sonnet-4-20250514" },
  { provider = "ollama", model = "llama3" },
]

# ── Persistence ────────────────────────────────────────────

[database]
path = "~/.local/share/llm-cascade/db.sqlite"

[failed_prompts]
dir = "~/.local/share/llm-cascade/failed_prompts"
```
### Provider types
| Type | Description | Default `base_url` |
|---|---|---|
| `openai` | OpenAI Chat Completions API | `https://api.openai.com/v1` |
| `anthropic` | Anthropic Messages API | `https://api.anthropic.com` |
| `gemini` | Google Gemini generateContent API | `https://generativelanguage.googleapis.com` |
| `ollama` | Ollama local inference | `http://localhost:11434` |
Any provider with type = "openai" can be pointed at a custom base_url to use OpenAI-compatible services such as Groq, Together AI, Z.AI, vLLM, LiteLLM, etc.
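For example, a Together AI provider block might look like this — the section layout and key names mirror the sample config above, and the endpoint is Together's OpenAI-compatible base URL (the keyring/env-var names here are conventional choices, not requirements):

```toml
[providers.together]
type = "openai"
base_url = "https://api.together.xyz/v1"
api_key_service = "together"
api_key_env = "TOGETHER_API_KEY"
```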
### API Keys
| Method | How it works |
|---|---|
| OS Keyring (preferred) | Set via keyring CLI or your desktop's secret service. The api_key_service field is the keyring entry name. |
| Environment variable | Export the variable named in api_key_env (e.g., export OPENAI_API_KEY=sk-...). |
| Ollama | No API key needed for local models. |
The library tries the keyring first and falls back to the environment variable automatically.
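As one concrete option, the cross-platform Python `keyring` CLI (`pip install keyring`) can store a secret under the service name from `api_key_service`. The username component shown here is illustrative — check which service/username pair the library actually looks up before relying on this:

```shell
# Stores a secret for service "openai"; prompts for the value.
# The username ("llm-cascade") is an assumption, not prescribed by the library.
keyring set openai llm-cascade
```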
## CLI Usage
### Basic prompt
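Assuming a cascade named `default` is defined in your config:

```shell
llm-cascade --cascade default --prompt "Summarize the attached design in two sentences."
```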
### From a JSON conversation file
The JSON file must match the `Conversation` schema:
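A minimal file might look like this — the field names follow the Key types table below, but the exact `role` serialization (capitalized vs. lowercase) is an assumption here:

```json
{
  "messages": [
    { "role": "user", "content": "What is a circuit breaker?" }
  ]
}
```

Then run it with `llm-cascade --cascade default --file conversation.json` (cascade name assumed).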
### Custom config path
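The config filename and cascade name below are placeholders:

```shell
llm-cascade --config ./local-config.toml --cascade default --prompt "ping"
```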
### Output
- Text responses are printed to stdout.
- Tool call responses are printed as pretty JSON to stdout.
- Errors (including `CascadeError` with the `.json` file path) are printed to stderr with exit code 1.
### All options
```
Usage: llm-cascade [OPTIONS] --cascade <CASCADE>

Options:
  -C, --cascade <CASCADE>  Name of the cascade to use (defined in config)
  -p, --prompt <PROMPT>    Text prompt to send
  -f, --file <FILE>        Path to a JSON file containing the conversation
  -c, --config <CONFIG>    Path to config file (default: ~/.config/llm-cascade/config.toml)
  -h, --help               Print help
```
### Verbosity
Control log output via the `RUST_LOG` environment variable:

```shell
RUST_LOG=debug              # debug-level logs from all crates
RUST_LOG=llm_cascade=trace  # trace-level logs from llm-cascade only
```
## Library Usage

Use `llm-cascade` as an async library in any Rust project. The snippet below is illustrative — see the API Reference for the exact signatures:

```rust
use llm_cascade::{load_config, run_cascade, Conversation, Message};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = load_config("~/.config/llm-cascade/config.toml")?;
    let conversation = Conversation::new(vec![Message::user("Hello!")]);
    let response = run_cascade(&config, "default", &conversation).await?;
    println!("{:?}", response.content);
    Ok(())
}
```
### With tool definitions

```rust
use llm_cascade::{Conversation, Message, ToolDefinition};
use serde_json::json;

let tools = vec![ToolDefinition {
    name: "get_weather".into(),
    description: "Current weather for a city".into(),
    parameters: json!({
        "type": "object",
        "properties": { "city": { "type": "string" } },
        "required": ["city"]
    }),
}];

let conversation = Conversation::new(vec![Message::user("Weather in Turin?")])
    .with_tools(tools);
let response = run_cascade(&config, "default", &conversation).await?;
```
### Key types
| Type | Description |
|---|---|
| `Conversation` | Holds `messages: Vec<Message>` and optional `tools: Vec<ToolDefinition>` |
| `Message` | A single message with role (System/User/Assistant/Tool), content, and optional `tool_call_id` |
| `ToolDefinition` | Tool name, description, and JSON Schema parameters |
| `LlmResponse` | Response with `content: Vec<ContentBlock>`, token counts, and model name |
| `ContentBlock` | Either `Text { text }` or `ToolCall { id, name, arguments }` |
| `CascadeError` | Contains cascade name, error message, and absolute path to the persisted `.json` file |
| `ProviderError` | HTTP status, body, optional `retry_after` seconds |
## API Reference
### `run_cascade`
```rust
pub async fn run_cascade(...) -> Result<LlmResponse, CascadeError>
```

The core entry point. Iterates through the named cascade's provider entries, skipping those on cooldown, and returns the first successful `LlmResponse`.
### `db::init_db`

Opens (or creates) the SQLite database and ensures the schema exists. Expands `~` in the path.
### `db::log_attempt`

Inserts a row into the `attempt_log` table.
### `db::is_on_cooldown` / `db::set_cooldown`

Query and update the `cooldown` table. Timestamps are RFC 3339 strings.
### `load_config`
Reads and parses the TOML configuration file.
## Cooldown & Backoff Behavior
| Scenario | Cooldown Duration |
|---|---|
| HTTP 429 with `retry-after` header | Value from header (seconds) |
| HTTP 429 without header | 30 s (doubles per consecutive failure, max 1 h) |
| Other HTTP error (4xx/5xx) | 30 s base, exponential doubling |
| Successful response | No cooldown set |
Cooldowns are per entry (e.g., openai/gpt-4o can be on cooldown while openai/gpt-3.5-turbo stays active) and persisted in SQLite so separate CLI invocations share the same state.
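The schedule in the table above can be sketched as a single function. This is an illustrative model of the documented rule, not the crate's actual implementation (the shift count and cap handling are assumptions beyond "30 s base, doubling, 1 h cap"):

```rust
use std::time::Duration;

/// Sketch of the cooldown rule: honor `retry-after` when present,
/// otherwise start at 30 s, double per consecutive failure, cap at 1 h.
fn cooldown_duration(retry_after: Option<u64>, consecutive_failures: u32) -> Duration {
    const BASE_SECS: u64 = 30;
    const CAP_SECS: u64 = 3_600;
    let secs = match retry_after {
        Some(s) => s, // server told us exactly how long to wait
        None => BASE_SECS
            // clamp the exponent so the shift cannot overflow
            .saturating_mul(1u64 << consecutive_failures.min(7))
            .min(CAP_SECS),
    };
    Duration::from_secs(secs)
}

fn main() {
    assert_eq!(cooldown_duration(Some(10), 5), Duration::from_secs(10));
    assert_eq!(cooldown_duration(None, 0), Duration::from_secs(30));
    assert_eq!(cooldown_duration(None, 1), Duration::from_secs(60));
    assert_eq!(cooldown_duration(None, 10), Duration::from_secs(3_600)); // capped
}
```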
## Roadmap
- Streaming response support
- Configurable per-provider timeouts
- Token budget limits per cascade
- Retry with modified parameters (e.g., lower temperature)
- Prometheus metrics export
- Web dashboard for cooldown/attempt monitoring
- Additional native providers (Mistral, Cohere, AWS Bedrock, Azure OpenAI)
- Published crate on crates.io
## Contributing
Contributions are welcome! This is an open-source project under the MIT license.
### Getting started
1. Fork the repository
2. Clone your fork: `git clone https://github.com/paluigi/llm-cascade.git`
3. Create a branch: `git checkout -b feature/your-feature`
4. Build and test: `cargo build && cargo clippy -- -D warnings`
### Making changes
- Follow the existing code style (no comments unless necessary, concise naming)
- Ensure `cargo clippy -- -D warnings` passes with zero warnings
- Update this README if you change public API or configuration
### Submitting
- Push to your fork: `git push origin feature/your-feature`
- Open a Pull Request against the `main` branch
- Describe your changes and the motivation behind them
### Reporting issues
Use the GitHub issue tracker to report bugs, request features, or ask questions. Please include:
- Rust version (`rustc --version`)
- OS and version
- Minimal reproduction steps
- Relevant log output (with `RUST_LOG=debug`)
## License
MIT — Copyright (c) 2026 Luigi Palumbo