> **⚠️ Early Stage / Alpha Software**
>
> This project is in active early development. The API surface may change, and not all edge cases are handled yet. Contributions are welcome — see [Contributing](#contributing) below. Use at your own risk.
---
# llm-cascade
**Resilient, cascading LLM inference across multiple providers — failover, circuit breaking, and retry cooldowns built in.**
`llm-cascade` is a Rust library and CLI that sends prompts to an ordered list of LLM providers (OpenAI, Anthropic, Google Gemini, Ollama, and any OpenAI-compatible endpoint like Groq or Together). If one provider is rate-limited or down, it automatically falls through to the next, tracks per-entry cooldowns in SQLite, and persists failed prompts as JSON files.
---
## Features
- **Cascading failover** — define ordered provider/model lists; the first successful response wins
- **OpenAI-compatible providers** — point any `openai`-type provider at a custom `base_url` (Groq, Together, Z.AI, vLLM, etc.)
- **Per-entry circuit breaker** — cooldowns tracked per `provider/model` pair in SQLite
- **429-aware backoff** — parses `retry-after` headers; falls back to exponential backoff (30 s base, 1 h cap)
- **Cross-process state** — cooldown state persists across CLI invocations via SQLite
- **Secret management** — OS keyring (via `keyring`) with environment variable fallback
- **Failure persistence** — total cascade failures saved as timestamped `.json` files
- **Full audit log** — every attempt logged with timestamp, status, latency, and token counts
- **Dual interface** — use as a CLI tool or as an async library in your own Rust projects
---
## How It Works
```
┌──────────┐       ┌─────────────────────────────────────────────┐
│  CLI /   │       │               Cascade Engine                │
│  Library │──────▶│                                             │
│  Caller  │       │  ┌───────────┐  ┌───────────┐  ┌─────────┐  │
└──────────┘       │  │ openai/   │─▶│ anthropic/│─▶│ ollama/ │  │
                   │  │ gpt-4o    │  │ claude…   │  │ llama3  │  │
                   │  └────┬──────┘  └────┬──────┘  └────┬────┘  │
                   │       │              │              │       │
                   │  ┌────▼──────────────▼──────────────▼────┐  │
                   │  │            SQLite Database            │  │
                   │  │ • attempt_log  (audit trail)          │  │
                   │  │ • cooldown     (circuit breaker state)│  │
                   │  └───────────────────────────────────────┘  │
                   │                                             │
                   │  On total failure:                          │
                   │  ┌───────────────────────────────────────┐  │
                   │  │ failed_prompts/cascade_20260414.json  │  │
                   │  └───────────────────────────────────────┘  │
                   └─────────────────────────────────────────────┘
```
1. **Load config** from `~/.config/llm-cascade/config.toml` (or a custom path).
2. **Initialize SQLite** — creates `attempt_log` and `cooldown` tables if missing.
3. **Iterate cascade entries** — for each `provider/model` in the named cascade:
   - Check if the entry is on cooldown in the DB → skip if so.
   - Resolve the API key (keyring → env var).
   - Send the `Conversation` to the provider's API.
   - Log the attempt (status, latency, tokens).
   - On success → return `LlmResponse` immediately.
   - On failure → set cooldown (from `retry-after` header or exponential backoff) and continue.
4. **Total failure** → persist the `Conversation` as a `.json` file, return `CascadeError` with the file path.
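
The steps above can be sketched as a plain loop. This is a simplified, synchronous illustration using only the standard library, not the crate's actual implementation: `Entry`, `AllFailed`, and `run_cascade_sketch` are hypothetical names, an in-memory `HashMap` stands in for the SQLite `cooldown` table, and a fixed 30 s cooldown replaces the real `retry-after`/backoff logic.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for one `provider/model` cascade entry.
struct Entry {
    provider: &'static str,
    model: &'static str,
}

impl Entry {
    /// Key used for per-entry cooldown tracking, e.g. "openai/gpt-4o".
    fn key(&self) -> String {
        format!("{}/{}", self.provider, self.model)
    }
}

/// Hypothetical error type listing the entries that were tried and failed.
#[derive(Debug)]
struct AllFailed {
    tried: Vec<String>,
}

/// Walk the entries in order and return the first successful response.
/// `call` stands in for the real provider request.
fn run_cascade_sketch(
    entries: &[Entry],
    cooldowns: &mut HashMap<String, Instant>,
    mut call: impl FnMut(&Entry) -> Result<String, String>,
) -> Result<String, AllFailed> {
    let mut tried = Vec::new();
    for entry in entries {
        let key = entry.key();
        // Skip entries whose cooldown has not expired yet.
        if cooldowns.get(&key).is_some_and(|until| *until > Instant::now()) {
            continue;
        }
        match call(entry) {
            Ok(text) => return Ok(text),
            Err(_) => {
                // Assumption: fixed 30 s cooldown instead of header/backoff logic.
                cooldowns.insert(key.clone(), Instant::now() + Duration::from_secs(30));
                tried.push(key);
            }
        }
    }
    // The library would persist the conversation to disk at this point.
    Err(AllFailed { tried })
}

fn main() {
    let entries = [
        Entry { provider: "openai", model: "gpt-4o" },
        Entry { provider: "ollama", model: "llama3" },
    ];
    let mut cooldowns = HashMap::new();
    // Simulate the first provider failing and the second one answering.
    let result = run_cascade_sketch(&entries, &mut cooldowns, |entry| {
        if entry.provider == "openai" {
            Err("HTTP 429".to_string())
        } else {
            Ok("hello".to_string())
        }
    });
    println!("{result:?}");
}
```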
---
## Installation
### From source
```sh
git clone https://github.com/paluigi/llm-cascade.git
cd llm-cascade
cargo install --path .
```
### As a library dependency
```toml
# Cargo.toml
[dependencies]
llm-cascade = "0.1"
```
> Requires Rust **1.85+** (edition 2024).
---
## Configuration
Copy the example config to the default location:
```sh
mkdir -p ~/.config/llm-cascade
cp config.example.toml ~/.config/llm-cascade/config.toml
```
Edit `~/.config/llm-cascade/config.toml`:
```toml
# ── Provider Definitions ────────────────────────────────────
# Each block defines an endpoint (type, base_url, auth).
# Providers are referenced by name in cascades and can be
# reused with different models — no need to duplicate config.
[providers.openai]
type = "openai"
api_key_service = "openai" # keyring entry name
api_key_env = "OPENAI_API_KEY" # env var fallback
# base_url defaults to https://api.openai.com/v1
[providers.anthropic]
type = "anthropic"
api_key_service = "anthropic"
api_key_env = "ANTHROPIC_API_KEY"
# base_url defaults to https://api.anthropic.com
[providers.gemini]
type = "gemini"
api_key_service = "gemini"
api_key_env = "GOOGLE_API_KEY"
# base_url defaults to https://generativelanguage.googleapis.com
[providers.groq]
type = "openai" # reuse OpenAI-compatible protocol
base_url = "https://api.groq.com/openai/v1"
api_key_service = "groq"
api_key_env = "GROQ_API_KEY"
[providers.ollama]
type = "ollama"
base_url = "http://localhost:11434"
# No API key needed
# ── Cascades ───────────────────────────────────────────────
# Each entry references a provider by name and specifies a model.
# The same provider can appear multiple times with different models.
[cascades.creative_task]
entries = [
    { provider = "openai", model = "gpt-4o" },
    { provider = "anthropic", model = "claude-sonnet-4-20250514" },
    { provider = "gemini", model = "gemini-2.0-flash" },
]

[cascades.fast_task]
entries = [
    { provider = "ollama", model = "llama3" },
    { provider = "groq", model = "llama-3.3-70b-versatile" },
    { provider = "openai", model = "gpt-4o-mini" },
]

[cascades.resilient_task]
entries = [
    { provider = "openai", model = "gpt-4o" },
    { provider = "openai", model = "gpt-4o-mini" }, # same provider, different model
    { provider = "groq", model = "llama-3.3-70b-versatile" },
    { provider = "anthropic", model = "claude-sonnet-4-20250514" },
    { provider = "ollama", model = "llama3" },
]

# ── Persistence ────────────────────────────────────────────
[database]
path = "~/.local/share/llm-cascade/db.sqlite"
[failure_persistence]
dir = "~/.local/share/llm-cascade/failed_prompts"
```
### Provider types
| Type | API | Default `base_url` |
|------|-----|--------------------|
| `openai` | OpenAI Chat Completions API | `https://api.openai.com/v1` |
| `anthropic` | Anthropic Messages API | `https://api.anthropic.com` |
| `gemini` | Google Gemini generateContent API | `https://generativelanguage.googleapis.com` |
| `ollama` | Ollama local inference | `http://localhost:11434` |

Any provider with `type = "openai"` can be pointed at a custom `base_url` to use OpenAI-compatible services such as **Groq**, **Together AI**, **Z.AI**, **vLLM**, **LiteLLM**, etc.
### API Keys
| Method | Details |
|--------|---------|
| **OS Keyring** (preferred) | Set via `keyring` CLI or your desktop's secret service. The `api_key_service` field is the keyring entry name. |
| **Environment variable** | Export the variable named in `api_key_env` (e.g., `export OPENAI_API_KEY=sk-...`). |
| **Ollama** | No API key needed for local models. |

The library tries the keyring first and falls back to the environment variable automatically.
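
The keyring-then-environment order can be illustrated with a small standalone sketch. It is not the crate's code: `resolve_api_key` is a hypothetical helper, and the keyring lookup is passed in as a closure so the example needs no external crates.

```rust
use std::env;

/// Try the keyring lookup first, then fall back to an environment variable.
/// `keyring_lookup` is a hypothetical stand-in for a real OS keyring call.
fn resolve_api_key(
    service: &str,
    env_name: &str,
    keyring_lookup: impl Fn(&str) -> Option<String>,
    env_lookup: impl Fn(&str) -> Option<String>,
) -> Option<String> {
    keyring_lookup(service).or_else(|| env_lookup(env_name))
}

fn main() {
    // No keyring backend in this sketch, so the lookup always misses.
    let no_keyring = |_: &str| None;
    let from_env = |name: &str| env::var(name).ok();
    match resolve_api_key("openai", "OPENAI_API_KEY", no_keyring, from_env) {
        Some(_) => println!("API key found"),
        None => println!("no API key configured"),
    }
}
```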
---
## CLI Usage
### Basic prompt
```sh
llm-cascade -C creative_task -p "Write a haiku about Rust"
```
### From a JSON conversation file
```sh
llm-cascade -C creative_task -f conversation.json
```
The JSON file must match the `Conversation` schema:
```json
{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is 2 + 2?" }
  ],
  "tools": [
    {
      "name": "get_weather",
      "description": "Get the current weather",
      "parameters": {
        "type": "object",
        "properties": {
          "location": { "type": "string" }
        },
        "required": ["location"]
      }
    }
  ]
}
```
### Custom config path
```sh
llm-cascade -c /path/to/my/config.toml -C fast_task -p "Hello"
```
### Output
- **Text responses** are printed to stdout.
- **Tool call responses** are printed as pretty JSON to stdout.
- **Errors** (including `CascadeError` with the `.json` file path) are printed to stderr with exit code 1.
### All options
```
Usage: llm-cascade [OPTIONS] --cascade <CASCADE>

Options:
  -C, --cascade <CASCADE>  Name of the cascade to use (defined in config)
  -p, --prompt <PROMPT>    Text prompt to send
  -f, --file <FILE>        Path to a JSON file containing the conversation
  -c, --config <CONFIG>    Path to config file (default: ~/.config/llm-cascade/config.toml)
  -h, --help               Print help
### Verbosity
Control log output via the `RUST_LOG` environment variable:
```sh
RUST_LOG=debug llm-cascade -C creative_task -p "Hello"
RUST_LOG=llm_cascade=trace llm-cascade -C creative_task -p "Hello"
```
---
## Library Usage
Use `llm-cascade` as an async library in any Rust project:
```rust
use std::path::Path;

use llm_cascade::{db, load_config, run_cascade, Conversation, Message};

#[tokio::main]
async fn main() {
    // Load the TOML config and open (or create) the SQLite database.
    let config = load_config(Path::new("config.toml")).expect("config");
    let conn = db::init_db(&config.database.path).expect("db");

    let conversation = Conversation::new(vec![
        Message::system("You are a concise assistant."),
        Message::user("What is the capital of France?"),
    ]);

    match run_cascade("creative_task", &conversation, &config, &conn).await {
        Ok(response) => {
            println!("Model: {}", response.model);
            println!("Response: {}", response.text_only());
            // Token counts are optional; print them only when both are reported.
            if let (Some(input), Some(output)) = (response.input_tokens, response.output_tokens) {
                println!("Tokens: {} in / {} out", input, output);
            }
        }
        Err(e) => {
            eprintln!("Cascade failed: {}", e);
        }
    }
}
```
### With tool definitions
```rust
use llm_cascade::{run_cascade, load_config, db, Conversation, Message, ToolDefinition};
use serde_json::json;

// Fragment: runs inside an async function that returns a `Result`, with
// `config` and `conn` set up as in the example above.
let conversation = Conversation::new(vec![
    Message::user("What's the weather in Tokyo?"),
]).with_tools(vec![
    ToolDefinition {
        name: "get_weather".into(),
        description: "Get current weather for a location".into(),
        parameters: json!({
            "type": "object",
            "properties": {
                "location": { "type": "string" }
            },
            "required": ["location"]
        }),
    },
]);

let response = run_cascade("creative_task", &conversation, &config, &conn).await?;
```
### Key types
| Type | Description |
|------|-------------|
| `Conversation` | Holds `messages: Vec<Message>` and optional `tools: Vec<ToolDefinition>` |
| `Message` | A single message with `role` (`System`/`User`/`Assistant`/`Tool`), `content`, and optional `tool_call_id` |
| `ToolDefinition` | Tool name, description, and JSON Schema parameters |
| `LlmResponse` | Response with `content: Vec<ContentBlock>`, token counts, and model name |
| `ContentBlock` | Either `Text { text }` or `ToolCall { id, name, arguments }` |
| `CascadeError` | Contains cascade name, error message, and absolute path to the persisted `.json` file |
| `ProviderError` | HTTP status, body, optional `retry_after` seconds |

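
To show how a response made of `ContentBlock` values might be consumed, here is a standalone sketch. The enum below is a hypothetical mirror of the shape listed in the table, not the crate's definition: the field types are assumptions (the real `arguments` may well be a JSON value rather than a `String`), and the free functions `text_only` and `tool_names` are illustrative helpers only.

```rust
/// Hypothetical mirror of the `ContentBlock` shape; field types are assumed.
#[derive(Debug, Clone, PartialEq)]
enum ContentBlock {
    Text { text: String },
    ToolCall { id: String, name: String, arguments: String },
}

/// Concatenate all text blocks, ignoring tool calls.
fn text_only(content: &[ContentBlock]) -> String {
    content
        .iter()
        .filter_map(|block| match block {
            ContentBlock::Text { text } => Some(text.as_str()),
            ContentBlock::ToolCall { .. } => None,
        })
        .collect()
}

/// Collect the names of all requested tool calls, in order.
fn tool_names(content: &[ContentBlock]) -> Vec<&str> {
    content
        .iter()
        .filter_map(|block| match block {
            ContentBlock::ToolCall { name, .. } => Some(name.as_str()),
            ContentBlock::Text { .. } => None,
        })
        .collect()
}

fn main() {
    let content = vec![
        ContentBlock::Text { text: "Checking the weather.".to_string() },
        ContentBlock::ToolCall {
            id: "call_1".to_string(),
            name: "get_weather".to_string(),
            arguments: r#"{"location":"Tokyo"}"#.to_string(),
        },
    ];
    println!("text: {}", text_only(&content));
    println!("tools: {:?}", tool_names(&content));
    println!("raw: {:?}", content);
}
```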
---
## API Reference
### `run_cascade`
```rust
pub async fn run_cascade(
    cascade_name: &str,
    conversation: &Conversation,
    config: &AppConfig,
    conn: &Connection,
) -> Result<LlmResponse, CascadeError>
```
The core entry point. Iterates through the named cascade's provider entries, skipping those on cooldown, and returns the first successful `LlmResponse`.
### `db::init_db`
```rust
pub fn init_db(path: &str) -> Result<Connection, String>
```
Opens (or creates) the SQLite database and ensures the schema exists. Expands `~` in the path.
### `db::log_attempt`
```rust
pub fn log_attempt(
    conn: &Connection,
    cascade_name: &str,
    provider_model: &str,
    http_status: Option<u16>,
    latency_ms: u64,
    input_tokens: Option<u32>,
    output_tokens: Option<u32>,
)
```
Inserts a row into the `attempt_log` table.
### `db::is_on_cooldown` / `db::set_cooldown`
```rust
pub fn is_on_cooldown(conn: &Connection, provider_model: &str) -> bool
pub fn set_cooldown(conn: &Connection, provider_model: &str, cooldown_until: &str)
```
Query and update the `cooldown` table. Timestamps are RFC 3339 strings.
### `load_config`
```rust
pub fn load_config(path: &Path) -> Result<AppConfig, String>
```
Reads and parses the TOML configuration file.
---
## Cooldown & Backoff Behavior
| Condition | Cooldown |
|-----------|----------|
| HTTP 429 with `retry-after` header | Value from header (seconds) |
| HTTP 429 without header | 30 s (doubles per consecutive failure, max 1 h) |
| Other HTTP error (4xx/5xx) | 30 s base, exponential doubling |
| Successful response | No cooldown set |

Cooldowns are **per entry** (e.g., `openai/gpt-4o` can be on cooldown while `openai/gpt-3.5-turbo` stays active) and **persisted in SQLite** so separate CLI invocations share the same state.
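
The table can be read as a small function. The sketch below is an illustration under stated assumptions, not the crate's code: `cooldown_for` is a hypothetical name, the failure count starts at 1, the 1 h cap is assumed to apply to every backoff path, and a `retry-after` value is used as-is without capping.

```rust
use std::time::Duration;

const BASE_SECS: u64 = 30; // 30 s base
const MAX_SECS: u64 = 3600; // 1 h cap

/// Cooldown for a failed attempt.
/// `retry_after_secs` is the parsed `retry-after` header, if any;
/// `consecutive_failures` counts failures in a row, starting at 1.
fn cooldown_for(retry_after_secs: Option<u64>, consecutive_failures: u32) -> Duration {
    if let Some(secs) = retry_after_secs {
        return Duration::from_secs(secs);
    }
    // 30 s, 60 s, 120 s, ... capped at one hour. The exponent is clamped
    // so the shift cannot overflow for very large failure counts.
    let exponent = consecutive_failures.saturating_sub(1).min(20);
    let secs = BASE_SECS.saturating_mul(1u64 << exponent).min(MAX_SECS);
    Duration::from_secs(secs)
}

fn main() {
    for failures in 1..=9 {
        println!("{failures} failure(s): {:?}", cooldown_for(None, failures));
    }
    println!("with retry-after: {:?}", cooldown_for(Some(12), 3));
}
```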
---
## Roadmap
- [ ] Streaming response support
- [ ] Configurable per-provider timeouts
- [ ] Token budget limits per cascade
- [ ] Retry with modified parameters (e.g., lower temperature)
- [ ] Prometheus metrics export
- [ ] Web dashboard for cooldown/attempt monitoring
- [ ] Additional native providers (Mistral, Cohere, AWS Bedrock, Azure OpenAI)
- [x] Published crate on crates.io
---
## Contributing
Contributions are welcome! This is an open-source project under the MIT license.
### Getting started
1. Fork the repository
2. Clone your fork: `git clone https://github.com/<your-username>/llm-cascade.git`
3. Create a branch: `git checkout -b feature/your-feature`
4. Build and lint: `cargo build && cargo clippy -- -D warnings`
### Making changes
- Follow the existing code style (no comments unless necessary, concise naming)
- Ensure `cargo clippy -- -D warnings` passes with zero warnings
- Update this README if you change public API or configuration
### Submitting
1. Push to your fork: `git push origin feature/your-feature`
2. Open a Pull Request against the `main` branch
3. Describe your changes and the motivation behind them
### Reporting issues
Use the [GitHub issue tracker](https://github.com/paluigi/llm-cascade/issues) to report bugs, request features, or ask questions. Please include:
- Rust version (`rustc --version`)
- OS and version
- Minimal reproduction steps
- Relevant log output (with `RUST_LOG=debug`)
---
## License
[MIT](LICENSE) — Copyright (c) 2026 Luigi Palumbo