⚠️ Early Stage / Alpha Software

This project is in active early development. The API surface may change, and not all edge cases are handled yet. Contributions are welcome — see Contributing below. Use at your own risk.

llm-cascade

Resilient, cascading LLM inference across multiple providers — failover, circuit breaking, and retry cooldowns built in.

llm-cascade is a Rust library and CLI that sends prompts to an ordered list of LLM providers (OpenAI, Anthropic, Google Gemini, Ollama, and any OpenAI-compatible endpoint like Groq or Together). If one provider is rate-limited or down, it automatically falls through to the next, tracks per-entry cooldowns in SQLite, and persists failed prompts as JSON files.

Features

Cascading failover — define ordered provider/model lists; the first successful response wins
OpenAI-compatible providers — point any openai-type provider at a custom base_url (Groq, Together, Z.AI, vLLM, etc.)
Per-entry circuit breaker — cooldowns tracked per provider/model pair in SQLite
429-aware backoff — parses retry-after headers; falls back to exponential backoff (30 s base, 1 h cap)
Cross-process state — cooldown state persists across CLI invocations via SQLite
Secret management — OS keyring (via keyring) with environment variable fallback
Failure persistence — total cascade failures saved as timestamped .json files
Full audit log — every attempt logged with timestamp, status, latency, and token counts
Dual interface — use as a CLI tool or as an async library in your own Rust projects

How It Works

┌──────────┐       ┌─────────────────────────────────────────────┐
│  CLI /   │       │              Cascade Engine                  │
│  Library │──────▶│                                             │
│  Caller  │       │  ┌───────────┐  ┌───────────┐  ┌─────────┐ │
└──────────┘       │  │ openai/   │─▶│ anthropic/│─▶│ ollama/ │ │
                   │  │ gpt-4o    │  │ claude…   │  │ llama3  │ │
                   │  └────┬──────┘  └────┬──────┘  └────┬────┘ │
                   │       │              │              │       │
                   │  ┌────▼──────────────▼──────────────▼────┐  │
                   │  │           SQLite Database              │  │
                   │  │  • attempt_log (audit trail)           │  │
                   │  │  • cooldown    (circuit breaker state)  │  │
                   │  └───────────────────────────────────────┘  │
                   │                                             │
                   │  On total failure:                          │
                   │  ┌───────────────────────────────────────┐  │
                    │  │  failed_prompts/cascade_20260414.json │  │
                   │  └───────────────────────────────────────┘  │
                   └─────────────────────────────────────────────┘

Load config from ~/.config/llm-cascade/config.toml (or a custom path).
Initialize SQLite — creates attempt_log and cooldown tables if missing.
Iterate cascade entries — for each provider/model in the named cascade:
- Check if the entry is on cooldown in the DB → skip if so.
- Resolve the API key (keyring → env var).
- Send the Conversation to the provider's API.
- Log the attempt (status, latency, tokens).
- On success → return LlmResponse immediately.
- On failure → set cooldown (from retry-after header or exponential backoff) and continue.
Total failure → persist the Conversation as a .json file, return CascadeError with the file path.

Installation

From source

git clone https://github.com/paluigi/llm-cascade.git
cd llm-cascade
cargo install --path .

As a library dependency

# Cargo.toml
[dependencies]
llm-cascade = "0.1"

Requires Rust 1.85+ (edition 2024).

Configuration

Copy the example config to the default location:

mkdir -p ~/.config/llm-cascade
cp config.example.toml ~/.config/llm-cascade/config.toml

Edit ~/.config/llm-cascade/config.toml:

# ── Provider Definitions ────────────────────────────────────
# Each block defines an endpoint (type, base_url, auth).
# Providers are referenced by name in cascades and can be
# reused with different models — no need to duplicate config.

[providers.openai]
type = "openai"
api_key_service = "openai"          # keyring entry name
api_key_env = "OPENAI_API_KEY"      # env var fallback
# base_url defaults to https://api.openai.com/v1

[providers.anthropic]
type = "anthropic"
api_key_service = "anthropic"
api_key_env = "ANTHROPIC_API_KEY"
# base_url defaults to https://api.anthropic.com

[providers.gemini]
type = "gemini"
api_key_service = "gemini"
api_key_env = "GOOGLE_API_KEY"
# base_url defaults to https://generativelanguage.googleapis.com

[providers.groq]
type = "openai"                     # reuse OpenAI-compatible protocol
base_url = "https://api.groq.com/openai/v1"
api_key_service = "groq"
api_key_env = "GROQ_API_KEY"

[providers.ollama]
type = "ollama"
base_url = "http://localhost:11434"
# No API key needed

# ── Cascades ───────────────────────────────────────────────
# Each entry references a provider by name and specifies a model.
# The same provider can appear multiple times with different models.

[cascades.creative_task]
entries = [
    { provider = "openai", model = "gpt-4o" },
    { provider = "anthropic", model = "claude-sonnet-4-20250514" },
    { provider = "gemini", model = "gemini-2.0-flash" },
]

[cascades.fast_task]
entries = [
    { provider = "ollama", model = "llama3" },
    { provider = "groq", model = "llama-3.3-70b-versatile" },
    { provider = "openai", model = "gpt-4o-mini" },
]

[cascades.resilient_task]
entries = [
    { provider = "openai", model = "gpt-4o" },
    { provider = "openai", model = "gpt-4o-mini" },       # same provider, different model
    { provider = "groq", model = "llama-3.3-70b-versatile" },
    { provider = "anthropic", model = "claude-sonnet-4-20250514" },
    { provider = "ollama", model = "llama3" },
]

# ── Persistence ────────────────────────────────────────────

[database]
path = "~/.local/share/llm-cascade/db.sqlite"

[failure_persistence]
dir = "~/.local/share/llm-cascade/failed_prompts"

Provider types

Type	Description	Default `base_url`
`openai`	OpenAI Chat Completions API	`https://api.openai.com/v1`
`anthropic`	Anthropic Messages API	`https://api.anthropic.com`
`gemini`	Google Gemini generateContent API	`https://generativelanguage.googleapis.com`
`ollama`	Ollama local inference	`http://localhost:11434`

Any provider with type = "openai" can be pointed at a custom base_url to use OpenAI-compatible services such as Groq, Together AI, Z.AI, vLLM, LiteLLM, etc.

API Keys

Method	How it works
OS Keyring (preferred)	Set via `keyring` CLI or your desktop's secret service. The `api_key_service` field is the keyring entry name.
Environment variable	Export the variable named in `api_key_env` (e.g., `export OPENAI_API_KEY=sk-...`).
Ollama	No API key needed for local models.

The library tries the keyring first and falls back to the environment variable automatically.

CLI Usage

Basic prompt

llm-cascade -C creative_task -p "Write a haiku about Rust"

From a JSON conversation file

llm-cascade -C creative_task -f conversation.json

The JSON file must match the Conversation schema:

{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is 2 + 2?" }
  ],
  "tools": [
    {
      "name": "get_weather",
      "description": "Get the current weather",
      "parameters": {
        "type": "object",
        "properties": {
          "location": { "type": "string" }
        },
        "required": ["location"]
      }
    }
  ]
}

Custom config path

llm-cascade -c /path/to/my/config.toml -C fast_task -p "Hello"

Output

Text responses are printed to stdout.
Tool call responses are printed as pretty JSON to stdout.
Errors (including CascadeError with the .json file path) are printed to stderr with exit code 1.

All options

Usage: llm-cascade [OPTIONS] --cascade <CASCADE>

Options:
  -C, --cascade <CASCADE>  Name of the cascade to use (defined in config)
  -p, --prompt <PROMPT>    Text prompt to send
  -f, --file <FILE>        Path to a JSON file containing the conversation
  -c, --config <CONFIG>    Path to config file (default: ~/.config/llm-cascade/config.toml)
  -h, --help               Print help

Verbosity

Control log output via the RUST_LOG environment variable:

RUST_LOG=debug llm-cascade -C creative_task -p "Hello"
RUST_LOG=llm_cascade=trace llm-cascade -C creative_task -p "Hello"

Library Usage

Use llm-cascade as an async library in any Rust project:

use llm_cascade::{run_cascade, load_config, db, Conversation, Message, MessageRole};

#[tokio::main]
async fn main() {
    let config = load_config(&"config.toml".into()).expect("config");
    let conn = db::init_db(&config.database.path).expect("db");

    let conversation = Conversation::new(vec![
        Message::system("You are a concise assistant."),
        Message::user("What is the capital of France?"),
    ]);

    match run_cascade("creative_task", &conversation, &config, &conn).await {
        Ok(response) => {
            println!("Model: {}", response.model);
            println!("Response: {}", response.text_only());
            if let (Some(in), Some(out)) = (response.input_tokens, response.output_tokens) {
                println!("Tokens: {} in / {} out", in, out);
            }
        }
        Err(e) => {
            eprintln!("Cascade failed: {}", e);
        }
    }
}

With tool definitions

use llm_cascade::{run_cascade, load_config, db, Conversation, Message, ToolDefinition};
use serde_json::json;

let conversation = Conversation::new(vec![
    Message::user("What's the weather in Tokyo?"),
]).with_tools(vec![
    ToolDefinition {
        name: "get_weather".into(),
        description: "Get current weather for a location".into(),
        parameters: json!({
            "type": "object",
            "properties": {
                "location": { "type": "string" }
            },
            "required": ["location"]
        }),
    },
]);

let response = run_cascade("creative_task", &conversation, &config, &conn).await?;

Key types

Type	Description
`Conversation`	Holds `messages: Vec<Message>` and optional `tools: Vec<ToolDefinition>`
`Message`	A single message with `role` (`System`/`User`/`Assistant`/`Tool`), `content`, and optional `tool_call_id`
`ToolDefinition`	Tool name, description, and JSON Schema parameters
`LlmResponse`	Response with `content: Vec<ContentBlock>`, token counts, and model name
`ContentBlock`	Either `Text { text }` or `ToolCall { id, name, arguments }`
`CascadeError`	Contains cascade name, error message, and absolute path to the persisted `.json` file
`ProviderError`	HTTP status, body, optional `retry_after` seconds

API Reference

`run_cascade`

pub async fn run_cascade(
    cascade_name: &str,
    conversation: &Conversation,
    config: &AppConfig,
    conn: &Connection,
) -> Result<LlmResponse, CascadeError>

The core entry point. Iterates through the named cascade's provider entries, skipping those on cooldown, and returns the first successful LlmResponse.

`db::init_db`

pub fn init_db(path: &str) -> Result<Connection, String>

Opens (or creates) the SQLite database and ensures the schema exists. Expands ~ in the path.

`db::log_attempt`

pub fn log_attempt(
    conn: &Connection,
    cascade_name: &str,
    provider_model: &str,
    http_status: Option<u16>,
    latency_ms: u64,
    input_tokens: Option<u32>,
    output_tokens: Option<u32>,
)

Inserts a row into the attempt_log table.

`db::is_on_cooldown` / `db::set_cooldown`

pub fn is_on_cooldown(conn: &Connection, provider_model: &str) -> bool
pub fn set_cooldown(conn: &Connection, provider_model: &str, cooldown_until: &str)

Query and update the cooldown table. Timestamps are RFC 3339 strings.

`load_config`

pub fn load_config(path: &Path) -> Result<AppConfig, String>

Reads and parses the TOML configuration file.

Cooldown & Backoff Behavior

Scenario	Cooldown Duration
HTTP 429 with `retry-after` header	Value from header (seconds)
HTTP 429 without header	30 s (doubles per consecutive failure, max 1 h)
Other HTTP error (4xx/5xx)	30 s base, exponential doubling
Successful response	No cooldown set

Cooldowns are per entry (e.g., openai/gpt-4o can be on cooldown while openai/gpt-3.5-turbo stays active) and persisted in SQLite so separate CLI invocations share the same state.

Roadmap

Streaming response support
Configurable per-provider timeouts
Token budget limits per cascade
Retry with modified parameters (e.g., lower temperature)
Prometheus metrics export
Web dashboard for cooldown/attempt monitoring
Additional native providers (Mistral, Cohere, AWS Bedrock, Azure OpenAI)
Published crate on crates.io

Contributing

Contributions are welcome! This is an open-source project under the MIT license.

Getting started

Fork the repository
Clone your fork: git clone https://github.com/paluigi/llm-cascade.git
Create a branch: git checkout -b feature/your-feature
Build and test: cargo build && cargo clippy -- -D warnings

Making changes

Follow the existing code style (no comments unless necessary, concise naming)
Ensure cargo clippy -- -D warnings passes with zero warnings
Update this README if you change public API or configuration

Submitting

Push to your fork: git push origin feature/your-feature
Open a Pull Request against the main branch
Describe your changes and the motivation behind them

Reporting issues

Use the GitHub issue tracker to report bugs, request features, or ask questions. Please include:

Rust version (rustc --version)
OS and version
Minimal reproduction steps
Relevant log output (with RUST_LOG=debug)

llm-cascade 0.1.0

llm-cascade

Features

How It Works

Installation

From source

As a library dependency

Configuration

Provider types

API Keys

CLI Usage

Basic prompt

From a JSON conversation file

Custom config path

Output

All options

Verbosity

Library Usage

With tool definitions

Key types

API Reference

`run_cascade`

`db::init_db`

`db::log_attempt`

`db::is_on_cooldown` / `db::set_cooldown`

`load_config`

Cooldown & Backoff Behavior

Roadmap

Contributing

Getting started

Making changes

Submitting

Reporting issues

License

llm-cascade 0.1.0

llm-cascade

Features

How It Works

Installation

From source

As a library dependency

Configuration

Provider types

API Keys

CLI Usage

Basic prompt

From a JSON conversation file

Custom config path

Output

All options

Verbosity

Library Usage

With tool definitions

Key types

API Reference

run_cascade

db::init_db

db::log_attempt

db::is_on_cooldown / db::set_cooldown

load_config

Cooldown & Backoff Behavior

Roadmap

Contributing

Getting started

Making changes

Submitting

Reporting issues

License

`run_cascade`

`db::init_db`

`db::log_attempt`

`db::is_on_cooldown` / `db::set_cooldown`

`load_config`