llm-cascade 0.1.0

Resilient cascading LLM inference with automatic failover across multiple providers
Documentation

⚠️ Early Stage / Alpha Software

This project is in active early development. The API surface may change, and not all edge cases are handled yet. Contributions are welcome — see Contributing below. Use at your own risk.


llm-cascade

Resilient, cascading LLM inference across multiple providers — failover, circuit breaking, and retry cooldowns built in.

llm-cascade is a Rust library and CLI that sends prompts to an ordered list of LLM providers (OpenAI, Anthropic, Google Gemini, Ollama, and any OpenAI-compatible endpoint like Groq or Together). If one provider is rate-limited or down, it automatically falls through to the next, tracks per-entry cooldowns in SQLite, and persists failed prompts as JSON files.


Features

  • Cascading failover — define ordered provider/model lists; the first successful response wins
  • OpenAI-compatible providers — point any openai-type provider at a custom base_url (Groq, Together, Z.AI, vLLM, etc.)
  • Per-entry circuit breaker — cooldowns tracked per provider/model pair in SQLite
  • 429-aware backoff — parses retry-after headers; falls back to exponential backoff (30 s base, 1 h cap)
  • Cross-process state — cooldown state persists across CLI invocations via SQLite
  • Secret management — OS keyring (via keyring) with environment variable fallback
  • Failure persistence — total cascade failures saved as timestamped .json files
  • Full audit log — every attempt logged with timestamp, status, latency, and token counts
  • Dual interface — use as a CLI tool or as an async library in your own Rust projects

How It Works

┌──────────┐       ┌─────────────────────────────────────────────┐
│  CLI /   │       │              Cascade Engine                  │
│  Library │──────▶│                                             │
│  Caller  │       │  ┌───────────┐  ┌───────────┐  ┌─────────┐ │
└──────────┘       │  │ openai/   │─▶│ anthropic/│─▶│ ollama/ │ │
                   │  │ gpt-4o    │  │ claude…   │  │ llama3  │ │
                   │  └────┬──────┘  └────┬──────┘  └────┬────┘ │
                   │       │              │              │       │
                   │  ┌────▼──────────────▼──────────────▼────┐  │
                   │  │           SQLite Database              │  │
                   │  │  • attempt_log (audit trail)           │  │
                   │  │  • cooldown    (circuit breaker state)  │  │
                   │  └───────────────────────────────────────┘  │
                   │                                             │
                   │  On total failure:                          │
                   │  ┌───────────────────────────────────────┐  │
                    │  │  failed_prompts/cascade_20260414.json │  │
                   │  └───────────────────────────────────────┘  │
                   └─────────────────────────────────────────────┘
  1. Load config from ~/.config/llm-cascade/config.toml (or a custom path).
  2. Initialize SQLite — creates attempt_log and cooldown tables if missing.
  3. Iterate cascade entries — for each provider/model in the named cascade:
    • Check if the entry is on cooldown in the DB → skip if so.
    • Resolve the API key (keyring → env var).
    • Send the Conversation to the provider's API.
    • Log the attempt (status, latency, tokens).
    • On success → return LlmResponse immediately.
    • On failure → set cooldown (from retry-after header or exponential backoff) and continue.
  4. Total failure → persist the Conversation as a .json file, return CascadeError with the file path.

Installation

From source

git clone https://github.com/paluigi/llm-cascade.git
cd llm-cascade
cargo install --path .

As a library dependency

# Cargo.toml
[dependencies]
llm-cascade = "0.1"

Requires Rust 1.85+ (edition 2024).


Configuration

Copy the example config to the default location:

mkdir -p ~/.config/llm-cascade
cp config.example.toml ~/.config/llm-cascade/config.toml

Edit ~/.config/llm-cascade/config.toml:

# ── Provider Definitions ────────────────────────────────────
# Each block defines an endpoint (type, base_url, auth).
# Providers are referenced by name in cascades and can be
# reused with different models — no need to duplicate config.

[providers.openai]
type = "openai"
api_key_service = "openai"          # keyring entry name
api_key_env = "OPENAI_API_KEY"      # env var fallback
# base_url defaults to https://api.openai.com/v1

[providers.anthropic]
type = "anthropic"
api_key_service = "anthropic"
api_key_env = "ANTHROPIC_API_KEY"
# base_url defaults to https://api.anthropic.com

[providers.gemini]
type = "gemini"
api_key_service = "gemini"
api_key_env = "GOOGLE_API_KEY"
# base_url defaults to https://generativelanguage.googleapis.com

[providers.groq]
type = "openai"                     # reuse OpenAI-compatible protocol
base_url = "https://api.groq.com/openai/v1"
api_key_service = "groq"
api_key_env = "GROQ_API_KEY"

[providers.ollama]
type = "ollama"
base_url = "http://localhost:11434"
# No API key needed

# ── Cascades ───────────────────────────────────────────────
# Each entry references a provider by name and specifies a model.
# The same provider can appear multiple times with different models.

[cascades.creative_task]
entries = [
    { provider = "openai", model = "gpt-4o" },
    { provider = "anthropic", model = "claude-sonnet-4-20250514" },
    { provider = "gemini", model = "gemini-2.0-flash" },
]

[cascades.fast_task]
entries = [
    { provider = "ollama", model = "llama3" },
    { provider = "groq", model = "llama-3.3-70b-versatile" },
    { provider = "openai", model = "gpt-4o-mini" },
]

[cascades.resilient_task]
entries = [
    { provider = "openai", model = "gpt-4o" },
    { provider = "openai", model = "gpt-4o-mini" },       # same provider, different model
    { provider = "groq", model = "llama-3.3-70b-versatile" },
    { provider = "anthropic", model = "claude-sonnet-4-20250514" },
    { provider = "ollama", model = "llama3" },
]

# ── Persistence ────────────────────────────────────────────

[database]
path = "~/.local/share/llm-cascade/db.sqlite"

[failure_persistence]
dir = "~/.local/share/llm-cascade/failed_prompts"

Provider types

Type Description Default base_url
openai OpenAI Chat Completions API https://api.openai.com/v1
anthropic Anthropic Messages API https://api.anthropic.com
gemini Google Gemini generateContent API https://generativelanguage.googleapis.com
ollama Ollama local inference http://localhost:11434

Any provider with type = "openai" can be pointed at a custom base_url to use OpenAI-compatible services such as Groq, Together AI, Z.AI, vLLM, LiteLLM, etc.

API Keys

Method How it works
OS Keyring (preferred) Set via keyring CLI or your desktop's secret service. The api_key_service field is the keyring entry name.
Environment variable Export the variable named in api_key_env (e.g., export OPENAI_API_KEY=sk-...).
Ollama No API key needed for local models.

The library tries the keyring first and falls back to the environment variable automatically.


CLI Usage

Basic prompt

llm-cascade -C creative_task -p "Write a haiku about Rust"

From a JSON conversation file

llm-cascade -C creative_task -f conversation.json

The JSON file must match the Conversation schema:

{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is 2 + 2?" }
  ],
  "tools": [
    {
      "name": "get_weather",
      "description": "Get the current weather",
      "parameters": {
        "type": "object",
        "properties": {
          "location": { "type": "string" }
        },
        "required": ["location"]
      }
    }
  ]
}

Custom config path

llm-cascade -c /path/to/my/config.toml -C fast_task -p "Hello"

Output

  • Text responses are printed to stdout.
  • Tool call responses are printed as pretty JSON to stdout.
  • Errors (including CascadeError with the .json file path) are printed to stderr with exit code 1.

All options

Usage: llm-cascade [OPTIONS] --cascade <CASCADE>

Options:
  -C, --cascade <CASCADE>  Name of the cascade to use (defined in config)
  -p, --prompt <PROMPT>    Text prompt to send
  -f, --file <FILE>        Path to a JSON file containing the conversation
  -c, --config <CONFIG>    Path to config file (default: ~/.config/llm-cascade/config.toml)
  -h, --help               Print help

Verbosity

Control log output via the RUST_LOG environment variable:

RUST_LOG=debug llm-cascade -C creative_task -p "Hello"
RUST_LOG=llm_cascade=trace llm-cascade -C creative_task -p "Hello"

Library Usage

Use llm-cascade as an async library in any Rust project:

use llm_cascade::{run_cascade, load_config, db, Conversation, Message, MessageRole};

#[tokio::main]
async fn main() {
    let config = load_config(&"config.toml".into()).expect("config");
    let conn = db::init_db(&config.database.path).expect("db");

    let conversation = Conversation::new(vec![
        Message::system("You are a concise assistant."),
        Message::user("What is the capital of France?"),
    ]);

    match run_cascade("creative_task", &conversation, &config, &conn).await {
        Ok(response) => {
            println!("Model: {}", response.model);
            println!("Response: {}", response.text_only());
            if let (Some(in), Some(out)) = (response.input_tokens, response.output_tokens) {
                println!("Tokens: {} in / {} out", in, out);
            }
        }
        Err(e) => {
            eprintln!("Cascade failed: {}", e);
        }
    }
}

With tool definitions

use llm_cascade::{run_cascade, load_config, db, Conversation, Message, ToolDefinition};
use serde_json::json;

let conversation = Conversation::new(vec![
    Message::user("What's the weather in Tokyo?"),
]).with_tools(vec![
    ToolDefinition {
        name: "get_weather".into(),
        description: "Get current weather for a location".into(),
        parameters: json!({
            "type": "object",
            "properties": {
                "location": { "type": "string" }
            },
            "required": ["location"]
        }),
    },
]);

let response = run_cascade("creative_task", &conversation, &config, &conn).await?;

Key types

Type Description
Conversation Holds messages: Vec<Message> and optional tools: Vec<ToolDefinition>
Message A single message with role (System/User/Assistant/Tool), content, and optional tool_call_id
ToolDefinition Tool name, description, and JSON Schema parameters
LlmResponse Response with content: Vec<ContentBlock>, token counts, and model name
ContentBlock Either Text { text } or ToolCall { id, name, arguments }
CascadeError Contains cascade name, error message, and absolute path to the persisted .json file
ProviderError HTTP status, body, optional retry_after seconds

API Reference

run_cascade

pub async fn run_cascade(
    cascade_name: &str,
    conversation: &Conversation,
    config: &AppConfig,
    conn: &Connection,
) -> Result<LlmResponse, CascadeError>

The core entry point. Iterates through the named cascade's provider entries, skipping those on cooldown, and returns the first successful LlmResponse.

db::init_db

pub fn init_db(path: &str) -> Result<Connection, String>

Opens (or creates) the SQLite database and ensures the schema exists. Expands ~ in the path.

db::log_attempt

pub fn log_attempt(
    conn: &Connection,
    cascade_name: &str,
    provider_model: &str,
    http_status: Option<u16>,
    latency_ms: u64,
    input_tokens: Option<u32>,
    output_tokens: Option<u32>,
)

Inserts a row into the attempt_log table.

db::is_on_cooldown / db::set_cooldown

pub fn is_on_cooldown(conn: &Connection, provider_model: &str) -> bool
pub fn set_cooldown(conn: &Connection, provider_model: &str, cooldown_until: &str)

Query and update the cooldown table. Timestamps are RFC 3339 strings.

load_config

pub fn load_config(path: &Path) -> Result<AppConfig, String>

Reads and parses the TOML configuration file.


Cooldown & Backoff Behavior

Scenario Cooldown Duration
HTTP 429 with retry-after header Value from header (seconds)
HTTP 429 without header 30 s (doubles per consecutive failure, max 1 h)
Other HTTP error (4xx/5xx) 30 s base, exponential doubling
Successful response No cooldown set

Cooldowns are per entry (e.g., openai/gpt-4o can be on cooldown while openai/gpt-3.5-turbo stays active) and persisted in SQLite so separate CLI invocations share the same state.


Roadmap

  • Streaming response support
  • Configurable per-provider timeouts
  • Token budget limits per cascade
  • Retry with modified parameters (e.g., lower temperature)
  • Prometheus metrics export
  • Web dashboard for cooldown/attempt monitoring
  • Additional native providers (Mistral, Cohere, AWS Bedrock, Azure OpenAI)
  • Published crate on crates.io

Contributing

Contributions are welcome! This is an open-source project under the MIT license.

Getting started

  1. Fork the repository
  2. Clone your fork: git clone https://github.com/paluigi/llm-cascade.git
  3. Create a branch: git checkout -b feature/your-feature
  4. Build and test: cargo build && cargo clippy -- -D warnings

Making changes

  • Follow the existing code style (no comments unless necessary, concise naming)
  • Ensure cargo clippy -- -D warnings passes with zero warnings
  • Update this README if you change public API or configuration

Submitting

  1. Push to your fork: git push origin feature/your-feature
  2. Open a Pull Request against the main branch
  3. Describe your changes and the motivation behind them

Reporting issues

Use the GitHub issue tracker to report bugs, request features, or ask questions. Please include:

  • Rust version (rustc --version)
  • OS and version
  • Minimal reproduction steps
  • Relevant log output (with RUST_LOG=debug)

License

MIT — Copyright (c) 2026 Luigi Palumbo