ferryllm
Universal LLM protocol middleware for OpenAI, Anthropic, Claude Code, and OpenAI-compatible backends.
ferryllm is a Rust gateway that normalizes client and provider LLM protocols into one internal representation. Use it as a Claude Code bridge, a private model gateway, or an embeddable adapter layer.
What It Does
- Accepts OpenAI-compatible chat requests at POST /v1/chat/completions
- Accepts OpenAI Responses API requests at POST /v1/responses
- Accepts Anthropic-compatible messages at POST /v1/messages
- Rewrites model names with exact and prefix routing rules
- Forwards to OpenAI-compatible, OpenAI Responses, Anthropic, or optional Gemini backend adapters
- Preserves tool calls and SSE streaming behavior
- Keeps prompt-cache keys stable while stripping transport metadata
- Maps reasoning control through the IR and provider adapters
Why ferryllm
Most gateways end up as an N x M matrix: every client protocol needs custom code for every provider protocol.
ferryllm needs only N + M translators instead, because every protocol converts to and from a single IR:
Client protocol -> ferryllm IR -> provider protocol
That makes it easier to:
- put Claude Code behind a stable backend
- expose one gateway to multiple client protocols
- keep cache behavior predictable
- add new providers without rewriting every client path
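The N + M idea can be sketched in a few lines of Rust. This is a minimal illustration, not ferryllm's actual code: IrRequest, AnthropicStyleRequest, OpenAiStyleRequest, ChatStyleBackend, and handle are all hypothetical names, and the request shapes are heavily simplified. Each client protocol only knows how to become the IR, and each backend only knows how to render the IR.

```rust
// Hypothetical, simplified types. ferryllm's real IR lives in src/ir.rs
// and is richer than this (tools, streaming, content blocks).

/// Stand-in for the unified internal representation.
#[derive(Debug, Clone, PartialEq)]
pub struct IrRequest {
    pub model: String,
    pub system: Option<String>,
    pub user_text: String,
}

/// Simplified Anthropic-style client request (system prompt is top-level).
pub struct AnthropicStyleRequest {
    pub model: String,
    pub system: Option<String>,
    pub prompt: String,
}

/// Simplified OpenAI-style client request: (role, content) pairs.
pub struct OpenAiStyleRequest {
    pub model: String,
    pub messages: Vec<(String, String)>,
}

// One conversion per client protocol (N of these).
impl From<AnthropicStyleRequest> for IrRequest {
    fn from(r: AnthropicStyleRequest) -> Self {
        IrRequest { model: r.model, system: r.system, user_text: r.prompt }
    }
}

impl From<OpenAiStyleRequest> for IrRequest {
    fn from(r: OpenAiStyleRequest) -> Self {
        let mut system = None;
        let mut user_text = String::new();
        for (role, content) in r.messages {
            match role.as_str() {
                "system" => system = Some(content),
                "user" => user_text = content, // this sketch keeps only the last user message
                _ => {}
            }
        }
        IrRequest { model: r.model, system, user_text }
    }
}

/// One implementation per provider protocol (M of these).
pub trait Adapter {
    fn render(&self, req: &IrRequest) -> String;
}

pub struct ChatStyleBackend;

impl Adapter for ChatStyleBackend {
    fn render(&self, req: &IrRequest) -> String {
        format!("chat model={} system={:?} user={}", req.model, req.system, req.user_text)
    }
}

/// Any client protocol can reach any backend through the IR.
pub fn handle(client_req: impl Into<IrRequest>, adapter: &dyn Adapter) -> String {
    adapter.render(&client_req.into())
}
```

Adding a new provider in this shape means writing one more Adapter implementation; none of the From conversions change.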
Fast Start
Install from crates.io with cargo install ferryllm.
This installs the ferryllm CLI only. Native desktop GUI builds are distributed
through GitHub Releases.
Run from source:
Set the provider key and start the server:
RUST_LOG=info
Smoke test the Anthropic-compatible endpoint:
Claude Code Bridge
Claude Code sends Anthropic-format requests. ferryllm can receive those requests, rewrite the model, and forward them to an OpenAI-compatible backend.
Claude Code
-> POST /v1/messages, model = claude-*
-> ferryllm Anthropic entry
-> unified IR
-> route match: claude-
-> rewrite backend model: gpt-5.4
-> OpenAI-compatible backend
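The route-match and rewrite steps in this flow can be sketched as below. This is an illustration under stated assumptions, not ferryllm's router: Route, MatchKind, Resolved, route, and resolve are hypothetical names, and the precedence used here (exact match first, then the longest matching prefix, with "*" treated as a catch-all) is an assumption rather than documented behavior.

```rust
// Hypothetical sketch of exact and prefix model routing with optional rewrite.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MatchKind {
    Exact,
    Prefix,
}

#[derive(Debug, Clone)]
pub struct Route {
    pub pattern: String,
    pub kind: MatchKind,
    pub provider: String,
    /// When set, the backend receives this model name instead of the client's.
    pub rewrite_model: Option<String>,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Resolved {
    pub provider: String,
    pub model: String,
}

/// Small constructor to keep route tables readable.
pub fn route(pattern: &str, kind: MatchKind, provider: &str, rewrite: Option<&str>) -> Route {
    Route {
        pattern: pattern.to_string(),
        kind,
        provider: provider.to_string(),
        rewrite_model: rewrite.map(str::to_string),
    }
}

/// Assumption: "*" is a catch-all, modeled here as the empty prefix.
fn prefix_of(r: &Route) -> &str {
    if r.pattern == "*" { "" } else { r.pattern.as_str() }
}

/// Assumed precedence: exact match wins, otherwise the longest matching prefix.
pub fn resolve(routes: &[Route], model: &str) -> Option<Resolved> {
    let exact = routes
        .iter()
        .find(|r| r.kind == MatchKind::Exact && r.pattern == model);
    let best = exact.or_else(|| {
        routes
            .iter()
            .filter(|r| r.kind == MatchKind::Prefix && model.starts_with(prefix_of(r)))
            .max_by_key(|r| prefix_of(r).len())
    })?;
    Some(Resolved {
        provider: best.provider.clone(),
        // Without a rewrite, the client's model name is forwarded unchanged.
        model: best
            .rewrite_model
            .clone()
            .unwrap_or_else(|| model.to_string()),
    })
}
```

With a prefix route for claude- that rewrites to gpt-5.4, any claude-* model sent by Claude Code resolves to gpt-5.4 on the chosen provider, while a gpt- route without a rewrite passes the model name through.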
Start ferryllm:
RUST_LOG=ferryllm=info,tower_http=info \
Point Claude Code at ferryllm:
ANTHROPIC_API_KEY=dummy \
ANTHROPIC_BASE_URL=http://127.0.0.1:3000 \
Expected output:
pong
Desktop GUI
ferryllm also ships a native Tauri desktop control panel for editing configuration, starting and stopping the local gateway, validating configs, and launching Codex or Claude with the right local endpoint environment.

Install the desktop app from GitHub Releases:
- Windows: download and run the .exe or .msi installer.
- macOS: download and open the .dmg.
- Linux: download and install the .deb.
After opening the app, configure a provider, save the config, and click
Start. The GUI runs:
The packaged app first looks for the bundled ferryllm sidecar, then falls back
to a ferryllm executable on PATH. The Launch CLI and VS Code actions start Codex or
Claude with OPENAI_BASE_URL, ANTHROPIC_BASE_URL, or GEMINI_BASE_URL
pointing at the local gateway.
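The sidecar-then-PATH lookup order can be sketched as follows. This is only an illustration of the fallback described above, not the desktop app's code: resolve_gateway_binary is a hypothetical function, the existence check is passed in as a closure so the logic stays testable, and platform details such as the .exe suffix on Windows are ignored.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: prefer the bundled sidecar, otherwise take the
/// first `ferryllm` found in the given PATH directories.
pub fn resolve_gateway_binary(
    bundled: &Path,
    path_dirs: &[PathBuf],
    exists: impl Fn(&Path) -> bool,
) -> Option<PathBuf> {
    // 1. The bundled sidecar wins whenever it is present.
    if exists(bundled) {
        return Some(bundled.to_path_buf());
    }
    // 2. Otherwise walk PATH entries in order and take the first hit.
    path_dirs
        .iter()
        .map(|dir| dir.join("ferryllm"))
        .find(|candidate| exists(candidate))
}
```

In a real program the directories would typically come from std::env::split_paths over the PATH variable, and the exists closure would be |p| p.is_file().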


See docs/claude-code.md for persistent Claude Code and cc-switch setup.
Configuration
ferryllm uses TOML config. Secrets stay in environment variables.
[]
= "0.0.0.0:3000"
= 120
= 32
= "medium"
# Optional. Uncomment to cap in-flight requests.
# max_concurrent_requests = 128
# Optional. Uncomment to cap total requests per minute.
# rate_limit_per_minute = 600
# Optional non-streaming upstream resilience. Streaming requests are not retried.
# retry_attempts = 2
# retry_backoff_ms = 100
# circuit_breaker_failures = 5
# circuit_breaker_cooldown_secs = 30
[]
= "info"
= "text"
[]
= false
# api_keys_env = "FERRYLLM_API_KEYS"
# Optional per-client caps, keyed by the authenticated API key.
# per_key_rate_limit_per_minute = 120
# per_key_max_concurrent_requests = 8
[]
= true
[]
= true
= true
= true
= true
= "ferryllm"
# openai_prompt_cache_retention = "24h"
= true
= "0..1"
= false
= ["x-anthropic-billing-header:"]
[[]]
= "codexapis"
# Default path for controlling reasoning effort today.
= "openai_responses"
= "https://codexapis.com"
= "CODX_API_KEY"
# If you want the legacy Chat Completions path instead, switch this back to:
# type = "openai"
# Or use key_watch for hot-reload from external config files:
# [[providers.key_watch]]
# file = "C:/Users/hzz/.claude/settings.json"
# path = "env.ANTHROPIC_AUTH_TOKEN"
[[]]
= "cc-gpt55"
= "exact"
= "codexapis"
= "gpt-5.4"
[[]]
= "claude-"
= "codexapis"
= "gpt-5.4"
[[]]
= "gpt-"
= "codexapis"
[[]]
= "grok-"
= "codexapis"
[[]]
= "*"
= "codexapis"
= "gpt-5.4"
Check a config without starting the server:
For hot-reload API key configuration (e.g., from cc-switch settings), see the key_watch section in the configuration docs.
To route OpenAI-compatible upstream calls through the Responses API instead of
Chat Completions, use provider type openai_responses. Default builds,
including cargo install ferryllm, include this adapter. If you build with
--no-default-features, add the openai-responses feature explicitly:
[[]]
= "codexapis"
= "openai_responses"
= "https://codexapis.com"
= "CODX_API_KEY"
See examples/config/codexapis-responses.toml.
Reasoning Effort
Set the default model reasoning depth in TOML:
[]
= "medium"
Valid values are none, low, medium, high, xhigh, and x_high.
This default is applied only when the client request does not already include an explicit reasoning or thinking control. For Claude Code today, changing this in TOML is the practical way to control the forwarded OpenAI-compatible reasoning.effort.
Run with debug logging and look for reasoning=effort=... in the outbound request-shape log to confirm what ferryllm sent upstream.
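The "default applies only when the client is silent" rule can be sketched like this. It is a minimal illustration, not ferryllm's implementation: ReasoningEffort and effective_effort are hypothetical names, and treating xhigh and x_high as two spellings of the same level is an assumption based on the list of valid values.

```rust
use std::str::FromStr;

/// Hypothetical enum covering the documented config values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReasoningEffort {
    None,
    Low,
    Medium,
    High,
    XHigh,
}

impl FromStr for ReasoningEffort {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "none" => Ok(Self::None),
            "low" => Ok(Self::Low),
            "medium" => Ok(Self::Medium),
            "high" => Ok(Self::High),
            // Assumption: both spellings map to the same level.
            "xhigh" | "x_high" => Ok(Self::XHigh),
            other => Err(format!("invalid reasoning effort: {other}")),
        }
    }
}

/// An explicit client setting always wins; the TOML default only fills
/// the gap when the request carries no reasoning or thinking control.
pub fn effective_effort(
    client: Option<ReasoningEffort>,
    config_default: Option<ReasoningEffort>,
) -> Option<ReasoningEffort> {
    client.or(config_default)
}
```

Rejecting unknown strings at parse time means a typo in the config surfaces when the config is checked rather than as an upstream error.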
API Surface
| Endpoint | Purpose |
|---|---|
| POST /v1/chat/completions | OpenAI-compatible chat completions |
| POST /v1/responses | OpenAI Responses API |
| POST /responses | Responses API compatibility alias |
| POST /v1/messages | Anthropic-compatible messages |
| GET /v1/models | OpenAI-compatible model listing |
| GET /health | Simple health check |
| GET /healthz | Kubernetes-style liveness check |
| GET /readyz | Readiness check |
| GET /metrics | Prometheus-style metrics with per-provider/model labels |
Prompt Cache
ferryllm keeps prompt-cache keys stable while stripping transport metadata and normalizing the prompt prefix.
With prompt-observability enabled, ferryllm logs prompt-cache usage and exposes it through /metrics.
For Claude Code deployments, the important knobs are:
- relocate_system_prefix_range
- strip_system_line_prefixes
- openai_prompt_cache_key
- default_reasoning_effort
See docs/prompt-caching.md and docs/reasoning-control.md.
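To show why stripping transport metadata helps cache stability, here is a minimal sketch of line-prefix stripping on a system prompt. The function strip_system_lines is hypothetical, and the assumption that strip_system_line_prefixes drops whole lines starting with a configured prefix (such as x-anthropic-billing-header: from the sample config) is an interpretation, not confirmed behavior; docs/prompt-caching.md is the authority.

```rust
/// Hypothetical helper: drop every system-prompt line that starts with one
/// of the configured prefixes, so per-request metadata does not change the
/// text that the upstream prompt cache keys on.
pub fn strip_system_lines(system: &str, prefixes: &[&str]) -> String {
    system
        .lines()
        .filter(|line| {
            // Keep the line only if no configured prefix matches it.
            !prefixes.iter().any(|p| line.trim_start().starts_with(*p))
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Two requests whose system prompts differ only in such a header line produce the same stripped text, so the cacheable prefix stays identical between them.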
Architecture
src/
  adapter.rs   Adapter trait
  ir.rs        Unified request, response, content, tool, and stream types
  router.rs    Exact and prefix model routing
  server.rs    Axum HTTP server
  config.rs    TOML config loader and validator
  entry/       Client protocol translators
  adapters/    Backend provider adapters
More detail: docs/architecture.md.
Load Testing
ferryllm ships a benchmark-style load tester for local mock-upstream testing:
MOCK_DELAY_MS=20
See docs/load-testing.md.
Documentation
- Chinese README
- Architecture
- Claude Code setup
- Configuration
- Compatibility notes
- Deployment
- Load testing
- Prompt caching and token observability
- Reasoning control
Roadmap
- More provider adapters and provider-specific tuning
- Weighted and latency-aware provider pools
- Full config hot reload without managed-process restart
- Richer Prometheus metrics dimensions
- Per-key quota and usage accounting hooks
- Packaged Docker images and deployment templates
License
MIT. See LICENSE.