# ferryllm
Universal LLM protocol middleware for OpenAI, Anthropic, Claude Code, and OpenAI-compatible backends.
ferryllm is a Rust gateway that lets clients and providers speak different LLM protocols through one shared internal representation. Use it as a local Claude Code bridge, a private model gateway, or an embeddable adapter library.
## Highlights

- OpenAI-compatible entrypoint: `POST /v1/chat/completions`
- Anthropic-compatible entrypoint: `POST /v1/messages`
- OpenAI-compatible and Anthropic backend adapters
- Claude Code to OpenAI-compatible backend routing
- Model aliases, prefix routing, and model rewrite rules
- Streaming SSE translation with tool-call support
- Config-driven standalone server: `ferryllm serve --config ferryllm.toml`
- Request timeout, body limit, API-key auth, rate limits, concurrency caps, metrics, retry, fallback, circuit breaker, and prompt cache support
- Library-first architecture for adding new entry protocols and provider adapters
## Why
Most LLM gateways become an N x M matrix: every client protocol needs custom code for every provider protocol. ferryllm uses an N + M design instead.

    Client protocol -> ferryllm IR -> provider protocol

That means a new backend adapter can immediately serve OpenAI-style clients, Anthropic-style clients, and Claude Code without rewriting every path.
## Quick Start
Install from crates.io:
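```bash
cargo install ferryllm
```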
Or run from source:
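```bash
# from a checkout of the repository
cargo run --release -- serve --config ferryllm.toml
```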
Use an OpenAI-compatible provider key and start the server:
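```bash
# CODX_API_KEY is the provider key env var used in the example config below
export CODX_API_KEY=sk-...
RUST_LOG=info ferryllm serve --config ferryllm.toml
```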
Smoke test the Anthropic-compatible endpoint:
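A minimal Anthropic-format request; the model name assumes the `cc-gpt55` alias from the example config:

```bash
curl -s http://127.0.0.1:3000/v1/messages \
  -H 'content-type: application/json' \
  -d '{
        "model": "cc-gpt55",
        "max_tokens": 64,
        "messages": [{"role": "user", "content": "Say pong"}]
      }'
```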
## Claude Code With GPT-5.4
Claude Code sends Anthropic-format requests. ferryllm can receive those requests, rewrite the model, and forward them to an OpenAI-compatible backend.

    Claude Code
      -> POST /v1/messages, model = claude-*
      -> ferryllm Anthropic entry
      -> unified IR
      -> route match: claude-
      -> rewrite backend model: gpt-5.4
      -> OpenAI-compatible backend

Start ferryllm:
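```bash
RUST_LOG=ferryllm=info,tower_http=info \
  ferryllm serve --config ferryllm.toml
```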
Point Claude Code at ferryllm:
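The prompt below is just an example; any one-shot prompt that asks for a fixed reply works as a smoke test:

```bash
ANTHROPIC_API_KEY=dummy \
ANTHROPIC_BASE_URL=http://127.0.0.1:3000 \
  claude -p "Reply with exactly: pong"
```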
Expected output:

    pong
See docs/claude-code.md for persistent Claude Code and cc-switch setup.
## Configuration

ferryllm uses TOML configuration; secrets stay in environment variables. The example below sketches a typical ferryllm.toml. Key names in this sketch are illustrative; see the configuration doc for the exact schema.

    [server]
    listen = "0.0.0.0:3000"
    request_timeout_secs = 120
    max_body_mb = 32
    # Optional. Uncomment to cap in-flight requests.
    # max_concurrent_requests = 128
    # Optional. Uncomment to cap total requests per minute.
    # rate_limit_per_minute = 600
    # Optional non-streaming upstream resilience. Streaming requests are not retried.
    # retry_attempts = 2
    # retry_backoff_ms = 100
    # circuit_breaker_failures = 5
    # circuit_breaker_cooldown_secs = 30

    [logging]
    level = "info"
    format = "text"

    [auth]
    enabled = false
    # api_keys_env = "FERRYLLM_API_KEYS"
    # Optional per-client caps, keyed by the authenticated API key.
    # per_key_rate_limit_per_minute = 120
    # per_key_max_concurrent_requests = 8

    [metrics]
    enabled = true

    [prompt_cache]
    # LiteLLM-style automatic Anthropic cache breakpoint injection.
    # This preserves client-provided cache_control and adds ephemeral breakpoints
    # on stable system/tools/last user blocks when they are missing.
    enabled = true
    cache_system = true
    cache_tools = true
    cache_last_user = true
    # OpenAI-compatible prompt cache routing key. Keep it stable; do not include
    # request-specific conversation text in this value.
    openai_prompt_cache_key = "ferryllm"
    # Optional, if the upstream accepts it.
    # openai_prompt_cache_retention = "24h"
    # Safe long-running diagnostic: logs request structure, lengths, and hashes
    # without logging prompt text or tool-result bodies.
    prompt_observability = true
    # Claude Code may place a volatile line near the beginning of system text.
    # Move the full line intersecting this byte range into a later user context
    # block so stable system instructions remain first for provider prompt caches.
    system_volatile_range = "0..1"
    # Temporary investigation only: this prints the relocated prompt text.
    # Disable after confirming the moved line.
    log_relocated_text = false
    # Strip transport metadata lines from system before forwarding.
    # Example:
    # x-anthropic-billing-header: cc_version=2.1.128.9fd; cc_entrypoint=cli; cch=877c4;
    # This helps move volatile Claude Code metadata such as cc_version and
    # cc_entrypoint out of the stable system prefix.
    strip_system_line_prefixes = ["x-anthropic-billing-header:"]

    [[providers]]
    name = "codexapis"
    type = "openai"
    base_url = "https://codexapis.com"
    api_key_env = "CODX_API_KEY"

    [[routes]]
    pattern = "cc-gpt55"
    match = "exact"
    provider = "codexapis"
    backend_model = "gpt-5.4"

    [[routes]]
    pattern = "claude-"
    provider = "codexapis"
    backend_model = "gpt-5.4"

    [[routes]]
    pattern = "gpt-"
    provider = "codexapis"

    [[routes]]
    pattern = "grok-"
    provider = "codexapis"

    [[routes]]
    pattern = "*"
    provider = "codexapis"
    backend_model = "gpt-5.4"
Check a config without starting the server:
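The subcommand name below is a guess (check `ferryllm --help` for the actual one); the point is that validation runs without binding a port:

```bash
# hypothetical subcommand name; consult `ferryllm --help`
ferryllm validate --config ferryllm.toml
```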
## API Surface

| Endpoint | Purpose |
|---|---|
| `POST /v1/chat/completions` | OpenAI-compatible chat completions |
| `POST /v1/messages` | Anthropic-compatible messages |
| `GET /health` | Simple health check |
| `GET /healthz` | Kubernetes-style liveness check |
| `GET /readyz` | Readiness check |
| `GET /metrics` | Prometheus-style metrics |
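For example, against a local instance:

```bash
curl -s http://127.0.0.1:3000/health
curl -s http://127.0.0.1:3000/readyz
```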
## Architecture

    src/
      adapter.rs   Adapter trait
      ir.rs        Unified request, response, content, tool, and stream types
      router.rs    Exact and prefix model routing
      server.rs    Axum HTTP server
      config.rs    TOML config loader and validator
      entry/       Client protocol translators
      adapters/    Backend provider adapters

More detail: docs/architecture.md.
## Load Testing

ferryllm ships a benchmark-style load tester for local mock-upstream testing; the example run sets `MOCK_DELAY_MS=20` on the mock upstream. See docs/load-testing.md for the full commands.
## Documentation
- Chinese README
- Architecture
- Claude Code setup
- Configuration
- Compatibility notes
- Deployment
- Load testing
- Prompt caching and token observability
## Prompt Cache
With prompt-observability enabled, ferryllm exposes prompt-cache usage in
logs and /metrics. In the current Claude Code + Codex relay setup, we have
observed cache read rates around 99.8% on stable prompts when the system prefix
is normalized and volatile transport metadata is stripped.
The exact result depends on the upstream provider, prompt shape, and how stable the prefix is across requests. See docs/prompt-caching.md for the cache-key rules and the current tuning knobs.
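To eyeball cache behavior on a running instance (metric names aren't listed here, so filter broadly):

```bash
# metric names vary by version; filter broadly for cache-related series
curl -s http://127.0.0.1:3000/metrics | grep -i cache
```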
## Roadmap
- More provider adapters, including Gemini
- Weighted and latency-aware provider pools
- Hot-reload configuration
- Richer Prometheus metrics labels
- Per-key quota and usage accounting hooks
- Packaged Docker images and deployment templates
## License
MIT. See LICENSE.