# Configuration
This document describes the configuration model for the standalone ferryllm server.
ferryllm includes a config-file-driven binary, so users can run the server without writing custom Rust code:
```bash
ferryllm serve --config ferryllm.toml
```
## Design Goals
- Keep secrets out of committed files.
- Make model routing explicit and auditable.
- Support multiple providers and model rewrite rules.
- Support local proxy use cases and production relay deployments.
- Allow advanced routing features, such as fallback and weighted balancing, to be added later without breaking the basic format.
## Format
TOML is recommended for the first public server release.
Reasons:
- It is familiar in the Rust ecosystem.
- It is easy to read and review.
- It maps cleanly to structured configuration.
- It is less error-prone than large environment-variable-based setups.
Environment variables should still be used for secrets and runtime overrides.
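A minimal sketch of that split: the TOML file carries routing and limits, while the provider secret named by `api_key_env` (here `CODX_API_KEY`, matching the minimal example below) is exported in the shell before the server starts.
```bash
# Keep the provider secret out of the committed TOML file.
# CODX_API_KEY matches the api_key_env value used in the minimal example below.
export CODX_API_KEY="sk-..."   # placeholder value

ferryllm serve --config ferryllm.toml
```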
## Minimal Example
```toml
[server]
listen = "0.0.0.0:3000"
request_timeout_secs = 120
body_limit_mb = 32
max_concurrent_requests = 128
rate_limit_per_minute = 600

[logging]
level = "info"
format = "text"

[auth]
enabled = false

[metrics]
enabled = true

[[providers]]
name = "codexapis"
type = "openai"
base_url = "https://codexapis.com"
api_key_env = "CODX_API_KEY"

[[providers]]
name = "backup-openai"
type = "openai"
base_url = "https://backup.example.com"
api_key_env = "BACKUP_API_KEY"

[[routes]]
match = "cc-gpt55"
match_type = "exact"
provider = "codexapis"
rewrite_model = "gpt-5.5"

[[routes]]
match = "claude-"
provider = "codexapis"
rewrite_model = "gpt-5.5"
fallback_providers = ["backup-openai"]

[[routes]]
match = "gpt-"
provider = "codexapis"

[[routes]]
match = "*"
provider = "codexapis"
rewrite_model = "gpt-5.5"
```
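With this file loaded, a request for the `cc-gpt55` alias is routed to `codexapis` and rewritten to `gpt-5.5` upstream. The sketch below assumes ferryllm exposes an OpenAI-compatible `/v1/chat/completions` path; the exact ingress path is an assumption, not something this document defines.
```bash
# Hypothetical client call against the minimal config above; the chat path is an
# assumption about the OpenAI-compatible ingress.
curl -s http://127.0.0.1:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "cc-gpt55", "messages": [{"role": "user", "content": "hello"}]}'
```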
## Server Section
```toml
[server]
listen = "0.0.0.0:3000"
request_timeout_secs = 120
body_limit_mb = 32
max_concurrent_requests = 128
rate_limit_per_minute = 600
graceful_shutdown_secs = 30
```
Fields:
- `listen`: Address and port to bind.
- `request_timeout_secs`: Maximum request duration before ferryllm returns an error.
- `body_limit_mb`: Maximum request body size.
- `max_concurrent_requests`: Optional maximum number of in-flight OpenAI/Anthropic chat requests. Requests above the limit return `429`.
- `rate_limit_per_minute`: Optional global request rate cap. Requests above the limit return `429`.
- `graceful_shutdown_secs`: Time allowed for in-flight requests during shutdown.
## Logging Section
```toml
[logging]
level = "info"
format = "json"
```
Fields:
- `level`: One of `trace`, `debug`, `info`, `warn`, or `error`.
- `format`: `text` for local development, `json` for production log pipelines.
Request log entries should include:
- Incoming protocol (OpenAI or Anthropic).
- Incoming model name.
- Stream flag.
- Selected provider.
- Rewritten backend model.
- Upstream status.
- Upstream latency.
- Error category and message.
## Providers
```toml
[[providers]]
name = "openai"
type = "openai"
base_url = "https://api.openai.com"
api_key_env = "OPENAI_API_KEY"
```
Provider fields:
- `name`: Unique provider name used by routes.
- `type`: Adapter type, for example `openai` or `anthropic`.
- `base_url`: Provider base URL. Endpoint paths are not rewritten by routes.
- `api_key_env`: Environment variable containing the secret.
Provider-specific options can be added later:
```toml
connect_timeout_secs = 10
request_timeout_secs = 120
max_idle_connections = 256
```
## Routes
```toml
[[routes]]
match = "claude-"
provider = "codexapis"
rewrite_model = "gpt-5.5"
```
Route fields:
- `match`: Prefix match. `*` means catch-all.
- `match_type`: Optional. `prefix` by default, or `exact` for user-defined model aliases.
- `provider`: Provider name.
- `rewrite_model`: Optional backend model override.
- `fallback_providers`: Optional provider names tried in order for non-streaming requests when the primary provider fails.
Routes should be evaluated by longest-prefix match, so specific rules override broad defaults. For example, a request for `claude-3-opus` matches both `claude-` and `*`, and the `claude-` route wins because its prefix is longer.
## User-Defined Model Aliases
Users can expose their own model names to clients and map those names to provider models.
Exact alias example:
```toml
[[routes]]
match = "cc-gpt55"
match_type = "exact"
provider = "codexapis"
rewrite_model = "gpt-5.5"
```
A client can now request `cc-gpt55`, while ferryllm sends `gpt-5.5` to the upstream provider.
Prefix mapping example:
```toml
[[routes]]
match = "claude-"
match_type = "prefix"
provider = "codexapis"
rewrite_model = "gpt-5.5"
```
This is useful for clients such as Claude Code, which send Anthropic model names even when the backend is not Anthropic.
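For illustration, a Claude Code-style call could look like the sketch below. The `/v1/messages` path and request shape follow the Anthropic Messages API and are an assumption about ferryllm's Anthropic-compatible ingress.
```bash
# Hypothetical Anthropic-style request; the claude- prefix route above rewrites the
# model to gpt-5.5 before the request is forwarded to the codexapis provider.
curl -s http://127.0.0.1:3000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model": "claude-3-5-sonnet-latest", "max_tokens": 256, "messages": [{"role": "user", "content": "hello"}]}'
```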
## Fallback Routing
Simple fallback routing is supported for non-streaming requests:
```toml
[[providers]]
name = "primary"
type = "openai"
base_url = "https://primary.example.com"
api_key_env = "PRIMARY_API_KEY"

[[providers]]
name = "backup"
type = "openai"
base_url = "https://backup.example.com"
api_key_env = "BACKUP_API_KEY"

[[routes]]
match = "claude-"
provider = "primary"
rewrite_model = "gpt-5.5"
fallback_providers = ["backup"]
```
Fallbacks use the same rewritten backend model. Streaming fallback is intentionally not attempted after a stream has started.
## Future Route Strategies
The basic route format should leave room for advanced strategies.
Advanced fallback example:
```toml
[[routes]]
match = "claude-"
strategy = "fallback"

[[routes.targets]]
provider = "primary"
model = "gpt-5.5"

[[routes.targets]]
provider = "backup"
model = "gpt-5.4"
```
Weighted example:
```toml
[[routes]]
match = "gpt-"
strategy = "weighted"

[[routes.targets]]
provider = "provider-a"
weight = 80

[[routes.targets]]
provider = "provider-b"
weight = 20
```
## Authentication
For local usage, authentication can be disabled.
For public relay deployments, API key authentication should be enabled.
```toml
[auth]
enabled = true
api_keys_env = "FERRYLLM_API_KEYS"
per_key_rate_limit_per_minute = 120
per_key_max_concurrent_requests = 8
```
`FERRYLLM_API_KEYS` contains comma-separated keys:
```bash
export FERRYLLM_API_KEYS="key-one,key-two"
```
Clients can authenticate with either header:
```text
Authorization: Bearer key-one
x-api-key: key-one
```
Per-key limits are optional and only apply when authentication is enabled. They are tracked in-memory per server process and keyed by a hash of the authenticated API key, not by the raw key string.
## Prompt Cache
```toml
[prompt_cache]
auto_inject_anthropic_cache_control = true
cache_system = true
cache_tools = true
cache_last_user_message = true
openai_prompt_cache_key = "ferryllm"
# openai_prompt_cache_retention = "24h"
debug_log_request_shape = true
# relocate_system_prefix_range = "64..128"
# log_relocated_system_text = false
# strip_system_line_prefixes = ["x-anthropic-billing-header:"]
```
Fields:
- `auto_inject_anthropic_cache_control`: Preserve explicit Anthropic `cache_control`, and when missing, inject `{"type":"ephemeral"}` on stable cache breakpoints.
- `cache_system`: Add a breakpoint to top-level Anthropic system text.
- `cache_tools`: Add a breakpoint to the last Anthropic tool definition.
- `cache_last_user_message`: Add a breakpoint to the last cacheable block in the latest user message.
- `openai_prompt_cache_key`: Stable, low-cardinality key sent to OpenAI-compatible backends when supported.
- `openai_prompt_cache_retention`: Optional retention hint sent to OpenAI-compatible backends when supported.
- `debug_log_request_shape`: Log outbound request structure, lengths, and stable hashes without logging prompt text. Keep this enabled when diagnosing provider-side prompt cache misses.
- `relocate_system_prefix_range`: Optional `start..end` byte range. ferryllm moves the full system line intersecting this range into a user context block at the end of the message list, preserving the text while keeping stable prompt content first for provider prompt caches.
- `log_relocated_system_text`: Print the relocated text verbatim for diagnosis. This can expose prompt content; keep it disabled outside short investigations.
- `strip_system_line_prefixes`: Remove system lines that start with one of these prefixes and append them to trailing user context messages. Use this for transport metadata or other non-semantic boilerplate that should not affect cache prefix stability.
This follows LiteLLM-style prompt caching practice: cache stable prefixes, do not mark every block, and avoid injecting Anthropic-only metadata into OpenAI-compatible outbound requests.
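For reference, an explicit Anthropic-style cache breakpoint that ferryllm preserves rather than re-injects could look like the sketch below; the `/v1/messages` path is the same assumed ingress as in the alias example, and the request shape follows the Anthropic Messages API.
```bash
# Explicit cache_control on a system block. With auto_inject_anthropic_cache_control
# enabled, ferryllm preserves this marker instead of injecting one on the system block.
curl -s http://127.0.0.1:3000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
        "model": "claude-3-5-sonnet-latest",
        "max_tokens": 256,
        "system": [
          {"type": "text", "text": "Long, stable system prompt ...",
           "cache_control": {"type": "ephemeral"}}
        ],
        "messages": [{"role": "user", "content": "hello"}]
      }'
```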
## Metrics
```toml
[metrics]
enabled = true
```
When enabled, ferryllm exposes Prometheus-style counters at `/metrics`:
```text
ferryllm_requests_total
ferryllm_requests_ok_total
ferryllm_requests_error_total
ferryllm_upstream_errors_total
```
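The counters can be scraped like any other Prometheus text endpoint:
```bash
# Scrape the metrics endpoint exposed by the server configured above.
curl -s http://127.0.0.1:3000/metrics
```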
## Provider Resilience
Non-streaming upstream requests can be retried with exponential backoff:
```toml
[server]
retry_attempts = 2
retry_backoff_ms = 100
circuit_breaker_failures = 5
circuit_breaker_cooldown_secs = 30
```
`retry_attempts` is the number of retries after the first attempt. The default is `0`, which preserves fail-fast behavior. Streaming requests are not retried because ferryllm may already have started sending tokens to the client.
When `circuit_breaker_failures` is set, ferryllm tracks consecutive failures per provider. Once the threshold is reached, that provider is short-circuited until `circuit_breaker_cooldown_secs` has elapsed. Fallback providers can still be tried while the primary provider circuit is open.
## Validation Rules
The server should fail fast during startup when:
- A provider referenced by a route does not exist.
- A provider secret environment variable is missing.
- Two providers have the same name.
- A route has neither a provider nor a strategy.
- A timeout, body size, or weight value is invalid.
## Security Notes
- Never commit real API keys.
- Prefer `api_key_env` over inline `api_key`.
- Redact authorization headers in logs.
- Redact request bodies by default.
- Log model names, provider names, status codes, and latency instead of full prompts.