ferryllm 0.1.5

Universal LLM protocol middleware for OpenAI, Anthropic, Claude Code, and OpenAI-compatible backends.
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
# Configuration


This document describes the configuration model for the standalone ferryllm server.

ferryllm includes a config-file driven binary, so users can run the server without writing custom Rust code:

```bash
ferryllm serve --config ferryllm.toml
```

## Design Goals


- Keep secrets out of committed files.
- Make model routing explicit and auditable.
- Support multiple providers and model rewrite rules.
- Support local proxy use cases and production relay deployments.
- Allow advanced routing features such as fallback and weighted balancing later without breaking the basic format.

## Format


TOML is recommended for the first public server release.

Reasons:

- It is familiar in the Rust ecosystem.
- It is easy to read and review.
- It maps cleanly to structured configuration.
- It is less error-prone than large environment-variable based setups.

Environment variables should still be used for secrets and runtime overrides.

## Minimal Example


```toml
[server]
listen = "0.0.0.0:3000"
request_timeout_secs = 120
body_limit_mb = 32
max_concurrent_requests = 128
rate_limit_per_minute = 600
default_reasoning_effort = "medium"

[logging]
level = "info"
format = "text"

[auth]
enabled = false

[metrics]
enabled = true

[[providers]]
name = "codexapis"
type = "openai"
base_url = "https://codexapis.com"
api_key_env = "CODX_API_KEY"

[[routes]]
match = "cc-gpt55"
match_type = "exact"
provider = "codexapis"
rewrite_model = "gpt-5.5"

[[routes]]
match = "claude-"
provider = "codexapis"
rewrite_model = "gpt-5.5"
fallback_providers = ["backup-openai"]

[[routes]]
match = "gpt-"
provider = "codexapis"

[[routes]]
match = "*"
provider = "codexapis"
rewrite_model = "gpt-5.5"
```

## Server Section


```toml
[server]
listen = "0.0.0.0:3000"
request_timeout_secs = 120
body_limit_mb = 32
max_concurrent_requests = 128
rate_limit_per_minute = 600
graceful_shutdown_secs = 30
```

Fields:

- `listen`: Address and port to bind.
- `request_timeout_secs`: Maximum request duration before ferryllm returns an error.
- `body_limit_mb`: Maximum request body size.
- `max_concurrent_requests`: Optional maximum number of in-flight OpenAI/Anthropic chat requests. Requests above the limit return `429`.
- `rate_limit_per_minute`: Optional global request rate cap. Requests above the limit return `429`.
- `graceful_shutdown_secs`: Time allowed for in-flight requests during shutdown.
- `default_reasoning_effort`: Optional default reasoning effort applied only when the client does not send an explicit reasoning or thinking control. Valid values are `none`, `low`, `medium`, `high`, `xhigh`, and `x_high`.

`default_reasoning_effort` is a control-plane setting. It is not included in ferryllm's prompt cache key.

## Logging Section


```toml
[logging]
level = "info"
format = "json"
```

Fields:

- `level`: One of `trace`, `debug`, `info`, `warn`, or `error`.
- `format`: `text` for local development, `json` for production log pipelines.

Expected logs should include:

- Incoming protocol entry.
- Incoming model name.
- Stream flag.
- Selected provider.
- Rewritten backend model.
- Upstream status.
- Upstream latency.
- Error category and message.

## Providers


```toml
[[providers]]
name = "openai"
type = "openai"
base_url = "https://api.openai.com"
api_key_env = "OPENAI_API_KEY"
```

When built with `--features openai-responses`, an OpenAI-compatible provider can
also send upstream requests to `/v1/responses`:

```toml
[[providers]]
name = "openai-responses"
type = "openai_responses"
base_url = "https://api.openai.com"
api_key_env = "OPENAI_API_KEY"
```

If you want the Responses path to be the default example, use
`type = "openai_responses"` in your config. Keep the legacy `openai` variant as
the commented fallback if you need the older Chat Completions path.

Provider fields:

- `name`: Unique provider name used by routes.
- `type`: Adapter type, for example `openai`, `openai_responses`, or `anthropic`.
- `base_url`: Provider base URL without endpoint path rewriting in routes.
- `api_key_env`: Environment variable containing the secret.

Provider-specific options can be added later:

```toml
connect_timeout_secs = 10
request_timeout_secs = 120
max_idle_connections = 256
```

## Key Watch (Hot-Reload API Keys)


The `key_watch` feature allows ferryllm to automatically reload API keys when configuration files change. This is particularly useful when using tools like cc-switch that manage API keys in external files.

```toml
[[providers]]
name = "my-provider"
type = "anthropic"
base_url = "https://api.anthropic.com"

[[providers.key_watch]]
file = "C:/Users/hzz/.claude/settings.json"
path = "env.ANTHROPIC_AUTH_TOKEN"
```

**Fields:**

- `file`: Path to the configuration file (supports JSON and TOML formats)
- `path`: Dotted JSON path to the key value (e.g., `env.ANTHROPIC_AUTH_TOKEN` for Anthropic keys, or `OPENAI_API_KEY` for OpenAI keys)

The provider attempts each key_watch entry in order and uses the first non-empty key found. When the watched file changes, ferryllm automatically reloads the key without requiring a server restart.

**File formats:**

JSON file (`~/.claude/settings.json`):
```json
{
  "env": {
    "ANTHROPIC_AUTH_TOKEN": "sk-ant-..."
  }
}
```

TOML file (`~/.codex/auth.toml`):
```toml
OPENAI_API_KEY = "sk-..."
ANTHROPIC_AUTH_TOKEN = "sk-ant-..."
```

You can also watch multiple files for the same provider. This is useful when switching between different configuration sources:

```toml
[[providers]]
name = "my-provider"
type = "anthropic"
base_url = "https://api.anthropic.com"

[[providers.key_watch]]
file = "C:/Users/hzz/.claude/settings.json"
path = "env.ANTHROPIC_AUTH_TOKEN"

[[providers.key_watch]]
file = "C:/Users/hzz/.codex/auth.json"
path = "ANTHROPIC_AUTH_TOKEN"
```

**Notes:**

- A provider can only use one key source: `api_key_env`, `api_key_url`, `api_key_file`, or `key_watch`
- When `key_watch` is used, the server logs the key prefix (first 8 characters) for debugging
- File change events are logged at trace/debug level

## Routes


```toml
[[routes]]
match = "claude-"
provider = "codexapis"
rewrite_model = "gpt-5.5"
```

Route fields:

- `match`: Prefix match. `*` means catch-all.
- `match_type`: Optional. `prefix` by default, or `exact` for user-defined model aliases.
- `provider`: Provider name.
- `rewrite_model`: Optional backend model override.
- `fallback_providers`: Optional provider names tried in order for non-streaming requests when the primary provider fails.

Routes should be evaluated by longest-prefix match. This lets specific rules override broad defaults.

## User-Defined Model Aliases


Users can expose their own model names to clients and map those names to provider models.

Exact alias example:

```toml
[[routes]]
match = "cc-gpt55"
match_type = "exact"
provider = "codexapis"
rewrite_model = "gpt-5.5"
```

A client can now request `cc-gpt55`, while ferryllm sends `gpt-5.5` to the upstream provider.

Prefix mapping example:

```toml
[[routes]]
match = "claude-"
match_type = "prefix"
provider = "codexapis"
rewrite_model = "gpt-5.5"
```

This is useful for clients such as Claude Code, which send Anthropic model names even when the backend is not Anthropic.

## Fallback Routing


Simple fallback routing is supported for non-streaming requests:

```toml
[[providers]]
name = "primary"
type = "openai"
base_url = "https://primary.example.com"
api_key_env = "PRIMARY_API_KEY"

[[providers]]
name = "backup"
type = "openai"
base_url = "https://backup.example.com"
api_key_env = "BACKUP_API_KEY"

[[routes]]
match = "claude-"
provider = "primary"
rewrite_model = "gpt-5.5"
fallback_providers = ["backup"]
```

Fallbacks use the same rewritten backend model. Streaming fallback is intentionally not attempted after a stream has started.

## Future Route Strategies


The basic route format should leave room for advanced strategies.

Advanced fallback example:

```toml
[[routes]]
match = "claude-"
strategy = "fallback"

[[routes.targets]]
provider = "primary"
model = "gpt-5.5"

[[routes.targets]]
provider = "backup"
model = "gpt-5.4"
```

Weighted example:

```toml
[[routes]]
match = "gpt-"
strategy = "weighted"

[[routes.targets]]
provider = "provider-a"
weight = 80

[[routes.targets]]
provider = "provider-b"
weight = 20
```

## Authentication


For local usage, authentication can be disabled.

For public relay deployments, API key authentication should be enabled.

```toml
[auth]
enabled = true
api_keys_env = "FERRYLLM_API_KEYS"
per_key_rate_limit_per_minute = 120
per_key_max_concurrent_requests = 8
```

`FERRYLLM_API_KEYS` contains comma-separated keys:

```bash
export FERRYLLM_API_KEYS="key-one,key-two"
```

Clients can authenticate with either header:

```text
Authorization: Bearer key-one
x-api-key: key-one
```

Per-key limits are optional and only apply when authentication is enabled. They are tracked in-memory per server process and keyed by a hash of the authenticated API key, not by the raw key string.

## Prompt Cache


```toml
[prompt_cache]
auto_inject_anthropic_cache_control = true
cache_system = true
cache_tools = true
cache_last_user_message = true
openai_prompt_cache_key = "ferryllm"
# openai_prompt_cache_retention = "24h"

debug_log_request_shape = true
# relocate_system_prefix_range = "64..128"

# log_relocated_system_text = false

# strip_system_line_prefixes = ["x-anthropic-billing-header:"]

```

Fields:

- `auto_inject_anthropic_cache_control`: Preserve explicit Anthropic `cache_control`, and when missing, inject `{"type":"ephemeral"}` on stable cache breakpoints.
- `cache_system`: Add a breakpoint to top-level Anthropic system text.
- `cache_tools`: Add a breakpoint to the last Anthropic tool definition.
- `cache_last_user_message`: Add a breakpoint to the last cacheable block in the latest user message.
- `openai_prompt_cache_key`: Stable, low-cardinality key sent to OpenAI-compatible backends when supported.
- `openai_prompt_cache_retention`: Optional retention hint sent to OpenAI-compatible backends when supported.
- `debug_log_request_shape`: Log outbound request structure, lengths, and stable hashes without logging prompt text. Keep this enabled when diagnosing provider-side prompt cache misses.
- `relocate_system_prefix_range`: Optional `start..end` byte range. ferryllm moves the full system line intersecting this range into a user context block at the end of the message list, preserving the text while keeping stable prompt content first for provider prompt caches.
- `log_relocated_system_text`: Print the relocated text verbatim for diagnosis. This can expose prompt content; keep it disabled outside short investigations.
- `strip_system_line_prefixes`: Drop system lines that start with one of these prefixes. Use this for transport metadata or other non-semantic boilerplate that should not affect cache prefix stability or reach the model as user content. A common Claude Code example is `x-anthropic-billing-header: cc_version=...; cc_entrypoint=...;`.

This follows LiteLLM-style prompt caching practice: cache stable prefixes, do not mark every block, and avoid injecting Anthropic-only metadata into OpenAI-compatible outbound requests.

## Metrics


```toml
[metrics]
enabled = true
```

When enabled, ferryllm exposes Prometheus-style counters at `/metrics`:

```text
ferryllm_requests_total
ferryllm_requests_ok_total
ferryllm_requests_error_total
ferryllm_upstream_errors_total
```

## Provider Resilience


Non-streaming upstream requests can be retried with exponential backoff:

```toml
[server]
retry_attempts = 2
retry_backoff_ms = 100
circuit_breaker_failures = 5
circuit_breaker_cooldown_secs = 30
```

`retry_attempts` is the number of retries after the first attempt. The default is `0`, which preserves fail-fast behavior. Streaming requests are not retried because ferryllm may already have started sending tokens to the client.

When `circuit_breaker_failures` is set, ferryllm tracks consecutive failures per provider. Once the threshold is reached, that provider is short-circuited until `circuit_breaker_cooldown_secs` has elapsed. Fallback providers can still be tried while the primary provider circuit is open.

## Validation Rules


The server should fail fast during startup when:

- A provider referenced by a route does not exist.
- A provider secret environment variable is missing.
- Two providers have the same name.
- A route has neither a provider nor a strategy.
- A timeout, body size, or weight value is invalid.

## Security Notes


- Never commit real API keys.
- Prefer `api_key_env` over inline `api_key`.
- Redact authorization headers in logs.
- Redact request bodies by default.
- Log model names, provider names, status codes, and latency instead of full prompts.