# Benchmarking

ferryllm includes a load tester in the spirit of `redis-benchmark`.
It is meant to measure your own server, machine, and deployment settings without spending real provider tokens.

## Start A Mock Upstream

```bash
MOCK_DELAY_MS=20 cargo run --example mock_openai_upstream --features http
```

The mock listens on `127.0.0.1:4010` and returns a small OpenAI-compatible chat response.
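Before pointing ferryllm at the mock, you can sanity-check it with a plain chat completion request. The `/v1/chat/completions` path, model name, and request body below are assumptions based on the mock being OpenAI-compatible; adjust them if the example upstream expects something different.

```bash
# Hypothetical smoke test against the mock upstream. The path and body
# assume a standard OpenAI-compatible chat completions endpoint.
curl -s http://127.0.0.1:4010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "ping"}],
    "max_tokens": 16
  }'
```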

## Start ferryllm

In another shell:

```bash
export MOCK_OPENAI_API_KEY=dummy
cargo run --bin ferryllm -- serve --config examples/config/mock-openai.toml
```

This starts ferryllm on `127.0.0.1:3000` and routes all models to the mock upstream.
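A quick way to confirm routing end to end is to send the same kind of request to ferryllm itself. The endpoint path, model name, and bearer token here are assumptions, not values the project mandates; use whatever your config actually expects.

```bash
# Hypothetical end-to-end check through ferryllm. Assumes an OpenAI-style
# chat endpoint on the proxy and that a "dummy" bearer token is accepted.
curl -s http://127.0.0.1:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer dummy" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "ping"}],
    "max_tokens": 16
  }'
```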

## Run The Benchmark Tool

The Rust benchmark example is the main tool for load testing:

```bash
cargo run --release --example load_test --features http -- \
  --preset mock-anthropic \
  --requests 10000 \
  --concurrency 512
```

Useful flags:

- `--preset mock-anthropic|mock-openai|anthropic-messages|openai-chat`
- `--protocol anthropic|openai`
- `--url`
- `--model`
- `--prompt`
- `--max-tokens`
- `--requests`
- `--concurrency`
- `--timeout`
- `--bearer`

Examples:

```bash
cargo run --release --example load_test --features http -- --protocol anthropic --requests 1000 --concurrency 128
cargo run --release --example load_test --features http -- --preset mock-openai --requests 1000 --concurrency 128
cargo run --release --example load_test --features http -- --protocol openai --url http://127.0.0.1:3000/v1/chat/completions --requests 1000 --concurrency 128
cargo run --release --example load_test --features http -- --requests 5000 --concurrency 512 --timeout 20
```
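The per-request flags can be combined when targeting a custom OpenAI-compatible endpoint. The URL, model, prompt, and token below are placeholder values for illustration, not defaults shipped with the project.

```bash
# Placeholder values: substitute your own endpoint, model, and token.
cargo run --release --example load_test --features http -- \
  --protocol openai \
  --url http://127.0.0.1:3000/v1/chat/completions \
  --model gpt-4o-mini \
  --prompt "Say hello in one word." \
  --max-tokens 32 \
  --bearer dummy \
  --requests 2000 \
  --concurrency 256
```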

The benchmark prints JSON-like output with:

- `requests_per_second`
- status code counts
- error count
- latency min/mean/p50/p95/p99/max

## Notes

- Do not use the benchmark against paid providers unless you intentionally want to spend tokens.
- Increase `MOCK_DELAY_MS` to simulate slower upstreams.
- Use `max_concurrent_requests` and `rate_limit_per_minute` in the config to validate `429` behavior (see the burst example after this list).
- For results that reflect a public deployment, run the benchmark on hardware close to the real deployment target.
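One way to exercise the `429` path: lower the limits in your ferryllm config, restart the server, then fire a burst well above the limit and look at the status code counts in the output. The preset and numbers below are illustrative, not recommended values.

```bash
# Assumes you have already lowered max_concurrent_requests and/or
# rate_limit_per_minute in your ferryllm config and restarted the server.
# A burst well above the limit should show 429s in the status code counts.
cargo run --release --example load_test --features http -- \
  --preset mock-openai \
  --requests 2000 \
  --concurrency 512
```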