onwards 0.19.2

A flexible LLM proxy library
Documentation

Onwards

Crates.io Documentation GitHub

A Rust-based AI Gateway that provides a unified interface for routing requests to OpenAI-compatible targets. The goal is to be as "transparent" as possible.

Read the full documentation

Quickstart

Create a config.json:

{
  "targets": {
    "gpt-4": {
      "url": "https://api.openai.com",
      "onwards_key": "sk-your-openai-key",
      "onwards_model": "gpt-4"
    }
  }
}

Start the gateway:

cargo run -- -f config.json

Send a request:

curl -X POST http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Features

  • Unified routing to any OpenAI-compatible provider
  • Hot-reloading configuration with automatic file watching
  • Authentication with global and per-target API keys
  • Rate limiting and concurrency limiting (per-target and per-key)
  • Load balancing with weighted random and priority strategies
  • Automatic failover across multiple providers
  • Strict mode for request validation and error standardization
  • Response sanitization for OpenAI schema compliance
  • Prometheus metrics
  • Custom response headers

Performance Tuning

Connection Pooling

Onwards uses HTTP connection pooling to dramatically improve performance under load by reusing connections instead of creating new ones for each request. This eliminates the 1:1 request-to-file-descriptor ratio and prevents TIME_WAIT connection accumulation.

Configure via config file:

{
  "targets": {
    "gpt-4": {
      "url": "https://api.openai.com",
      "onwards_key": "sk-your-openai-key"
    }
  },
  "http_pool": {
    "max_idle_per_host": 100,
    "idle_timeout_secs": 90
  }
}

When to increase pool-max-idle-per-host:

The pool limit is applied per upstream host. Choose based on your deployment pattern:

Scenario 1: Fan-out (Multiple Upstreams)

Example: Main server routes to 10+ different model providers

  • Recommendation: max_idle_per_host: 200-300
  • Why: Traffic spreads across many upstreams, each gets moderate volume
  • Math: 10 providers × 200 connections = 2,000 total pooled connections
# Fan-out configuration
{
  "http_pool": {
    "max_idle_per_host": 300,
    "idle_timeout_secs": 90
  }
}

Scenario 2: Single Upstream (High Concurrency)

Example: Gateway in front of a single vLLM server handling all traffic

  • Recommendation: max_idle_per_host: 1000-2000
  • Why: ALL traffic goes to one host - needs high capacity to avoid creating new connections
  • Math: Peak 2000 concurrent requests → 2000 pooled connections reused across all requests
# Single upstream configuration
{
  "http_pool": {
    "max_idle_per_host": 2000,
    "idle_timeout_secs": 120
  }
}

Default values: If http_pool is omitted, defaults are 100 max idle connections per host and 90 second timeout.

Rule of thumb: Set pool-max-idle-per-host >= your expected peak concurrent requests per upstream host. If the pool is too small, new connections will be created beyond the pool limit, reducing the performance benefit.

Idle Timeout

Why 90 seconds is optimal:

The pool-idle-timeout-secs setting (default: 90s) controls how long idle connections stay in the pool:

  • HTTP/2 standard: Recommends keeping connections alive for 2+ minutes
  • Cloud load balancers: Typically timeout after 60-120 seconds
  • 90s balances:
    • ✅ Long enough to reuse connections between request bursts
    • ✅ Short enough to avoid holding stale connections upstream LBs have closed
    • ✅ Prevents connection leak from forgotten idle connections

When to adjust:

  • Increase to 120s for more aggressive connection reuse (bursty traffic with gaps)
  • Decrease to 60s if you see connection errors (upstream closing connections sooner)

Monitoring Connection Usage

# Monitor file descriptor usage (Linux/macOS)
lsof -p $(pgrep onwards) | wc -l

# Check connection states
ss -s  # Linux
netstat -an | grep TIME_WAIT | wc -l  # macOS/Linux

Expected improvements with connection pooling:

  • Before: 1000+ file descriptors for 200 concurrent requests (1:1 ratio)
  • After: 150-300 file descriptors for 200 concurrent requests (10:1 reuse ratio)
  • TIME_WAIT connections drop from thousands to near-zero

Documentation

Full documentation is available at doublewordai.github.io/onwards, covering: