llm-edge-agent

High-performance LLM intercepting proxy with intelligent caching, routing, and observability.

Overview

The llm-edge-agent binary is the main executable for the LLM Edge Agent system - an enterprise-grade intercepting proxy for Large Language Model APIs. It sits between your applications and LLM providers (OpenAI, Anthropic, etc.), providing intelligent request routing, multi-tier caching, cost optimization, and comprehensive observability.

Key Features:

  • High Performance: 1000+ RPS throughput, <50ms proxy overhead
  • Intelligent Caching: Multi-tier (L1 Moka + L2 Redis) with 70%+ hit rates
  • Smart Routing: Model-based, cost-optimized, latency-optimized, and failover strategies
  • Multi-Provider Support: OpenAI, Anthropic, with easy extensibility
  • Enterprise Observability: Prometheus metrics, Grafana dashboards, Jaeger tracing
  • Production Ready: Comprehensive testing, security hardening, and chaos-engineering validation

Installation

From Crates.io (when published)

cargo install llm-edge-agent

Building from Source

# Clone the repository
git clone https://github.com/globalbusinessadvisors/llm-edge-agent.git
cd llm-edge-agent

# Build the binary
cargo build --release --package llm-edge-agent

# The binary will be at: target/release/llm-edge-agent

Using Docker

# Pull the image (when published)
docker pull llm-edge-agent:latest

# Or build locally
docker build -t llm-edge-agent .

Quick Start

Prerequisites

At least one LLM provider API key is required:

  • OpenAI API key (for GPT models)
  • Anthropic API key (for Claude models)

Optional infrastructure:

  • Redis 7.0+ (for L2 distributed caching; a quick local setup is shown after this list)
  • Prometheus (for metrics collection)
  • Grafana (for dashboards)
  • Jaeger (for distributed tracing)
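
If you want to try the optional L2 cache locally, a quick way to start Redis is with Docker. This is only a local sketch (the container name is arbitrary); any Redis 7.0+ instance reachable via REDIS_URL works the same way.

# Start a local Redis 7 instance for the L2 cache
docker run -d --name llm-edge-redis -p 6379:6379 redis:7

# Point the agent at it
export ENABLE_L2_CACHE=true
export REDIS_URL=redis://localhost:6379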

Configuration

Create a .env file or set environment variables:

# Required: At least one provider API key
OPENAI_API_KEY=sk-your-openai-key
ANTHROPIC_API_KEY=sk-ant-your-anthropic-key

# Optional: Server configuration
HOST=0.0.0.0
PORT=8080
METRICS_PORT=9090

# Optional: L2 Cache (Redis)
ENABLE_L2_CACHE=true
REDIS_URL=redis://localhost:6379

# Optional: Observability
ENABLE_TRACING=true
ENABLE_METRICS=true
RUST_LOG=info,llm_edge_agent=debug
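
If you keep these settings in a .env file, one way to load them into your shell before launching the binary is shown below. This is a plain-shell sketch (it assumes simple KEY=value lines) and is only needed if neither your process manager nor the binary itself loads .env.

# Export every KEY=value pair from .env into the current shell
set -a
source .env
set +a

llm-edge-agent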

Running the Binary

Standalone (L1 cache only):

# Set API key
export OPENAI_API_KEY=sk-your-key

# Run the binary
llm-edge-agent

With full infrastructure (recommended):

# Start complete stack with Docker Compose
docker-compose -f docker-compose.production.yml up -d

# The agent will automatically connect to Redis, Prometheus, etc.

Check health:

curl http://localhost:8080/health

Making Your First Request

The proxy exposes an OpenAI-compatible API:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "user",
        "content": "Hello, world!"
      }
    ]
  }'

Response includes metadata:

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "gpt-3.5-turbo",
  "choices": [...],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 15,
    "total_tokens": 25
  },
  "metadata": {
    "provider": "openai",
    "cached": false,
    "cache_tier": null,
    "latency_ms": 523,
    "cost_usd": 0.000125
  }
}
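
Because responses are cached, sending the exact same request a second time should flip the cache fields in the metadata. A quick check (jq is only used for pretty-printing; the request body must be identical between calls):

# First call: expect "cached": false; repeat: expect "cached": true with a cache_tier
BODY='{"model":"gpt-3.5-turbo","messages":[{"role":"user","content":"Hello, world!"}]}'

for i in 1 2; do
  curl -s -X POST http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer your-api-key" \
    -d "$BODY" | jq '.metadata'
done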

Usage

Environment Variables

Variable            Default    Description
HOST                0.0.0.0    Server bind address
PORT                8080       HTTP server port
METRICS_PORT        9090       Prometheus metrics port
OPENAI_API_KEY      -          OpenAI API key (required if using OpenAI)
ANTHROPIC_API_KEY   -          Anthropic API key (required if using Anthropic)
ENABLE_L2_CACHE     false      Enable Redis L2 cache
REDIS_URL           -          Redis connection URL
ENABLE_TRACING      true       Enable distributed tracing
ENABLE_METRICS      true       Enable Prometheus metrics
RUST_LOG            info       Logging configuration

API Endpoints

Main Proxy Endpoint:

  • POST /v1/chat/completions - OpenAI-compatible chat completions

Health & Monitoring (see the smoke test after this list):

  • GET /health - Detailed system health status
  • GET /health/ready - Kubernetes readiness probe
  • GET /health/live - Kubernetes liveness probe
  • GET /metrics - Prometheus metrics
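
A quick smoke test of these endpoints from the command line (defaults PORT=8080 and METRICS_PORT=9090 assumed):

# Health and Kubernetes probes
curl -s http://localhost:8080/health
curl -si http://localhost:8080/health/ready | head -n 1
curl -si http://localhost:8080/health/live | head -n 1

# Prometheus metrics
curl -s http://localhost:9090/metrics | head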

Supported Models

OpenAI:

  • gpt-4, gpt-4-turbo, gpt-4o
  • gpt-3.5-turbo

Anthropic:

  • claude-3-opus, claude-3-sonnet, claude-3-haiku
  • claude-2.1, claude-2.0

The proxy automatically routes requests to the appropriate provider based on the model name.
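
For example, sending a Claude model name to the same endpoint should be routed to Anthropic, which you can confirm from metadata.provider in the response (assumes ANTHROPIC_API_KEY is set; jq is optional):

# Same endpoint, different model: the provider is chosen from the model name
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-api-key" \
  -d '{"model":"claude-3-haiku","messages":[{"role":"user","content":"Hello!"}]}' \
  | jq '.metadata.provider'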

Architecture

The binary integrates all LLM Edge Agent components:

llm-edge-agent (binary)
├── HTTP Server (Axum)
│   ├── Request validation
│   ├── Health check endpoints
│   └── Metrics endpoint
│
├── Cache Layer (llm-edge-cache)
│   ├── L1: In-memory (Moka)
│   └── L2: Distributed (Redis)
│
├── Routing Layer (llm-edge-routing)
│   ├── Model-based routing
│   ├── Cost optimization
│   ├── Latency optimization
│   └── Failover support
│
├── Provider Layer (llm-edge-providers)
│   ├── OpenAI adapter
│   └── Anthropic adapter
│
└── Observability (llm-edge-monitoring)
    ├── Prometheus metrics
    ├── Distributed tracing
    └── Structured logging

Deployment

Docker

# Build image
docker build -t llm-edge-agent .

# Run container
docker run -d \
  -p 8080:8080 \
  -p 9090:9090 \
  -e OPENAI_API_KEY=sk-your-key \
  -e ENABLE_L2_CACHE=false \
  --name llm-edge-agent \
  llm-edge-agent:latest
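
To pair the container with a Redis L2 cache, one approach is a user-defined Docker network so the agent can reach Redis by name (the network and container names below are arbitrary):

# Shared network plus a Redis 7 container
docker network create llm-edge-net
docker run -d --name redis --network llm-edge-net redis:7

# Run the agent on the same network with the L2 cache enabled
docker run -d \
  --network llm-edge-net \
  -p 8080:8080 \
  -p 9090:9090 \
  -e OPENAI_API_KEY=sk-your-key \
  -e ENABLE_L2_CACHE=true \
  -e REDIS_URL=redis://redis:6379 \
  --name llm-edge-agent \
  llm-edge-agent:latest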

Docker Compose

See docker-compose.production.yml in the repository root for a complete production-ready stack including:

  • LLM Edge Agent (3 replicas)
  • Redis cluster (3 nodes)
  • Prometheus
  • Grafana (with pre-built dashboards)
  • Jaeger

Kubernetes

# Create namespace
kubectl create namespace llm-edge-production

# Create secrets
kubectl create secret generic llm-edge-secrets \
  --from-literal=openai-api-key="sk-..." \
  --from-literal=anthropic-api-key="sk-ant-..." \
  -n llm-edge-production

# Deploy
kubectl apply -f deployments/kubernetes/llm-edge-agent.yaml
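
To verify the rollout, standard kubectl commands work; the deployment name is assumed to be llm-edge-agent, so adjust it to match the manifest:

# Watch the pods come up
kubectl get pods -n llm-edge-production -w

# Check rollout status and smoke-test the health endpoint
kubectl rollout status deployment/llm-edge-agent -n llm-edge-production
kubectl port-forward -n llm-edge-production deploy/llm-edge-agent 8080:8080 &
curl http://localhost:8080/health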

Features:

  • HorizontalPodAutoscaler (3-10 replicas)
  • Rolling updates (zero downtime)
  • Resource limits and requests
  • Liveness and readiness probes

Monitoring

Metrics

The binary exposes Prometheus metrics on the configured METRICS_PORT (default 9090); a command-line scrape example follows the metric list below:

Request Metrics:

  • llm_edge_requests_total - Total request count
  • llm_edge_request_duration_seconds - Request latency histogram
  • llm_edge_request_errors_total - Error count by type

Cache Metrics:

  • llm_edge_cache_hits_total{tier="l1|l2"} - Cache hits
  • llm_edge_cache_misses_total - Cache misses
  • llm_edge_cache_latency_seconds - Cache operation latency

Provider Metrics:

  • llm_edge_provider_latency_seconds - Provider response time
  • llm_edge_provider_errors_total - Provider errors
  • llm_edge_cost_usd_total - Cumulative cost tracking

Token Metrics:

  • llm_edge_tokens_used_total - Token usage by provider/model
  • llm_edge_tokens_prompt_total - Prompt tokens
  • llm_edge_tokens_completion_total - Completion tokens
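
These can be inspected directly from the command line (standard Prometheus text exposition format assumed):

# Filter the cache counters
curl -s http://localhost:9090/metrics | grep '^llm_edge_cache_'

# Request latency histogram
curl -s http://localhost:9090/metrics | grep '^llm_edge_request_duration_seconds'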

Health Checks

Health endpoint response:

{
  "status": "healthy",
  "timestamp": "2025-01-08T12:00:00Z",
  "version": "1.0.0",
  "cache": {
    "l1_healthy": true,
    "l2_healthy": true,
    "l2_configured": true
  },
  "providers": {
    "openai": {
      "configured": true,
      "healthy": true
    },
    "anthropic": {
      "configured": true,
      "healthy": true
    }
  }
}

Performance

Benchmarks:

  • Throughput: 1000+ requests/second
  • Proxy Overhead: <50ms (P95)
  • L1 Cache Hit: <100μs
  • L2 Cache Hit: 1-2ms
  • Memory Usage: <2GB under normal load

Cache Performance:

  • Overall Hit Rate: >70% (see below for a quick way to measure your own)
  • L1 Hit Rate: 60-70% (hot data)
  • L2 Hit Rate: 10-15% (warm data)
  • Cost Savings: 70%+ (cached responses incur no provider cost)
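
As a rough check of your own hit rate, the cache counters can be summed straight from the metrics endpoint; this sketch assumes the standard Prometheus text format shown in the Metrics section:

# Overall hit rate = hits / (hits + misses), summed across tiers
curl -s http://localhost:9090/metrics | awk '
  /^llm_edge_cache_hits_total/   { hits   += $NF }
  /^llm_edge_cache_misses_total/ { misses += $NF }
  END { if (hits + misses > 0) printf "cache hit rate: %.1f%%\n", 100 * hits / (hits + misses) }'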

Documentation

For comprehensive documentation, see the root README:

  • Full architecture guide
  • Testing documentation
  • Infrastructure setup
  • Deployment guides
  • API reference

Development

This crate can be used both as a binary and as a library:

As a Binary:

cargo run --package llm-edge-agent

As a Library:

[dependencies]
llm-edge-agent = "1.0.0"

use llm_edge_agent::{AppConfig, initialize_app_state, handle_chat_completions};

#[tokio::main]
async fn main() {
    let config = AppConfig::from_env();
    let state = initialize_app_state(config).await.unwrap();

    // Use in your own Axum router
    // let app = Router::new()
    //     .route("/v1/chat/completions", post(handle_chat_completions))
    //     .with_state(Arc::new(state));
}

Troubleshooting

Provider not available:

Error: No providers configured

Solution: Set at least one API key (OPENAI_API_KEY or ANTHROPIC_API_KEY)

Redis connection failed:

Warning: L2 cache enabled but connection failed

Solution: Verify that Redis is running and that REDIS_URL is correct. The agent will fall back to L1-only mode.

High latency:

  • Check provider health: curl http://localhost:8080/health
  • Monitor metrics: curl http://localhost:9090/metrics
  • Review logs: Set RUST_LOG=debug

License

Licensed under the Apache License, Version 2.0. See LICENSE for details.

Contributing

See the Contributing Guide in the root repository.

Support