LLM Router
A high-performance, Rust-based load balancer and router specifically designed for Large Language Model (LLM) APIs. It intelligently distributes requests across multiple backend LLM API instances based on configurable strategies, health checks, and model capabilities.
Core Concepts
- Router: The central component that manages backend instances and routing logic.
- Instance: Represents a backend LLM API endpoint (e.g., https://api.openai.com/v1). Each instance has an ID, base URL, health status, and associated models.
- ModelInstanceConfig: Defines the specific models (e.g., "gpt-4", "text-embedding-ada-002") and their capabilities (Chat, Embedding, Completion) supported by an instance.
- Routing Strategy: Determines how the router selects the next instance for a request (Round Robin or Load Based).
- Health Checks: Periodically checks the availability of backend instances. Unhealthy instances are temporarily removed from rotation.
- Request Tracking: Automatically manages the count of active requests for each instance when using the LoadBased strategy. The RequestTracker utility simplifies this.
Features
- Multiple Routing Strategies: Load-based or Round Robin distribution.
- Automatic Health Checks: Continuously monitors backend health via configurable endpoints.
- Model Capability Support: Route requests based on the specific model and capability (chat, embedding, completion) required.
- Instance Timeout: Automatically quarantine instances that return errors for a configurable period.
- High Throughput: Efficiently handles thousands of instance selections per second.
- Low Overhead: Adds minimal latency (microseconds) to the instance selection process.
- Dynamic Instance Management: Add or remove backend instances at runtime without service interruption.
- Resilient Error Handling: Gracefully handles backend failures and timeouts.
Installation
Add the dependency to your Cargo.toml:
[dependencies]
llm_router_core = "0.1.0"
# Add other dependencies for your application (e.g., tokio, reqwest, axum)
tokio = { version = "1", features = ["full"] }
reqwest = { version = "0.11", features = ["json"] }
axum = "0.7"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
Quick Start
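A minimal getting-started sketch. Only Router, RoutingStrategy, and the builder methods described later in this README are taken from the crate; Capability and get_instance_for_model are assumed names, so check the crate's API docs for exact types and signatures.

```rust
use std::sync::Arc; // Required for sharing the Router across tasks
use std::time::Duration;

// `Capability` is an assumed type name; `Router` and `RoutingStrategy` are named in this README.
use llm_router_core::{Capability, Router, RoutingStrategy};

#[tokio::main]
async fn main() {
    // Build a router; see "Health Checks Configuration" below for more options.
    let router = Arc::new(
        Router::builder()
            .strategy(RoutingStrategy::RoundRobin)
            .health_check_path("/health")
            .health_check_interval(Duration::from_secs(15))
            .build(),
    );

    // Register backend instances here (see "Managing Backend Instances" below).

    // Pick a healthy instance able to serve a chat request for "gpt-4".
    // `get_instance_for_model` is an assumed method name.
    if let Some(instance) = router.get_instance_for_model("gpt-4", Capability::Chat) {
        println!("Routing request to {}", instance.base_url);
    }
}
```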
Authentication
Most LLM APIs require authentication (e.g., API keys). The llm-router itself doesn't handle authentication headers directly during routing or health checks by default. You need to manage authentication in your application's HTTP client when making the actual API calls after selecting an instance.
If your health checks require authentication, you can provide a pre-configured reqwest::Client to Router::builder.
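A sketch of that setup. The builder method named client(...) is an assumption (this README only says a pre-configured reqwest::Client can be supplied); the default-headers portion is standard reqwest usage.

```rust
use std::time::Duration;

use llm_router_core::Router;
use reqwest::header::{HeaderMap, HeaderValue, AUTHORIZATION};

fn build_router_with_authenticated_health_checks(api_key: &str) -> Router {
    // A reqwest client that sends an Authorization header on every request,
    // including the router's background health checks.
    let mut headers = HeaderMap::new();
    let mut auth = HeaderValue::from_str(&format!("Bearer {api_key}")).expect("valid header value");
    auth.set_sensitive(true);
    headers.insert(AUTHORIZATION, auth);

    let client = reqwest::Client::builder()
        .default_headers(headers)
        .timeout(Duration::from_secs(5))
        .build()
        .expect("reqwest client");

    Router::builder()
        .client(client) // assumed builder method for supplying the health-check client
        .health_check_path("/health")
        .build()
}
```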
Remember: when making the actual API call *after* selecting an instance, you still need to ensure your request includes the necessary authentication. The client passed to the builder is primarily for health checks.
Detailed Usage Guide
Choosing a Routing Strategy
The router supports two strategies, set via Router::builder().strategy(...):
- Round Robin (RoutingStrategy::RoundRobin):
  - Distributes requests sequentially across all healthy, capable instances.
  - Simple, predictable, and often provides the best raw throughput if backends are homogeneous.
  - Default strategy if not specified.
- Load Based (RoutingStrategy::LoadBased):
  - Routes to the healthy, capable instance with the fewest currently active requests.
  - Requires using RequestTracker (recommended) or manually calling increment_request_count and decrement_request_count to be effective.
  - Helps balance load when backend processing times vary significantly or when requests might be long-running.
  - Can lead to more consistent latency across requests.
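Selecting a strategy looks roughly like this (a sketch; only builder() and strategy(...) are documented above, the rest of the chain is elided):

```rust
use llm_router_core::{Router, RoutingStrategy};

fn build_load_based_router() -> Router {
    // Explicitly opt in to load-based routing; omitting .strategy(...) gives Round Robin (the default).
    Router::builder()
        .strategy(RoutingStrategy::LoadBased)
        // ... instance registration, health-check settings, etc. ...
        .build()
}
```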
Managing Backend Instances (Dynamic Updates)
You can modify the router's instance pool after it has been built. This is useful for scaling or maintenance.
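A sketch of runtime updates. The method and constructor names below (add_instance, remove_instance, Instance::new, ModelInstanceConfig::new, Capability) are assumptions beyond the type names used in this README; treat them as placeholders for the crate's actual API.

```rust
use llm_router_core::{Capability, Instance, ModelInstanceConfig, Router};

fn update_backends(router: &Router) {
    // Register a new backend at runtime (constructor and method names assumed).
    let new_backend = Instance::new(
        "azure-eastus-1",                              // instance ID
        "https://my-azure-deployment.example.com/v1",  // base URL
        vec![ModelInstanceConfig::new("gpt-4", vec![Capability::Chat])],
    );
    router.add_instance(new_backend);

    // Later, remove it by ID without interrupting traffic to other instances.
    router.remove_instance("azure-eastus-1");
}
```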
Selecting an Instance
The primary way to get a suitable backend URL is by requesting an instance for a specific model and capability.
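For example (a sketch; get_instance_for_model, the Capability enum, and the Option return type are assumed rather than documented here):

```rust
use std::sync::Arc;

use llm_router_core::{Capability, Router};

async fn route_chat_request(router: Arc<Router>) -> Option<String> {
    // Ask for a healthy instance that serves "gpt-4" chat requests.
    let instance = router.get_instance_for_model("gpt-4", Capability::Chat)?;

    // Build the URL for the downstream call from the instance's base URL.
    Some(format!("{}/chat/completions", instance.base_url))
}
```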
Alternatively, if you don't need a specific model and just want the next instance according to the strategy:
Using RequestTracker (Important for Load Balancing)
When using the LoadBased strategy, the router needs to know how many requests are currently in flight to each instance. The RequestTracker utility handles this automatically using RAII (Resource Acquisition Is Initialization).
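A sketch of the RAII pattern: the guard increments the instance's active-request count when created and decrements it when dropped. The RequestTracker::new signature and get_instance_for_model are assumed names.

```rust
use std::sync::Arc;

use llm_router_core::{Capability, RequestTracker, Router};

async fn call_backend(router: Arc<Router>, body: serde_json::Value) -> reqwest::Result<serde_json::Value> {
    let instance = router
        .get_instance_for_model("gpt-4", Capability::Chat) // assumed method name
        .expect("no healthy instance available");

    // Increment the instance's active-request count; the guard decrements it on drop,
    // even if the request below fails.
    let _tracker = RequestTracker::new(&router, &instance.id); // constructor name assumed

    let client = reqwest::Client::new();
    client
        .post(format!("{}/chat/completions", instance.base_url))
        .json(&body)
        .send()
        .await?
        .json()
        .await
    // `_tracker` is dropped here, decrementing the active-request count.
}
```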
If you don't use RequestTracker, you must manually call router.increment_request_count(&instance.id) before the request and router.decrement_request_count(&instance.id) after the request (including in error cases) for LoadBased routing to function correctly. RequestTracker is strongly recommended.
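The manual equivalent looks roughly like this (increment_request_count and decrement_request_count are the methods named above; the surrounding request code is a sketch):

```rust
use std::sync::Arc;

use llm_router_core::{Instance, Router};

async fn call_with_manual_counting(router: Arc<Router>, instance: &Instance) -> reqwest::Result<String> {
    router.increment_request_count(&instance.id);

    // Make the call, then decrement regardless of the outcome.
    // (A panic or task cancellation would skip the decrement, which is why RequestTracker is recommended.)
    let result = reqwest::Client::new()
        .post(format!("{}/chat/completions", instance.base_url))
        .send()
        .await;

    router.decrement_request_count(&instance.id);

    result?.text().await
}
```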
Handling Errors and Timeouts
If an API call to a selected instance fails, you might want to temporarily mark that instance as unhealthy to prevent routing further requests to it for a while.
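A sketch of that pattern using timeout_instance (the method named in the health-check notes below); the surrounding request code is illustrative only.

```rust
use std::sync::Arc;

use llm_router_core::{Instance, Router};

async fn forward_request(router: Arc<Router>, instance: &Instance) -> Option<String> {
    let response = reqwest::Client::new()
        .post(format!("{}/chat/completions", instance.base_url))
        .send()
        .await;

    match response {
        Ok(resp) if resp.status().is_success() => resp.text().await.ok(),
        _ => {
            // The backend errored or was unreachable: quarantine it so the router
            // stops selecting it until instance_timeout_duration elapses.
            router.timeout_instance(&instance.id);
            None
        }
    }
}
```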
The instance will remain in the Timeout state for the duration specified by instance_timeout_duration in the builder, after which the health checker will attempt to bring it back online.
Health Checks Configuration
Configure health checks using the builder:
use std::time::Duration;

let router = builder
    // ... other configurations ...
    .health_check_path("/health")                        // the endpoint path for the health check (GET <base_url>/health)
    .health_check_interval(Duration::from_secs(15))      // check health every 15 seconds
    .health_check_timeout(Duration::from_secs(5))        // timeout for the health check request itself (5 seconds)
    .instance_timeout_duration(Duration::from_secs(60))  // how long an instance stays in the Timeout state (60 seconds)
    .build();
- If health_check_path is not set, instances are initially considered Healthy and only move to Timeout if timeout_instance is called.
- The health checker sends a GET request to <instance.base_url><health_check_path>. A 2xx status code marks the instance as Healthy; any other status or a timeout marks it as Unhealthy.
- Instances in the Timeout state are not checked until the timeout duration expires.
Axum Web Server Integration Example
Here's how to integrate llm-router into an Axum web server to act as a proxy/gateway to your LLM backends.
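A compact sketch of such a gateway: request/response structures (serde_json::Value stands in for a typed ChatRequest/ChatResponse pair), shared application state, a proxy handler, and a custom error type for the Axum handler. The llm_router_core calls reuse names discussed earlier in this README, several of which are assumptions (get_instance_for_model, RequestTracker::new, Capability, the builder methods), and error handling is deliberately minimal.

```rust
use std::net::SocketAddr;
use std::sync::Arc;
use std::time::Duration;

use axum::{extract::State, http::StatusCode, response::IntoResponse, routing::post, Json};
use llm_router_core::{Capability, RequestTracker, Router, RoutingStrategy};
use serde_json::Value;

// Application state shared by all handlers.
#[derive(Clone)]
struct AppState {
    router: Arc<Router>,
    http: reqwest::Client,
    api_key: String,
}

#[tokio::main]
async fn main() {
    let router = Arc::new(
        Router::builder()
            .strategy(RoutingStrategy::LoadBased)
            .health_check_path("/health")
            .health_check_interval(Duration::from_secs(15))
            .build(),
    );

    let state = AppState {
        router,
        http: reqwest::Client::new(),
        api_key: std::env::var("OPENAI_API_KEY").expect("OPENAI_API_KEY must be set"),
    };

    let app = axum::Router::new()
        .route("/v1/chat/completions", post(chat_completions))
        .with_state(state);

    let addr = SocketAddr::from(([127, 0, 0, 1], 3000));
    let listener = tokio::net::TcpListener::bind(addr).await.unwrap();
    axum::serve(listener, app).await.unwrap();
}

// Proxy handler: pick a backend, forward the request, relay the response.
async fn chat_completions(
    State(state): State<AppState>,
    Json(body): Json<Value>,
) -> Result<Json<Value>, AppError> {
    let model = body["model"].as_str().unwrap_or("gpt-4").to_string();

    let instance = state
        .router
        .get_instance_for_model(&model, Capability::Chat) // assumed method name
        .ok_or(AppError::NoBackend)?;

    // Track the in-flight request so LoadBased routing stays accurate.
    let _tracker = RequestTracker::new(&state.router, &instance.id);

    let response = state
        .http
        .post(format!("{}/chat/completions", instance.base_url))
        .bearer_auth(&state.api_key) // authentication stays the application's responsibility
        .json(&body)
        .send()
        .await;

    match response {
        Ok(resp) if resp.status().is_success() => {
            Ok(Json(resp.json().await.map_err(AppError::Upstream)?))
        }
        _ => {
            // Quarantine the failing backend and report the error.
            state.router.timeout_instance(&instance.id);
            Err(AppError::NoBackend)
        }
    }
}

// Custom error type for the Axum handler.
enum AppError {
    NoBackend,
    Upstream(reqwest::Error),
}

impl IntoResponse for AppError {
    fn into_response(self) -> axum::response::Response {
        let (status, msg) = match self {
            AppError::NoBackend => (StatusCode::SERVICE_UNAVAILABLE, "no healthy backend".to_string()),
            AppError::Upstream(e) => (StatusCode::BAD_GATEWAY, e.to_string()),
        };
        (status, msg).into_response()
    }
}
```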
To run this Axum example:
1. Make sure you have the dependencies in your Cargo.toml (axum, tokio, reqwest, serde, serde_json, llm_router_core).
2. Save the code as a Rust file (e.g., src/main.rs).
3. Set the necessary environment variables (like OPENAI_API_KEY).
4. Run the application: cargo run
5. Send a POST request (e.g., using curl or Postman) to http://127.0.0.1:3000/v1/chat/completions with a JSON body matching the ChatRequest structure.
Benchmarking
The crate includes benchmarks to measure the performance of the instance selection logic.
1. Run the Benchmarks:
Execute the standard Cargo benchmark command. This will run the selection logic repeatedly for different numbers of instances and routing strategies.
This command compiles the code in release mode with benchmarking enabled and runs the functions annotated with #[bench]. The output will show time-per-iteration results, but it's easier to analyze with the reporter.
2. Generate the Report:
After running cargo bench, the raw results are typically stored in target/criterion/. A helper binary, bench_reporter (located in src/bin/bench_reporter.rs), is provided to parse these results and generate a user-friendly report and a plot. It will:
- Parse the benchmark results generated by cargo bench.
- Print a summary table to the console showing the median selection time (in microseconds or nanoseconds) for each strategy and instance count combination.
- Generate simple text-based plots in the console.
- Save a graphical plot comparing the scaling of the RoundRobin and LoadBased strategies to a file named benchmark_scaling.png in the project's root directory.
Example Output:
+------------+---------------+---------------------------+
| Strategy | Instances (N) | Median Time per Selection |
+------------+---------------+---------------------------+
| RoundRobin | 10 | 1.75 µs |
| LoadBased | 10 | 1.75 µs |
... (more rows) ...
| RoundRobin | 100 | 15.15 µs |
| LoadBased | 100 | 14.65 µs |
+------------+---------------+---------------------------+
Time (Median)
N=10 | 1.75 µs |
...
N=100 | 15.15 µs |
+------------------------------------------+
Instances (N) -->
... (LoadBased Plot) ...
Found result for RoundRobin/10: 1745.87 ns
Found result for RoundRobin/25: 3960.23 ns
... (more results) ...
Found result for LoadBased/100: 14648.41 ns
The benchmark_scaling.png file provides a visual comparison of how the selection time increases as the number of backend instances grows for both routing strategies, and helps illustrate the minimal overhead added by the router.
Performance Considerations
- Selection Overhead: Benchmarks show that the core instance selection logic is extremely fast, typically taking only a few microseconds even with a hundred instances. This overhead is negligible compared to the network latency and processing time of actual LLM API calls.
- Throughput: The router itself is not typically the bottleneck. Throughput is limited by the capacity of your backend LLM instances and network conditions.
- RoundRobin vs. LoadBased: RoundRobin has slightly lower overhead as it doesn't need to check active request counts. LoadBased provides better load distribution if backend performance varies, potentially leading to more consistent end-to-end latency, at the cost of slightly higher selection overhead (though still in microseconds).
- Health Checks: Health checks run in the background and do not block request routing. Ensure your health check endpoint is lightweight. Frequent or slow health checks can consume resources.
License
MIT