Inference Lab
LLM inference simulator for analyzing serving systems. Simulates GPU clusters serving LLM inference workloads with realistic performance modeling.
Features
- Accurate Performance Modeling: Models compute (FLOPS) and memory bandwidth constraints
- Multiple Scheduling Policies: FCFS, Priority, SJF, and more
- Chunked Prefill: Simulates realistic request interleaving
- KV Cache Management: Models GPU memory and KV cache utilization
- Workload Generation: Supports Poisson, Gamma, and closed-loop patterns
- WebAssembly Support: Run simulations in the browser via WASM
- CLI Tool: Standalone binary for command-line usage
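To make the Poisson pattern under Workload Generation concrete, here is a minimal sketch of generating arrival times with exponentially distributed gaps. The names `SplitMix64` and `poisson_arrivals` are illustrative and not part of inference-lab's API. The hand-rolled PRNG is only there to keep the example free of external crates.

```rust
/// Tiny deterministic PRNG (SplitMix64) so the sketch needs no external crates.
/// Illustrative only; not taken from inference-lab.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform sample in the half-open interval [0, 1).
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Arrival times (seconds) of `n` requests from a Poisson process with
/// `rate` requests per second. Gaps between arrivals are exponential.
fn poisson_arrivals(rate: f64, n: usize, seed: u64) -> Vec<f64> {
    let mut rng = SplitMix64(seed);
    let mut t = 0.0;
    (0..n)
        .map(|_| {
            // Inverse-transform sampling; 1 - u lies in (0, 1], so ln is finite.
            let gap = -(1.0 - rng.next_f64()).ln() / rate;
            t += gap;
            t
        })
        .collect()
}
```

Calling `poisson_arrivals(10.0, 1000, 42)` yields 1000 increasing timestamps averaging about ten arrivals per second. A closed-loop workload would instead issue each user's next request only after the previous one completes.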
How does it work?
inference-lab uses discrete-event simulation to model a multi-GPU node
serving LLM inference requests with the vLLM library. It contains a
facsimile of vLLM's queueing, scheduling, and execution logic; only the
model inference itself is replaced, by a performance model derived from
the supplied GPU specs and model architecture.
Within each simulation step, the simulator:
- Processes any newly arrived requests, adding them to the scheduling queue.
- Schedules requests to serve based on the selected scheduling policy.
- Calculates the compute and memory bandwidth usage for the workload that the scheduled requests represent, and the theoretical time required to execute the workload on the specified hardware.
- Increments the simulation time by the calculated execution time, updating the state of all requests accordingly.
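The four steps above can be sketched as a small loop. This is a simplified illustration, not the simulator's code: `Request`, `simulate`, and `token_budget` are hypothetical names, scheduling is plain FCFS, and prefill and decode tokens are treated alike. The `step_time` closure stands in for the performance model.

```rust
use std::collections::VecDeque;

/// Hypothetical, simplified request: arrival time plus tokens still to process.
#[derive(Debug, Clone)]
struct Request {
    arrival: f64,
    remaining_tokens: u32,
    finished_at: Option<f64>,
}

/// Runs the step loop until every request has finished.
/// `incoming` must be sorted by arrival time; `step_time` maps the number of
/// tokens in a batch to the seconds needed to execute it.
fn simulate(
    mut incoming: VecDeque<Request>,
    token_budget: u32,
    step_time: impl Fn(u32) -> f64,
) -> Vec<Request> {
    assert!(token_budget > 0, "a zero budget would never make progress");
    let mut now = 0.0_f64;
    let mut queue: VecDeque<Request> = VecDeque::new();
    let mut done = Vec::new();

    while !incoming.is_empty() || !queue.is_empty() {
        // Idle system: jump straight to the next arrival.
        if queue.is_empty() {
            if let Some(next) = incoming.front() {
                now = now.max(next.arrival);
            }
        }
        // 1. Admit requests that have arrived by `now`.
        while incoming.front().map_or(false, |r| r.arrival <= now) {
            queue.push_back(incoming.pop_front().unwrap());
        }
        // 2. Schedule in FCFS order until the token budget is used up.
        let mut budget = token_budget;
        let mut grants = Vec::new();
        for req in queue.iter() {
            if budget == 0 {
                break;
            }
            let grant = req.remaining_tokens.min(budget);
            grants.push(grant);
            budget -= grant;
        }
        // 3. Ask the performance model how long this batch takes.
        let batch_tokens: u32 = grants.iter().sum();
        // 4. Advance the clock and update request state.
        now += step_time(batch_tokens);
        for (req, grant) in queue.iter_mut().zip(&grants) {
            req.remaining_tokens -= *grant;
        }
        while queue.front().map_or(false, |r| r.remaining_tokens == 0) {
            let mut finished = queue.pop_front().unwrap();
            finished.finished_at = Some(now);
            done.push(finished);
        }
    }
    done
}
```

Because the clock advances by the computed execution time rather than by a fixed tick, an idle system costs nothing to simulate: the loop skips directly to the next arrival.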
Caveats:
- The simulator assumes perfectly optimized GPU execution and ignores kernel launch overheads, poorly optimized kernels, application overhead, thermal throttling, etc.
- We simulate tensor parallel execution, but don't model multi-GPU communication overheads.
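Under these caveats, a step's execution time reduces to a roofline-style estimate: whichever of compute or memory traffic takes longer, with work divided evenly across the tensor-parallel GPUs and no communication cost. The sketch below assumes that form. The `Gpu` struct, its field names, and `ideal_step_time` are hypothetical, and the simulator's actual formula may differ.

```rust
/// Hypothetical hardware description; field names are illustrative.
struct Gpu {
    /// Peak compute throughput per GPU, in FLOP/s.
    peak_flops: f64,
    /// Peak memory bandwidth per GPU, in bytes/s.
    mem_bandwidth: f64,
}

/// Idealised step time in seconds: the slower of compute and memory traffic,
/// split evenly across `tp` GPUs with zero communication overhead.
fn ideal_step_time(gpu: &Gpu, tp: u32, flops: f64, bytes: f64) -> f64 {
    let tp = tp as f64;
    let compute_time = flops / (gpu.peak_flops * tp);
    let memory_time = bytes / (gpu.mem_bandwidth * tp);
    compute_time.max(memory_time)
}
```

A batch whose `memory_time` dominates is memory-bandwidth bound, which is typical of decode steps with small batches. Doubling `tp` halves the estimate exactly, which is why real multi-GPU results will be slower than this model predicts.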
Installation
As a Rust Library
As an npm Package (WASM)
CLI Tool
Usage
CLI
Note: The CLI tool is only available if you install it using cargo install inference-lab (see above).
# Run with default configuration
# Example output shows TTFT, E2E latency, throughput, and utilization metrics
Rust Library
use Simulator;
use SimulationConfig;
let config = from_file?;
let mut simulator = new;
let results = simulator.run;
println!;
println!;
println!;
WebAssembly
import init from '@doubleword/inference-lab';
await ;
const config = ;
const results = ;
console.log;
console.log;
Configuration
Configuration files use TOML format and specify:
- Hardware: GPU specs (FLOPS, bandwidth, VRAM)
- Model: LLM architecture (parameters, layers, heads)
- Scheduler: Policies, max tokens, chunked prefill settings
- Workload: Request arrival patterns and distributions
Example configurations are in the configs/ directory:
- config.toml: Default H100 + Llama-3-70B setup
- test_blog.toml: Closed-loop benchmark (64 users)
- qwen3_30b_a3b.toml: Qwen model configuration
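As a rough illustration of how the hardware and model settings interact, the sketch below estimates how many tokens of KV cache fit in GPU memory once the model weights are loaded. It uses the standard per-token KV size for a transformer (a key and a value vector per layer per KV head). The function and parameter names are hypothetical, and whether inference-lab computes capacity exactly this way is an assumption.

```rust
/// Bytes of KV cache one token occupies: a key and a value vector
/// for every layer and every KV head.
fn kv_bytes_per_token(layers: u64, kv_heads: u64, head_dim: u64, bytes_per_elem: u64) -> u64 {
    2 * layers * kv_heads * head_dim * bytes_per_elem
}

/// Number of tokens whose KV cache fits in the VRAM left over after weights.
/// Returns 0 if the weights alone exceed the available VRAM.
fn kv_capacity_tokens(vram_bytes: u64, weight_bytes: u64, per_token: u64) -> u64 {
    vram_bytes.saturating_sub(weight_bytes) / per_token
}
```

For a Llama-3-70B-shaped model (80 layers, 8 KV heads, head dimension 128, 2-byte elements), `kv_bytes_per_token(80, 8, 128, 2)` gives 327,680 bytes per token. The scheduler's maximum token settings therefore trade off directly against how many concurrent requests the cache can hold.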
Building
Native Binary
WASM Package
# Outputs to pkg/ directory
Publishing
# Publish to npm (requires authentication)
# Publish Rust crate
Project Structure
inference-lab/
├── src/
│ ├── simulation/ # Core simulator logic
│ ├── scheduler/ # Scheduling policies (FCFS, Priority, SJF)
│ ├── compute/ # Performance calculations
│ ├── kv_cache/ # KV cache management
│ ├── request/ # Request generation and tracking
│ ├── metrics/ # Performance metrics collection
│ ├── config/ # Configuration structures
│ ├── lib.rs # Library root
│ ├── main.rs # CLI entry point
│ └── wasm.rs # WebAssembly bindings
├── configs/ # Example configurations
├── Cargo.toml # Rust package manifest
└── package.json # npm package manifest
Metrics
The simulator tracks:
- TTFT (Time to First Token): Prefill latency
- E2E (End-to-End): Total request latency
- TPOT (Time Per Output Token): Decode latency per token
- Throughput: Tokens generated per second
- Utilization: Compute and memory bandwidth usage
- KV Cache: Memory utilization over time
Results include percentiles (p50, p90, p95, p99) and means.
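The sketch below shows one common way to derive these latency metrics from per-request timestamps and to summarise them. The `Timeline` struct and function names are hypothetical. TPOT is taken here as decode time divided by the number of tokens after the first, and percentiles use the nearest-rank method. The simulator's own definitions may differ in detail.

```rust
/// Hypothetical per-request timestamps, in seconds.
struct Timeline {
    arrival: f64,
    first_token: f64,
    finished: f64,
    output_tokens: u32,
}

/// Time to first token: from arrival until the first output token.
fn ttft(t: &Timeline) -> f64 {
    t.first_token - t.arrival
}

/// End-to-end latency: from arrival until the last output token.
fn e2e(t: &Timeline) -> f64 {
    t.finished - t.arrival
}

/// Time per output token over the decode phase; undefined for a single token.
fn tpot(t: &Timeline) -> Option<f64> {
    if t.output_tokens > 1 {
        Some((t.finished - t.first_token) / (t.output_tokens - 1) as f64)
    } else {
        None
    }
}

/// Nearest-rank percentile for `p` in (0, 100]; `None` for empty input.
fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    // Rank is 1-based: the smallest index covering at least p% of samples.
    let rank = (p * n as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.clamp(1, n) - 1])
}

/// Arithmetic mean; `None` for empty input.
fn mean(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        None
    } else {
        Some(samples.iter().sum::<f64>() / samples.len() as f64)
    }
}
```

Tail percentiles such as p99 are usually more informative than the mean for serving workloads, since a small fraction of slow requests can dominate user-perceived latency.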
License
MIT