§inference-runtime-vllm
vLLM (Python) runtime — canonical local-LLM backend. Doc §2.1, §10.3.
§Feature flags
- vllm - pull in PyO3 + the AsyncLLMEngine bridge. Without this feature the runner compiles to a typed-error stub, so a cargo build --features remote-only consumer never pulls pyo3 / vllm / cudarc (see the sketch after this list).
- gemma-default - adds the env probe + HuggingFace cache resolver + optional hf-hub pre-download path so an operator can auto-provision a Gemma 4 deployment when the host has a workable GPU + Python + vLLM + HF token. See inference::defaults::gemma for the rollup-side adapter.
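A minimal sketch of what the vllm feature gate implies for consumers, assuming the stub surfaces a typed error at call time; the function, module layout, and error type below are illustrative, not the crate's actual items. Both features are enabled with cargo build --features vllm,gemma-default.

```rust
// Illustrative only: the crate's real layout and error type may differ.
// With `--features vllm` the PyO3-backed path compiles in; without it, callers
// get a typed error instead of a link failure or a pyo3 / vllm / cudarc dependency.
#[derive(Debug)]
pub enum RunnerError {
    BackendDisabled(&'static str),
}

#[cfg(feature = "vllm")]
pub async fn generate(prompt: &str) -> Result<String, RunnerError> {
    // Real path: bridge into Python's AsyncLLMEngine through PyO3.
    unimplemented!("pyo3 bridge elided in this sketch: {prompt}")
}

#[cfg(not(feature = "vllm"))]
pub async fn generate(_prompt: &str) -> Result<String, RunnerError> {
    // Stub path: builds without pyo3 / vllm / cudarc and fails at call time.
    Err(RunnerError::BackendDisabled("built without the `vllm` feature"))
}
```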
§Lifecycle
VllmRunner::new is cheap and synchronous — it stores the config.
The Python AsyncLLMEngine is built lazily on the first
ModelRunner::execute call, so a runner can be instantiated
on hosts without a GPU (handy for config-layer tests).
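A hedged usage sketch of that lifecycle; the crate path, the ModelRunner import, the Default impl on VllmConfig, and the request/response types are assumptions, not the crate's verified API:

```rust
use inference_runtime_vllm::{ModelRunner, VllmConfig, VllmRunner}; // paths assumed

async fn smoke_test() -> Result<(), Box<dyn std::error::Error>> {
    // Cheap and synchronous: only stores the config; fine on a GPU-less host.
    let runner = VllmRunner::new(VllmConfig::default()); // assumes a Default impl

    // The first execute() lazily boots the Python AsyncLLMEngine
    // (needs GPU + Python + vLLM); later calls reuse the same engine.
    let output = runner.execute("Hello, vLLM".into()).await?;
    println!("{output}");
    Ok(())
}
```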
Structs§
- VllmConfig - vLLM engine configuration. Pass-through for the Python builder arguments (AsyncEngineArgs); the perf knobs at the bottom map 1:1 to vLLM’s own settings of the same name.
- VllmRunner - vLLM runner. Constructs in O(1); the engine boots lazily on the first call to ModelRunner::execute.
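A construction sketch for VllmConfig; the field names below simply mirror vLLM's AsyncEngineArgs settings of the same name and are assumptions about the Rust struct, as are the crate path and the Default impl:

```rust
use inference_runtime_vllm::VllmConfig; // path assumed

fn example_config() -> VllmConfig {
    // Field names are illustrative; check the generated VllmConfig docs for the
    // real definition before copying this.
    VllmConfig {
        model: "google/gemma-3-4b-it".into(), // any resolvable HF model id
        tensor_parallel_size: 1,              // GPUs to shard the model across
        gpu_memory_utilization: 0.90,         // fraction of VRAM vLLM may claim
        max_model_len: Some(8192),            // context-length cap
        ..VllmConfig::default()               // remaining knobs left at their defaults
    }
}
```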