Crate atomr_infer_runtime_vllm


§inference-runtime-vllm

vLLM (Python) runtime — canonical local-LLM backend. Doc §2.1, §10.3.

§Feature flags

  • vllm — pulls in PyO3 + the AsyncLLMEngine bridge. Without this feature the runner compiles to a typed-error stub, so a consumer building with cargo build --features remote-only never pulls in pyo3 / vllm / cudarc (see the sketch after this list).
  • gemma-default — adds the env probe + HuggingFace cache resolver + optional hf-hub pre-download path, so an operator can auto-provision a Gemma 4 deployment when the host has a workable GPU + Python + vLLM + HF token. See inference::defaults::gemma for the rollup-side adapter.
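The vllm gate behaves like ordinary cfg-based feature selection. The sketch below is illustrative only: the local_backend_compiled_in helper is hypothetical and not part of the crate's actual internals; it just shows the compile-time branch a consumer crate ends up with.

```rust
// Illustrative sketch (hypothetical helper, not the crate's real API):
// how a build branches on the `vllm` feature at compile time.
#[cfg(feature = "vllm")]
fn local_backend_compiled_in() -> bool {
    true // PyO3 + AsyncLLMEngine bridge is linked into this build
}

#[cfg(not(feature = "vllm"))]
fn local_backend_compiled_in() -> bool {
    false // runner is the typed-error stub; no pyo3 / vllm / cudarc in the build
}

fn main() {
    if local_backend_compiled_in() {
        println!("local vLLM runtime available");
    } else {
        println!("remote-only build: local calls return a typed error");
    }
}
```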

§Lifecycle

VllmRunner::new is cheap and synchronous — it stores the config. The Python AsyncLLMEngine is built lazily on the first ModelRunner::execute call, so a runner can be instantiated on hosts without a GPU (handy for config-layer tests).
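A minimal usage sketch of that lifecycle. The shapes assumed here are not verified against the crate: VllmConfig implementing Default, execute taking an owned prompt string and returning a String, and anyhow being available for error handling. The real signatures live on the ModelRunner trait.

```rust
use atomr_infer_runtime_vllm::{VllmConfig, VllmRunner};

// Hypothetical async caller; execute()'s argument and return types are
// assumptions, consult the ModelRunner trait for the actual signature.
async fn generate(prompt: &str) -> anyhow::Result<String> {
    // Cheap and synchronous: only stores the config; no GPU or Python yet.
    let runner = VllmRunner::new(VllmConfig::default());

    // The first execute() call lazily boots the Python AsyncLLMEngine.
    let output = runner.execute(prompt.to_owned()).await?;
    Ok(output)
}
```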

Structs§

VllmConfig
vLLM engine configuration. Pass-through for the Python builder arguments (AsyncEngineArgs); the perf knobs at the bottom map 1:1 to vLLM’s own settings of the same name.
VllmRunner
vLLM runner. Constructs in O(1); the engine boots lazily on the first call to ModelRunner::execute.