§inference-runtime-vllm
vLLM (Python) runtime — canonical local-LLM backend. Doc §2.1, §10.3.
§Feature flags
- vllm - pull in PyO3 + the AsyncLLMEngine bridge. Without this feature the runner compiles to a typed-error stub, so a cargo build --features remote-only consumer never pulls pyo3 / vllm / cudarc (see the sketch after this list).
- gemma-default - adds the env probe + HuggingFace cache resolver + optional hf-hub pre-download path so an operator can auto-provision a Gemma 4 deployment when the host has a workable GPU + Python + vLLM + HF token. See inference::defaults::gemma for the rollup-side adapter.
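A minimal sketch of what the vllm feature gate implies for consumers, assuming the stub surfaces a typed error at call time; the function, module layout, and error type below are illustrative, not the crate's actual items. Both features are enabled with cargo build --features vllm,gemma-default.

```rust
// Illustrative only: the crate's real layout and error type may differ.
// With `--features vllm` the PyO3-backed path compiles in; without it, callers
// get a typed error instead of a link failure or a pyo3 / vllm / cudarc dependency.
#[derive(Debug)]
pub enum RunnerError {
    BackendDisabled(&'static str),
}

#[cfg(feature = "vllm")]
pub async fn generate(prompt: &str) -> Result<String, RunnerError> {
    // Real path: bridge into Python's AsyncLLMEngine through PyO3.
    unimplemented!("pyo3 bridge elided in this sketch: {prompt}")
}

#[cfg(not(feature = "vllm"))]
pub async fn generate(_prompt: &str) -> Result<String, RunnerError> {
    // Stub path: builds without pyo3 / vllm / cudarc and fails at call time.
    Err(RunnerError::BackendDisabled("built without the `vllm` feature"))
}
```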
§Lifecycle
VllmRunner::new is cheap and synchronous — it stores the config.
The Python AsyncLLMEngine is built lazily on the first
ModelRunner::execute call, so a runner can be instantiated
on hosts without a GPU (handy for config-layer tests).
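A hedged usage sketch of that lifecycle; the crate path, the ModelRunner import, the Default impl on VllmConfig, and the request/response types are assumptions, not the crate's verified API:

```rust
use inference_runtime_vllm::{ModelRunner, VllmConfig, VllmRunner}; // paths assumed

async fn smoke_test() -> Result<(), Box<dyn std::error::Error>> {
    // Cheap and synchronous: only stores the config; fine on a GPU-less host.
    let runner = VllmRunner::new(VllmConfig::default()); // assumes a Default impl

    // The first execute() lazily boots the Python AsyncLLMEngine
    // (needs GPU + Python + vLLM); later calls reuse the same engine.
    let output = runner.execute("Hello, vLLM".into()).await?;
    println!("{output}");
    Ok(())
}
```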
Structs§
- VllmConfig - vLLM engine configuration. Pass-through for the Python builder arguments (AsyncEngineArgs); the perf knobs at the bottom map 1:1 to vLLM’s own settings of the same name.
- VllmRunner - vLLM runner. Constructs in O(1); the engine boots lazily on the first call to ModelRunner::execute.
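A construction sketch for VllmConfig; the field names below simply mirror vLLM's AsyncEngineArgs settings of the same name and are assumptions about the Rust struct, as are the crate path and the Default impl:

```rust
use inference_runtime_vllm::VllmConfig; // path assumed

fn example_config() -> VllmConfig {
    // Field names are illustrative; check the generated VllmConfig docs for the
    // real definition before copying this.
    VllmConfig {
        model: "google/gemma-3-4b-it".into(), // any resolvable HF model id
        tensor_parallel_size: 1,              // GPUs to shard the model across
        gpu_memory_utilization: 0.90,         // fraction of VRAM vLLM may claim
        max_model_len: Some(8192),            // context-length cap
        ..VllmConfig::default()               // remaining knobs left at their defaults
    }
}
```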