atomr-infer-runtime-vllm
vLLM (Python) runtime — the canonical local LLM backend per architecture v4 §2.1.
Build profiles
| Build | Result |
|---|---|
cargo build -p atomr-infer-runtime-vllm (default) |
Stub — runner returns InferenceError::Internal("vllm feature disabled at build time"). Useful when Cargo.toml references the crate via the rollup but the host has no Python venv. |
cargo build -p atomr-infer-runtime-vllm --features vllm |
Pulls atomr-infer-python-bridge/python (PyO3 + tokio + parking_lot) and wires the runner over vLLM's EngineCore via the PythonGpuBridge. Phase-2a wiring lands as the surface stabilises upstream. |
Why a stub by default
vllm requires a working Python environment with the vllm package
installed at runtime. Linking against PyO3 at build time pulls
substantial deps and a python3 development install; we don't want
the workspace's default cargo build --workspace to fail on a host
without those. The stub keeps the trait satisfied and the rollup
buildable; flipping the feature swaps in the real path.
Configuration
use VllmConfig;
let cfg = VllmConfig ;
GIL story
vLLM is GIL-pinned. VllmRunner::gil_pinned() returns true, which
the placement actor reads to ensure two GIL-pinned deployments don't
share an interpreter. Each vLLM deployment gets its own
PythonGpuBridge instance (one interpreter per deployment by
default).
Roadmap
Per the architecture doc's §13 Phase 2a, the vLLM runner is wired
through atomr-infer-python-bridge once that bridge gets re-exported
from atomr-accel (planned F4). The current local PythonGpuBridge
implementation is a drop-in placeholder; the lift is mechanical.