Please check the build logs for more information.
See Builds for ideas on how to fix a failed build, or Metadata for how to configure docs.rs builds.
If you believe this is docs.rs' fault, open an issue.
Dynamo Rust Backend (dynamo-backend-common)
Work in progress. The unified backend supports aggregated and disaggregated (prefill/decode) inference; multimodal, LoRA, logprobs, guided decoding, and engine-level metrics are still on the non-unified path. The Python
Worker(dynamo.common.backend) is a thin shim over this crate.
Looking for a walkthrough? Start with Writing a Rust Unified Backend. This README is the in-tree reference: trait shape, file layout, disaggregation contract, error taxonomy, and the conformance kit.
A two-type abstraction that separates runtime integration (common across all backends) from engine logic (vLLM, SGLang, TRT-LLM, your custom engine, etc.).
Architecture
LLMEngine (trait) <-- engine boundary (engine.rs)
| - start(worker_id) -> Result<EngineConfig, DynamoError>
| - generate(request, ctx) -> Result<BoxStream<...>, DynamoError>
| - abort(ctx) (optional, default no-op)
| - drain() -> Result<(), DynamoError> (optional, default no-op)
| - cleanup() -> Result<(), DynamoError>
|
+-- MockerBackend <-- examples/mocker/src/engine.rs
+-- <your backend> <-- a separate crate
Worker (concrete, non-generic) <-- runtime integration (worker.rs)
- receives WorkerConfig from the per-backend `from_args`
- creates DistributedRuntime
- installs SIGTERM/SIGINT handlers
- calls engine.start(worker_id), registers model with discovery
- serves the generate endpoint with cancellation monitoring
- on shutdown: discovery unregister -> grace period
-> engine.drain() -> engine.cleanup()
-> 3-phase distributed-runtime teardown
run(engine, config) <-- src/run.rs
- Single entry point used by each backend's `main.rs`.
- Non-generic; holds `Arc<dyn LLMEngine>` so PyO3-wrapped engines
plug in through the same path.
from_args is not on the trait — each backend exposes an inherent
constructor that returns (Self, WorkerConfig). This keeps the trait
fully object-safe (Arc<dyn LLMEngine> must work) and lets run stay
non-generic.
generate takes GenerateContext while abort takes
Arc<dyn AsyncEngineContext>. GenerateContext derefs to
AsyncEngineContext, so cancellation calls (ctx.stopped(),
ctx.is_stopped(), ctx.id()) work the same in both. Override
abort only if you need engine-side cancel notification — most
backends rely on the default no-op plus in-stream polling.
Quick Start
Running the mocker example
The mocker is a CPU-only reference engine: it wraps the
dynamo-mocker scheduler in the LLMEngine trait and emits randomized
token IDs at a configurable rate. Useful as a stand-in for AIPerf /
end-to-end pipeline smoke tests without any ML dependencies.
A one-command docker-compose stack (NATS + etcd + frontend + mocker)
lives at
examples/mocker/docker-compose.yml.
Running your own backend
# In your backend crate (which may live in its own repo):
See the walkthrough for
how to set up the crate (Cargo.toml, tokio_unstable cfg flag, toolchain
pin) and write the engine.
Implementing a New Backend
Implement the LLMEngine trait on your engine struct, expose an
inherent from_args, and write a three-line main.rs:
use Arc;
use async_trait;
use GenerateContext;
use ;
use BoxStream;
See examples/mocker/src/engine.rs
for a complete, runnable reference and the
walkthrough for the
step-by-step including Cargo.toml, tokio_unstable cfg, and the
conformance kit.
Disaggregated Serving
The Rust crate supports prefill / decode worker splits. A backend
declares its role via WorkerConfig.disaggregation_mode and branches on
it inside generate:
use ;
let config = WorkerConfig ;
Roles and Worker behavior:
| Mode | Role | Worker effects |
|---|---|---|
Aggregated |
Self-contained inference (default) | Standard registration; KV indexer enabled |
Prefill |
Run prompt → emit 1 token + KV handoff | Registers as ModelType::Prefill; advertises bootstrap_host/port if set in EngineConfig |
Decode |
Resume from a prefill peer's KV | Disables the local indexer (KV is owned by the prefill peer) |
The crate re-exports PrefillResult and BootstrapInfo from
dynamo-llm's protocol types;
these decorate PreprocessedRequest on decode-bound requests. Prefill
terminals carry their handoff payload via the engine's terminal chunk
(e.g. disaggregated_params).
For backends with an internal KV transport (vLLM NixlConnector,
TRT-LLM's transceiver), leave EngineConfig.bootstrap_host/port None
— only SGLang uses the Dynamo-level handshake today.
Request / Response Contract
The trait works with the same PreprocessedRequest / LLMEngineOutput
types used across preprocessing, routing, and the frontend — no
Python-shaped wrappers.
generate returns a BoxStream<'static, Result<LLMEngineOutput, DynamoError>>:
- Exactly one terminal item must be the last item yielded. A
terminal is either:
Ok(chunk)withfinish_reasonset (stop/length/cancelled/error), orErr(DynamoError)carrying a typed mid-stream failure.
- Non-terminal items are
Ok(chunk)withfinish_reasonunset. - No items may follow a terminal.
Terminal chunks come from
LLMEngineOutput::stop() / ::length() / ::cancelled() / ::error(msg),
optionally chained with LLMEngineOutputExt::with_tokens(...) /
with_usage(usage(prompt, completion)). Non-terminal chunks use
chunk::token(id).
In debug builds, the framework wraps the stream in a validator
(src/validate.rs) that panics if a chunk is yielded
after a terminal. Loud failures in dev and test, compiled out in
release.
Cancellation Contract
The framework runs a per-request cancellation monitor that watches
ctx.stopped() / ctx.killed() and calls engine.abort(ctx) when
either fires. Engines also must poll ctx.is_stopped() (or
await ctx.stopped()) between yields and emit a terminal with
FinishReason::Cancelled when they observe it — the conformance kit
treats any other terminal after cancellation as ignoring the signal.
For cleanup that must run on any drop path (TCP reset, consumer
timeout without cancellation), use RAII inside the generate stream
body, not abort — abort only fires on explicit cancel. The mocker's
ActiveRequestGuard is the canonical example.
Error Handling
Errors returned from start, generate, cleanup, and from_args
use ErrorType::Backend(BackendError::X) from
dynamo-runtime. Common variants:
| Variant | When |
|---|---|
InvalidArgument |
Engine or setup rejected the input |
CannotConnect |
Can't reach discovery / a dependency |
EngineShutdown |
Engine failed to start / crashed |
StreamIncomplete |
Stream ended before the engine could finish |
Cancelled |
Request was cancelled |
ResponseTimeout, Disconnected, ConnectionTimeout |
Transport failures |
Unknown |
Uncategorized |
Mid-stream errors have two equivalent terminal forms:
- Typed (preferred): yield
Err(DynamoError)from the stream. Forwarded asAnnotated::errorwith theBackendErrorvariant preserved end-to-end. - String: yield
Ok(LLMEngineOutput::error(msg)). Convenient for pure message-level failures. Loses the typedBackendErrorvariant.
A tiny helper per backend keeps call sites clean — see the
guide's Step 6 for the
invalid_arg pattern.
Conformance Kit
Enable the testing cargo feature to pull in the kit:
[]
= { = true, = ["testing"] }
async
The kit asserts:
| Check | Failure mode |
|---|---|
start() returns a non-empty EngineConfig.model |
EmptyModelInConfig |
Single generate() ends in a terminal chunk |
NoTerminalChunk |
| No chunks after the terminal | ChunkAfterTerminal |
Interleaved generate() calls all succeed |
ConcurrentGenerateFailed |
| Mid-stream cancel terminates within 2s | CancellationNotObserved |
Cancelled stream's terminal is FinishReason::Cancelled |
CancellationIgnored |
cleanup() succeeds twice (idempotent) |
SecondCleanupFailed |
cleanup() on a never-started engine succeeds |
CleanupWithoutStartFailed |
testing::mock_context() and testing::cancelling_context(after) are
available for hand-written tests.
File Index
lib/backend-common/
Cargo.toml
CLAUDE.md # Design notes (rationale, invariants, phase plans)
README.md # This file
src/
lib.rs # Module wiring + public re-exports
engine.rs # LLMEngine trait, EngineConfig, GenerateContext,
# chunk::token, LLMEngineOutputExt setters,
# usage() helper, PreprocessedRequest / Output
# re-exports
worker.rs # Worker concrete + WorkerConfig (incl.
# disaggregation_mode); lifecycle state machine;
# signal handling + graceful shutdown orchestrator
run.rs # `pub fn run(engine, config) -> anyhow::Result<()>`
adapter.rs # EngineAdapter — bridges LLMEngine to AsyncEngine;
# cancellation monitor + debug stream validator
args.rs # CommonArgs — shared CLI flags every engine flattens
disagg.rs # DisaggregationMode enum + clap value-parser
error.rs # Re-exports DynamoError / ErrorType / BackendError
validate.rs # Debug-build stream validator (compiled out in release)
testing.rs # Conformance kit (`testing` feature)
examples/
mocker/ # CPU-only reference backend + docker-compose stack
The Python Worker shim that drives this crate from dynamo.*.unified_main
entry points lives at
components/src/dynamo/common/backend/worker.py.
See Also
- Writing a Rust Unified Backend — step-by-step walkthrough.
CLAUDE.md— design notes (rationale, invariants, Phase 2 PyO3 plans).- Mocker example — reference engine + compose stack.
- Python sibling
—
dynamo.common.backend, the Python ABC layered over this crate. - DEP #8251 — Backend Interface proposal and ongoing status.