Skip to main content

Module queue

Module queue 

Source
Expand description

Request queue types for the continuous-batching inference worker.

Instead of each HTTP handler holding the engine mutex directly, every handler constructs a BatchRequest and sends it through a tokio::sync::mpsc::Sender. A single background worker receives these requests one at a time and drives the InferenceEngine, eliminating mutex contention across concurrent requests.

Structs§

ModelMeta
Metadata about the loaded model, cached at startup so route handlers do not need to hold a reference to the (now moved) engine.
UsageStats
Token usage statistics for a generation request.

Enums§

BatchRequest
A single inference request dispatched to the worker task.

Type Aliases§

LoraSelection
LoRA adapter selection for a single request.
StreamCallback
Callback invoked for each generated token during streaming.
VocabBytes
Vocabulary byte table: maps token ID to its UTF-8 byte sequence.