Request queue types for the continuous-batching inference worker.
Instead of each HTTP handler holding the engine mutex directly, every
handler constructs a BatchRequest and sends it through a
tokio::sync::mpsc::Sender. A single background worker receives these
requests one at a time and drives the InferenceEngine, eliminating
mutex contention across concurrent requests.
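The pattern described above can be sketched as follows. This is a minimal illustration, not the crate's actual code: it swaps in std::sync::mpsc and an OS thread for tokio::sync::mpsc and an async task so that it runs without dependencies. The Generate variant, its fields, the generate method, and the spawn_worker and submit helpers are all hypothetical names introduced for the example.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for the real engine; its actual API is not shown in this module.
struct InferenceEngine;

impl InferenceEngine {
    fn generate(&mut self, prompt: &str) -> String {
        format!("echo: {prompt}")
    }
}

// Hypothetical request shape; the real BatchRequest variants and fields may differ.
enum BatchRequest {
    Generate {
        prompt: String,
        // Each request carries its own reply channel back to the handler.
        reply: mpsc::Sender<String>,
    },
}

// Moves the engine into a single worker thread and returns the queue handle.
// Only the worker ever touches the engine, so no mutex is needed.
fn spawn_worker(mut engine: InferenceEngine) -> mpsc::Sender<BatchRequest> {
    let (tx, rx) = mpsc::channel::<BatchRequest>();
    thread::spawn(move || {
        // The loop ends once every Sender clone has been dropped.
        for request in rx {
            match request {
                BatchRequest::Generate { prompt, reply } => {
                    // Ignore the error if the caller stopped waiting.
                    let _ = reply.send(engine.generate(&prompt));
                }
            }
        }
    });
    tx
}

// What a handler would do: build a request, enqueue it, wait for the reply.
fn submit(queue: &mpsc::Sender<BatchRequest>, prompt: &str) -> Option<String> {
    let (reply, response) = mpsc::channel();
    queue
        .send(BatchRequest::Generate { prompt: prompt.to_string(), reply })
        .ok()?;
    response.recv().ok()
}
```

Each handler would hold a clone of the returned Sender. In the tokio version the reply channel would more likely be a oneshot channel and the handler would await it instead of blocking.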
Structs§
- ModelMeta - Metadata about the loaded model, cached at startup so route handlers do not need to hold a reference to the (now moved) engine.
- UsageStats - Token usage statistics for a generation request.
Enums§
- BatchRequest - A single inference request dispatched to the worker task.
Type Aliases§
- LoraSelection - LoRA adapter selection for a single request.
- StreamCallback - Callback invoked for each generated token during streaming.
- VocabBytes - Vocabulary byte table: maps token ID to its UTF-8 byte sequence.
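As a rough illustration of how a byte table and a streaming callback might work together, the sketch below decodes token IDs into text. The alias definitions are assumptions made for the example, since this page does not show the real ones, and stream_decode is a hypothetical helper. It buffers bytes because a single token can end in the middle of a multi-byte UTF-8 character, so it calls the callback once per decodable chunk rather than strictly once per token.

```rust
// Assumed alias shapes; the crate's real definitions may differ.
type VocabBytes = Vec<Vec<u8>>;
type StreamCallback = Box<dyn FnMut(&str) + Send>;

// Hypothetical helper: looks up each token's bytes and emits text as soon as
// the buffered bytes form valid UTF-8.
fn stream_decode(vocab: &VocabBytes, token_ids: &[u32], on_text: &mut StreamCallback) {
    let mut pending: Vec<u8> = Vec::new();
    for &id in token_ids {
        // Skip IDs that are outside the table.
        let Some(bytes) = vocab.get(id as usize) else { continue };
        pending.extend_from_slice(bytes);
        match std::str::from_utf8(&pending) {
            Ok(text) => {
                on_text(text);
                pending.clear();
            }
            // error_len() == None means the input ended mid-character:
            // keep the bytes and wait for the next token.
            Err(e) if e.error_len().is_none() => {}
            // Genuinely invalid bytes: emit a lossy rendering and move on.
            Err(_) => {
                let text = String::from_utf8_lossy(&pending).into_owned();
                on_text(text.as_str());
                pending.clear();
            }
        }
    }
}
```

Bytes still buffered when the token list ends are dropped in this sketch; a real implementation would need to decide whether to flush them.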