pub struct UnifiedBatchItem {
pub seq_id: String,
pub q_tokens: Vec<u32>,
pub kv_cache: Arc<dyn KvCacheHandle>,
pub pos_offset: usize,
pub is_final_chunk: bool,
}
One sequence’s contribution to a unified mixed-batch forward pass.
A unified batch lets a single model forward pass process a mix of
per-sequence work units: a prefill chunk (q_tokens.len() ≥ 1, possibly
continuing from pos_offset > 0 for chunked prefill) and a decode step
(q_tokens.len() == 1, pos_offset = current cache length) coexist in
the same call. The model layer concatenates all q_tokens into one
[M_total, hidden] tensor and runs all GEMMs / norms once; only the
attention kernel sees per-item segmentation.
This is the abstraction that enables vLLM-style chunked prefill, where decode tokens for already-running sequences are produced in the same iteration as a prefill chunk for a newly arriving sequence.
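The segmentation described above can be sketched as follows. This is a minimal illustration, not the crate's implementation: the empty `KvCacheHandle` trait, `StubCache`, `make_item`, and `segment_offsets` are hypothetical stand-ins introduced here, since the real trait and the model layer's internals are not shown on this page. It computes the row boundaries each item would occupy in the concatenated `[M_total, hidden]` tensor, which is the per-item information an attention kernel would need.

```rust
use std::sync::Arc;

// Hypothetical stand-in: the real `KvCacheHandle` trait is not shown here.
pub trait KvCacheHandle {}

// Hypothetical no-op cache used only to build example items.
pub struct StubCache;
impl KvCacheHandle for StubCache {}

// Mirrors the struct documented on this page.
#[derive(Clone)]
pub struct UnifiedBatchItem {
    pub seq_id: String,
    pub q_tokens: Vec<u32>,
    pub kv_cache: Arc<dyn KvCacheHandle>,
    pub pos_offset: usize,
    pub is_final_chunk: bool,
}

// Hypothetical helper that builds an item backed by the stub cache.
pub fn make_item(
    seq_id: &str,
    q_tokens: Vec<u32>,
    pos_offset: usize,
    is_final_chunk: bool,
) -> UnifiedBatchItem {
    UnifiedBatchItem {
        seq_id: seq_id.to_string(),
        q_tokens,
        kv_cache: Arc::new(StubCache),
        pos_offset,
        is_final_chunk,
    }
}

// Returns cumulative row offsets into the concatenated token tensor.
// Item `i` occupies rows `offsets[i]..offsets[i + 1]`, and the last
// entry equals M_total.
pub fn segment_offsets(items: &[UnifiedBatchItem]) -> Vec<usize> {
    let mut offsets = Vec::with_capacity(items.len() + 1);
    let mut total = 0;
    offsets.push(total);
    for item in items {
        total += item.q_tokens.len();
        offsets.push(total);
    }
    offsets
}
```

For example, a 4-token prefill chunk for a new sequence followed by a 1-token decode step for a running sequence would occupy rows `0..4` and `4..5`, giving an `M_total` of 5.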
Fields
seq_id: String
Identifier matching the sequence’s KV cache (model-side keying).
q_tokens: Vec<u32>
Tokens to process in this iteration. For decode this is exactly 1 token; for prefill (chunked or whole) this is the chunk’s tokens.
kv_cache: Arc<dyn KvCacheHandle>
KV cache handle for this sequence.
pos_offset: usize
Starting absolute position of the first token in q_tokens: 0 for a fresh prefill, kv_len for a decode step or a continuing chunked-prefill slice.
is_final_chunk: bool
True iff this item completes the request’s prefill or is a decode item, i.e. logits at the last token of q_tokens should be returned for sampling. Intermediate prefill chunks set this to false to skip the lm_head + sampling path.
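To make the roles of `pos_offset` and `is_final_chunk` concrete, here is a small sketch under stated assumptions. The empty `KvCacheHandle` trait, `StubCache`, `make_item`, `positions`, and `logit_rows` are hypothetical names introduced for illustration; the actual model layer may select logit rows differently. `positions` gives the absolute positions covered by an item, and `logit_rows` lists the rows of the concatenated token tensor whose logits would be needed for sampling.

```rust
use std::ops::Range;
use std::sync::Arc;

// Hypothetical stand-in: the real `KvCacheHandle` trait is not shown here.
pub trait KvCacheHandle {}

// Hypothetical no-op cache used only to build example items.
pub struct StubCache;
impl KvCacheHandle for StubCache {}

// Mirrors the struct documented on this page.
pub struct UnifiedBatchItem {
    pub seq_id: String,
    pub q_tokens: Vec<u32>,
    pub kv_cache: Arc<dyn KvCacheHandle>,
    pub pos_offset: usize,
    pub is_final_chunk: bool,
}

// Hypothetical helper that builds an item backed by the stub cache.
pub fn make_item(
    seq_id: &str,
    q_tokens: Vec<u32>,
    pos_offset: usize,
    is_final_chunk: bool,
) -> UnifiedBatchItem {
    UnifiedBatchItem {
        seq_id: seq_id.to_string(),
        q_tokens,
        kv_cache: Arc::new(StubCache),
        pos_offset,
        is_final_chunk,
    }
}

// Absolute positions covered by this item's tokens.
pub fn positions(item: &UnifiedBatchItem) -> Range<usize> {
    item.pos_offset..item.pos_offset + item.q_tokens.len()
}

// Rows of the concatenated token tensor that need logits: the last
// token of every item whose `is_final_chunk` is true.
pub fn logit_rows(items: &[UnifiedBatchItem]) -> Vec<usize> {
    let mut rows = Vec::new();
    let mut start = 0;
    for item in items {
        let len = item.q_tokens.len();
        if item.is_final_chunk && len > 0 {
            rows.push(start + len - 1);
        }
        start += len;
    }
    rows
}
```

In this sketch an intermediate prefill chunk contributes rows to the batch but no entry to `logit_rows`, which corresponds to skipping the lm_head + sampling path for that item.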
Trait Implementations
impl Clone for UnifiedBatchItem
fn clone(&self) -> UnifiedBatchItem
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.