pub struct MtpSession { /* private fields */ }
Owned MTP (multi-token prediction) draft session.
Dropping it frees the underlying mtp_session * and the C++
common_speculative * that it holds.
§Lifetime contract (manual)
The session holds raw pointers to both the target and draft
LlamaContexts. The caller must keep both contexts alive (i.e. not
drop them) for as long as the session exists. The borrow checker does
not enforce this contract: the session holds no Rust borrows of the
contexts, because both contexts must remain individually mutable
(you call target.decode(...) while the session exists, and the
session also mutates the draft context internally).
Dropping a context that the session still references is undefined
behaviour at the C++ level (use-after-free inside common_speculative_*).
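One way to make the contract hold structurally is to keep the session and both contexts in a single owner and rely on Rust's field drop order. The sketch below uses hypothetical stand-ins (MockContext, MockSession, SpeculativePipeline) that only record their name when dropped. It assumes, without having checked this crate, that the session's raw pointers refer to the C-side contexts rather than to the Rust wrapper values, so moving the owner does not invalidate them; a real owner would also need a lifetime parameter for the model borrow that LlamaContext<'_> carries.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Hypothetical stand-in for LlamaContext: logs its name when dropped.
struct MockContext {
    name: &'static str,
    log: Rc<RefCell<Vec<&'static str>>>,
}

impl Drop for MockContext {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

// Hypothetical stand-in for MtpSession: logs "session" when dropped.
struct MockSession {
    log: Rc<RefCell<Vec<&'static str>>>,
}

impl Drop for MockSession {
    fn drop(&mut self) {
        self.log.borrow_mut().push("session");
    }
}

// Hypothetical owner. Struct fields are dropped in declaration order,
// so declaring `session` first frees it before either context.
struct SpeculativePipeline {
    session: MockSession,
    draft: MockContext,
    target: MockContext,
}

fn build_pipeline(log: &Rc<RefCell<Vec<&'static str>>>) -> SpeculativePipeline {
    let target = MockContext { name: "target", log: Rc::clone(log) };
    let draft = MockContext { name: "draft", log: Rc::clone(log) };
    // Real code would call MtpSession::new(&target, &draft, ...) here.
    let session = MockSession { log: Rc::clone(log) };
    SpeculativePipeline { session, draft, target }
}
```

Within a single function the same guarantee comes for free: locals are dropped in reverse declaration order, so creating the session after both contexts means it is dropped first.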
Implementations
impl MtpSession
pub fn new(
target: &LlamaContext<'_>,
draft: &LlamaContext<'_>,
n_seq: u32,
n_draft_max: i32,
) -> Result<Self, MtpSessionError>
Construct an MTP draft session.
target must be a LlamaContextType::Default context.
draft must be a LlamaContextType::Mtp context built from the same
model and configured with with_n_rs_seq(>= n_draft_max).
n_seq is the number of concurrent sequences (1 for a single
conversation). n_draft_max caps the number of tokens drafted per
round (commonly 3 for Qwen3.6 MTP).
§Errors
Returns MtpSessionError::Init if upstream rejects the
configuration (e.g. the model has no MTP heads).
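The documented requirements on target and draft can be checked on the caller's side before calling new, which gives a more specific message than a generic MtpSessionError::Init. This is a minimal sketch with hypothetical names: ContextKind stands in for however the caller tracks the LlamaContextType of each context, draft_n_rs_seq for the value passed to with_n_rs_seq (assumed here to be a u32), and check_mtp_config is not part of this crate. Upstream still performs its own validation; the missing-MTP-heads case, for example, cannot be detected this way.

```rust
// Hypothetical mirror of the two context types named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ContextKind {
    Default,
    Mtp,
}

/// Hypothetical pre-flight check for the documented requirements of
/// MtpSession::new. Only the three stated rules are checked.
fn check_mtp_config(
    target_kind: ContextKind,
    draft_kind: ContextKind,
    draft_n_rs_seq: u32,
    n_draft_max: i32,
) -> Result<(), String> {
    if target_kind != ContextKind::Default {
        return Err(format!("target must be Default, got {:?}", target_kind));
    }
    if draft_kind != ContextKind::Mtp {
        return Err(format!("draft must be Mtp, got {:?}", draft_kind));
    }
    // with_n_rs_seq(>= n_draft_max): widen both sides to i64 before comparing.
    if i64::from(draft_n_rs_seq) < i64::from(n_draft_max) {
        return Err(format!(
            "draft n_rs_seq ({draft_n_rs_seq}) must be >= n_draft_max ({n_draft_max})"
        ));
    }
    Ok(())
}
```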
pub fn need_embd(&self) -> bool
Returns true if MTP requires embeddings to be extractable from the target
context. For MTP this is always true; the method exists for symmetry with
upstream's common_speculative_need_embd.
pub fn n_draft_max(&self) -> i32
Configured maximum number of tokens drafted per draft
call.
pub fn begin(
&mut self,
seq_id: i32,
prompt: &[LlamaToken],
) -> Result<(), MtpSessionError>
Optional: call once at the start of a fresh generation with the prompt tokens that were just decoded into the target context.
pub fn process(&mut self, batch: &LlamaBatch) -> Result<(), MtpSessionError>
Hand the session a batch that was just decoded on the target context. MTP needs to see every target batch (prompt prefill + each verification step) to keep its per-sequence pre-norm-embedding carryover in sync.
§Errors
Returns MtpSessionError::Process if upstream rejects the batch
(most often: the batch carries embd directly rather than tokens).
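The call order implied by process and draft is easiest to see in a greedy speculative loop. The sketch below replaces every real type with a hypothetical mock: target_next stands in for a target decode, MockMtp stands in for MtpSession (it drafts with a deliberately imperfect rule and counts the batches it is shown), and tokens are plain i32. It leaves out seq_id, n_past, the optional begin call, and KV-cache rollback of rejected drafts, all of which a real loop must handle. What it shows is that the prefill batch and every verification batch go through process, and that greedy acceptance makes the output identical to decoding with the target alone.

```rust
// Hypothetical token type; the real crate uses LlamaToken.
type Token = i32;

// Hypothetical target model: the greedy next token after `prev`.
fn target_next(prev: Token) -> Token {
    (prev * 7 + 3) % 50
}

// Hypothetical mock of MtpSession.
struct MockMtp {
    n_draft_max: usize,
    batches_seen: usize,
}

impl MockMtp {
    fn process(&mut self, _batch: &[Token]) {
        // The real session updates its per-sequence embedding carryover here.
        self.batches_seen += 1;
    }

    fn draft(&mut self, id_last: Token) -> Vec<Token> {
        let mut out = Vec::with_capacity(self.n_draft_max);
        let mut prev = id_last;
        for _ in 0..self.n_draft_max {
            let exact = target_next(prev);
            // Deliberately diverge from the target when prev is a multiple of 5.
            let guess = if prev % 5 == 0 { (exact + 1) % 50 } else { exact };
            out.push(guess);
            prev = guess;
        }
        out
    }
}

/// Greedy speculative loop: draft, verify on the target, show the
/// verification batch to the session, keep the matching prefix.
fn generate(session: &mut MockMtp, prompt: &[Token], n_new: usize) -> Vec<Token> {
    // Prefill: the target decodes the prompt, and the session must see it too.
    session.process(prompt);
    let mut id_last = target_next(*prompt.last().expect("prompt must not be empty"));
    let mut out = vec![id_last];
    while out.len() < n_new {
        let drafted = session.draft(id_last);
        // Verification batch: last accepted token followed by the drafted tokens.
        let mut batch = vec![id_last];
        batch.extend_from_slice(&drafted);
        // "Decode" on the target: its prediction after each batch position.
        let predicted: Vec<Token> = batch.iter().map(|&t| target_next(t)).collect();
        session.process(&batch);
        // Accept drafted tokens while they match, then take the target's token.
        let mut n_ok = 0;
        while n_ok < drafted.len() && drafted[n_ok] == predicted[n_ok] {
            n_ok += 1;
        }
        out.extend_from_slice(&drafted[..n_ok]);
        out.push(predicted[n_ok]);
        id_last = predicted[n_ok];
    }
    out.truncate(n_new);
    out
}
```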
pub fn draft(
&mut self,
seq_id: i32,
n_past: i32,
id_last: LlamaToken,
) -> Result<Vec<LlamaToken>, MtpSessionError>
Generate up to n_draft_max speculative tokens
for sequence seq_id, starting from id_last at position n_past.
Returns an owned Vec<LlamaToken> of length <= n_draft_max.
§Errors
Returns MtpSessionError::BadSeqId if seq_id is outside the
configured n_seq range.
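Because draft starts from id_last at position n_past, the matching verification batch places id_last at n_past and each drafted token at the following positions. The helper below is a hypothetical sketch that computes only those (token, position) pairs; adding them to a LlamaBatch, together with the sequence id and logits flags, is left to the caller because that API is not shown here. The layout follows the usual llama.cpp speculative-decoding pattern and is an assumption about how this crate expects to be driven.

```rust
// Hypothetical token type; the real crate uses LlamaToken.
type Token = i32;

/// Hypothetical helper: (token, position) pairs for one verification
/// batch, with id_last at n_past and drafts at n_past + 1, n_past + 2, ...
fn verification_batch(
    n_past: i32,
    id_last: Token,
    drafted: &[Token],
    n_draft_max: i32,
) -> Vec<(Token, i32)> {
    // draft promises at most n_draft_max tokens; enforce it defensively.
    let cap = n_draft_max.max(0) as usize;
    let drafted = &drafted[..drafted.len().min(cap)];
    let mut batch = Vec::with_capacity(drafted.len() + 1);
    batch.push((id_last, n_past));
    for (i, &tok) in drafted.iter().enumerate() {
        batch.push((tok, n_past + 1 + i as i32));
    }
    batch
}
```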