Module mtp

Safe wrapper around the C++ MTP draft session.

MtpSession pairs a target LlamaContext with an MTP draft LlamaContext (built with crate::context::params::LlamaContextType::Mtp) and drives the multi-token-prediction speculative-decoding loop introduced in upstream llama.cpp PR #22673.

The draft algorithm lives in upstream’s common/speculative.cpp (common_speculative_impl_draft_mtp). This module wraps it through a stable C shim in llama-cpp-sys-4/mtp_shim/.

§Upstream behaviour (llama.cpp #23269+)

After the MTP clean-up in #23269:

Draft sampling uses top_k = 10 inside upstream (not configurable from Rust).
MtpSessionConfig::p_min filters low-confidence draft tokens (default 0.0).
The upstream CLI defaults n_max to 3. Set MtpSessionConfig::n_draft_max explicitly, because the optimal value depends on the model and quantization (see MTP.md on GitHub).
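
To make the interaction between p_min and n_draft_max concrete, here is a minimal standalone sketch. It does not call this crate or upstream. It assumes that "filtering" means the draft is cut at the first token whose probability falls below p_min (later draft tokens are conditioned on earlier ones, so keeping them would make little sense); the exact upstream rule may differ. DraftCandidate and truncate_draft are hypothetical names, and i32 stands in for the crate's token type.

```rust
/// Hypothetical stand-in for one drafted token together with the
/// probability the draft head assigned to it.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct DraftCandidate {
    pub token: i32,
    pub prob: f32,
}

/// Keep at most `n_draft_max` tokens and stop at the first candidate whose
/// probability is below `p_min`. This truncation rule is an assumption made
/// for illustration, not a statement about upstream's implementation.
pub fn truncate_draft(
    candidates: &[DraftCandidate],
    p_min: f32,
    n_draft_max: usize,
) -> Vec<i32> {
    candidates
        .iter()
        .take(n_draft_max)
        .take_while(|c| c.prob >= p_min)
        .map(|c| c.token)
        .collect()
}
```

With the default p_min of 0.0 every candidate passes, so only n_draft_max limits the draft length. Raising p_min trades draft length for a higher expected acceptance rate.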

§Quick start

use llama_cpp_4::context::params::{LlamaContextParams, LlamaContextType};
use llama_cpp_4::mtp::{MtpSession, MtpSessionConfig};

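let n_draft_max = 3;

// `model` and `backend` are assumed to be an already loaded model and an
// initialized backend. Both contexts are created from the same model.
let target = model.new_context(&backend, LlamaContextParams::default())?;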
let draft = model.new_context(
    &backend,
    LlamaContextParams::default()
        .with_ctx_type(LlamaContextType::Mtp)
        .with_n_rs_seq(n_draft_max.max(4)),
)?;

let config = MtpSessionConfig::new(1, n_draft_max).with_p_min(0.0);
let mut session = MtpSession::new_with_config(&target, &draft, config)?;

§Speculative loop

For each generation step, run the following sequence, starting with the target decode:

// 1. Target prefill or verify decode (you build the batch)
target.decode(&mut batch)?;

// 2. Tell MTP about the batch just decoded on the target
session.process(&batch)?;

// 3. Ask for draft tokens starting from the last accepted token
let drafts = session.draft(0, n_past, last_token)?;

// 4. Verify drafts on the target (compare logits / sample — your code)
let n_accepted: u16 = /* ... */;

// 5. Sync draft recurrent state with what the target accepted
session.accept(0, n_accepted)?;
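
Step 4 is left to the caller. One common choice is greedy verification, sketched below as a standalone function that does not depend on this crate. It assumes the verify batch contained last_token followed by every draft token, so the target yields one greedy (argmax) token per draft plus one extra. It also assumes n_accepted counts accepted draft tokens. Verified and verify_greedy are hypothetical names, and i32 stands in for the crate's token type.

```rust
use std::convert::TryFrom;

/// Outcome of checking one batch of draft tokens against the target.
#[derive(Debug, PartialEq, Eq)]
pub struct Verified {
    /// Number of leading draft tokens the target agreed with.
    pub n_accepted: u16,
    /// Token to emit after the accepted drafts: the target's own choice
    /// at the first mismatch, or its bonus token if every draft matched.
    pub next_token: i32,
}

/// Greedy verification. `target_tokens[i]` is the target's argmax after
/// seeing `last_token` and `drafts[..i]`, so it must hold exactly
/// `drafts.len() + 1` entries. Returns `None` on a length mismatch.
pub fn verify_greedy(drafts: &[i32], target_tokens: &[i32]) -> Option<Verified> {
    if target_tokens.len() != drafts.len() + 1 {
        return None;
    }
    // Count the matching prefix; the first disagreement ends acceptance.
    let n = drafts
        .iter()
        .zip(target_tokens.iter())
        .take_while(|(d, t)| d == t)
        .count();
    let n_accepted = u16::try_from(n).ok()?;
    Some(Verified {
        n_accepted,
        next_token: target_tokens[n],
    })
}
```

The resulting n_accepted is the value that step 5 would pass to session.accept, and next_token would become last_token for the next iteration. Sampling-based verification would replace the equality test with an acceptance rule over the target's probabilities.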

Call MtpSession::begin once per fresh generation if you want upstream prompt tracking (optional for MTP). Call MtpSession::print_stats when finished to log draft/accept counters via llama.cpp’s log callback.

A full runnable implementation is in examples/mtp/.

§Embedding requirements

| Method | Typical value for MTP | Meaning |
|---|---|---|
| `MtpSession::need_embd_pre_norm` | `true` | Next-n hidden states (upstream name) |
| `MtpSession::need_embd` | `false` | Post-norm / seq embeddings not used |

§Multi-head NextN (Step3.5+)

When crate::model::LlamaModel::n_layer_nextn returns a value greater than 1, set the draft context head before each MtpSession::draft call:

for head in 0..model.n_layer_nextn() {
    draft.set_nextn_layer_offset(head);
    let drafts = session.draft(0, n_past, last_token)?;
    // verify on target ...
}
draft.set_nextn_layer_offset(0); // restore default
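
In the loop above, an early return through ? would skip the final reset and leave a non-zero head selected. One way to guard against that is a small drop guard, sketched here against a mock rather than the real context. The NextnHead trait, HeadGuard, MockDraft, and draft_all_heads are all hypothetical; the &self receiver and the u32 head type are assumptions about set_nextn_layer_offset, not confirmed signatures.

```rust
use std::cell::Cell;

/// Hypothetical abstraction over the draft context's head selector.
pub trait NextnHead {
    fn set_nextn_layer_offset(&self, head: u32);
}

/// Selects a head on creation and resets the offset to 0 when dropped,
/// including when the enclosing scope exits early with an error.
pub struct HeadGuard<'a, C: NextnHead>(&'a C);

impl<'a, C: NextnHead> HeadGuard<'a, C> {
    pub fn select(ctx: &'a C, head: u32) -> Self {
        ctx.set_nextn_layer_offset(head);
        HeadGuard(ctx)
    }
}

impl<'a, C: NextnHead> Drop for HeadGuard<'a, C> {
    fn drop(&mut self) {
        self.0.set_nextn_layer_offset(0);
    }
}

/// Mock draft context that only records the selected offset.
#[derive(Default)]
pub struct MockDraft {
    pub offset: Cell<u32>,
}

impl NextnHead for MockDraft {
    fn set_nextn_layer_offset(&self, head: u32) {
        self.offset.set(head);
    }
}

/// Runs `draft_one` once per head; the guard restores offset 0 after
/// every iteration, whether it succeeded or failed.
pub fn draft_all_heads<C: NextnHead>(
    ctx: &C,
    n_heads: u32,
    mut draft_one: impl FnMut(u32) -> Result<(), String>,
) -> Result<(), String> {
    for head in 0..n_heads {
        let _guard = HeadGuard::select(ctx, head);
        draft_one(head)?;
    }
    Ok(())
}
```

In real code the closure body would hold the session.draft call and the target-side verification for that head.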

Structs§

MtpSession: Owned MTP draft session.
MtpSessionConfig: Parameters for MtpSession::new_with_config.

Enums§

MtpSessionError: Errors raised by the MTP draft session.
