Struct Qwen3Runner

Source

pub struct Qwen3Runner { /* private fields */ }

Resolved Qwen3 runner. Call Qwen3Runner::generate for streaming decode (F32 path), or Qwen3Runner::predict_logits for a single forward pass (works in both F32 and packed modes).

Implementations§

Source §

pub fn disable_decode_compile_cache(&mut self)

Bypass the cached decode path; every generated token re-runs the full prefill graph from scratch. Slow (O(N²) in the sequence length N), but useful as a reference for numerical parity checks against the cached path.
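A parity check between the cached and uncached paths could compare generated id vectors with plain equality and compare logits element-wise within a tolerance. The helpers below are a minimal sketch of the logits comparison; max_abs_diff, logits_match, and the tolerance parameter are illustrative names and are not part of this crate.

```rust
/// Largest element-wise absolute difference between two logit vectors.
/// Returns None when the lengths differ, since such vectors are not comparable.
/// (Hypothetical helper for illustration; not part of the crate.)
pub fn max_abs_diff(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    Some(
        a.iter()
            .zip(b)
            .map(|(x, y)| (x - y).abs())
            // f32::max ignores NaN operands, so NaN differences are not reported here.
            .fold(0.0_f32, f32::max),
    )
}

/// True when both vectors have the same length and agree within `tol` everywhere.
pub fn logits_match(a: &[f32], b: &[f32], tol: f32) -> bool {
    matches!(max_abs_diff(a, b), Some(d) if d <= tol)
}
```

In practice one would presumably collect outputs from one runner with the cache enabled and from another after calling disable_decode_compile_cache, then feed both into a check like this; a suitable tolerance depends on the backend and is left to the caller.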

Source

pub fn predict_logits(&mut self, prompt_ids: &[u32]) -> Result<Vec<f32>, Error>

Run a single prefill pass and return the last-position logits. Works in both F32 mode and packed-weights mode; in packed mode this is the only forward path supported today (streaming decode still goes through the F32 generator).

The prompt length must match the bucket the runner was built for (max_seq); shorter prompts are padded with the first token, longer prompts are truncated.

Source

pub fn generate_packed( &mut self, prompt_ids: &[u32], n_new: usize, on_token: impl FnMut(u32), ) -> Result<Vec<u32>, Error>

Generate n_new tokens via repeated packed-mode prefills. Each step runs the full prefill graph against the growing token history (padded/truncated to max_seq), samples the next id, and appends it. Calls on_token once per generated id.

Trade-off vs generate() on the F32 path: every token pays a full prefill instead of one decode step, so wall-clock throughput is roughly max_seq × slower. Memory stays packed, though, which makes this the only path that actually loads 14 B+ Q4_K_M GGUFs on a 32 GB Mac today. Tighter throughput needs the real bucketed decode-graph machinery (separate TODO; see CHANGELOG known-limitations).

Source

pub fn generate( &mut self, prompt_ids: &[u32], n_new: usize, on_token: impl FnMut(u32), ) -> Result<Vec<u32>, Error>

Generate n_new tokens after the given prompt. on_token is called once per generated id when stream(true) is set; otherwise the callback fires once at the end with the full vector. Returns the full generated id sequence.

The prompt is expected as raw token ids; tokenizer integration lives outside this module today (use the example binary for an end-to-end pipeline that wires tokenizers).
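The impl FnMut(u32) callback in generate can capture mutable state, which is the usual way to collect or print ids as they stream. The sketch below exercises that pattern without a model: fake_generate is a hypothetical stand-in that only mirrors the parameter shape of generate (it uses String as its error type and emits placeholder ids), and stream_and_collect is an illustrative wrapper.

```rust
/// Stand-in with the same parameter shape as `generate`. It emits
/// `prompt_ids.len() + i` as fake ids so the example runs without a model.
/// Purely illustrative; not part of the crate.
pub fn fake_generate(
    prompt_ids: &[u32],
    n_new: usize,
    mut on_token: impl FnMut(u32),
) -> Result<Vec<u32>, String> {
    let mut out = Vec::with_capacity(n_new);
    for i in 0..n_new {
        let id = (prompt_ids.len() + i) as u32; // placeholder for a sampled id
        on_token(id); // streamed to the caller as soon as it is "generated"
        out.push(id);
    }
    Ok(out)
}

/// Collects the streamed ids through the callback and also returns the
/// vector handed back by the generator, so the two can be compared.
pub fn stream_and_collect(
    prompt_ids: &[u32],
    n_new: usize,
) -> Result<(Vec<u32>, Vec<u32>), String> {
    let mut streamed = Vec::new();
    // The closure borrows `streamed` mutably only for the duration of the call.
    let returned = fake_generate(prompt_ids, n_new, |id| streamed.push(id))?;
    Ok((streamed, returned))
}
```

With a real runner the closure would typically decode each id with an external tokenizer and print it, since tokenizer integration is outside this module.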

Source

pub fn generate_stoppable( &mut self, prompt_ids: &[u32], n_new: usize, on_token: impl FnMut(u32) -> bool, ) -> Result<Vec<u32>, Error>

Like Qwen3Runner::generate, but the callback can return false to stop sampling early (e.g. on EOS).
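A stop-on-EOS callback can be sketched as follows. Everything here is an assumption made for illustration: EOS_ID is a made-up id (the real one comes from the tokenizer), fake_generate_stoppable replays a scripted id sequence instead of sampling from a model, and the sketch keeps the id that triggered the stop in the returned vector, which the real method may or may not do.

```rust
/// Hypothetical end-of-sequence id; the real value depends on the tokenizer.
pub const EOS_ID: u32 = 0;

/// Stand-in for a stoppable generator: replays `script` instead of sampling.
/// Stops after `n_new` ids or as soon as the callback returns false.
/// Assumption: the id that triggered the stop is still included in the output.
pub fn fake_generate_stoppable(
    script: &[u32],
    n_new: usize,
    mut on_token: impl FnMut(u32) -> bool,
) -> Vec<u32> {
    let mut out = Vec::new();
    for &id in script.iter().take(n_new) {
        out.push(id);
        if !on_token(id) {
            break; // the callback asked to stop early
        }
    }
    out
}

/// Runs the generator with a callback that returns false on EOS.
pub fn collect_until_eos(script: &[u32], n_new: usize) -> Vec<u32> {
    fake_generate_stoppable(script, n_new, |id| id != EOS_ID)
}
```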

Trait Implementations§

Source §

impl LmRunner for Qwen3Runner

Source §

fn family(&self) -> &'static str

Short family identifier ("qwen3", "llama32", "gemma").

Source §

fn vocab_size(&self) -> usize

LM head vocabulary size.

Source §

fn predict_logits(&mut self, prompt_ids: &[u32]) -> Result<Vec<f32>, Error>

Run prefill on prompt_ids and return last-token logits.
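The returned logits have one entry per vocabulary id, so a greedy next-token choice is an argmax over the vector. The function below is a minimal sketch of that step; argmax is an illustrative helper rather than a crate API, and its handling of ties (first index wins) and NaN values (skipped) is a choice made for this example.

```rust
/// Index of the largest finite-or-infinite logit, as a token id.
/// Returns None for an empty slice or one containing only NaN.
/// Ties resolve to the lowest index. (Illustrative helper, not a crate API.)
pub fn argmax(logits: &[f32]) -> Option<u32> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        if v.is_nan() {
            continue; // NaN cannot be ordered, so it never wins
        }
        match best {
            // Keep the current best unless `v` is strictly larger.
            Some((_, b)) if v <= b => {}
            _ => best = Some((i, v)),
        }
    }
    best.map(|(i, _)| i as u32)
}
```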

Source §

fn generate( &mut self, prompt_ids: &[u32], n_new: usize, on_token: &mut dyn FnMut(u32) -> bool, ) -> Result<Vec<u32>, Error>

Generate up to n_new tokens after prompt_ids using greedy (argmax) sampling. The default impl re-prefills on the full context each step; per-family runners should override it with their cached decode fast path.
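The re-prefill strategy of the default impl can be sketched with a miniature trait. TinyRunner and Counter below are hypothetical and exist only for this example; they are not the crate's LmRunner, they use String as the error type, and the sketch assumes the id on which the callback returns false is still kept in the output.

```rust
/// Hypothetical miniature of the trait shape described above.
pub trait TinyRunner {
    fn predict_logits(&mut self, prompt_ids: &[u32]) -> Result<Vec<f32>, String>;

    /// Default: re-run prefill on the full context each step and take the argmax.
    fn generate(
        &mut self,
        prompt_ids: &[u32],
        n_new: usize,
        on_token: &mut dyn FnMut(u32) -> bool,
    ) -> Result<Vec<u32>, String> {
        let mut context = prompt_ids.to_vec();
        let mut generated = Vec::with_capacity(n_new);
        for _ in 0..n_new {
            let logits = self.predict_logits(&context)?; // full prefill every step
            let mut best = 0usize;
            for (i, &v) in logits.iter().enumerate() {
                if v > logits[best] {
                    best = i;
                }
            }
            let id = best as u32;
            context.push(id);
            generated.push(id);
            if !on_token(id) {
                break; // the callback asked to stop
            }
        }
        Ok(generated)
    }
}

/// Toy model: the most likely next token is always (last id + 1) mod vocab.
pub struct Counter {
    pub vocab: usize,
    pub prefills: usize, // counts how many prefill passes were run
}

impl TinyRunner for Counter {
    fn predict_logits(&mut self, prompt_ids: &[u32]) -> Result<Vec<f32>, String> {
        if self.vocab == 0 {
            return Err("vocab must be non-zero".to_string());
        }
        self.prefills += 1;
        let last = *prompt_ids.last().ok_or("empty prompt")? as usize;
        let mut logits = vec![0.0; self.vocab];
        logits[(last + 1) % self.vocab] = 1.0;
        Ok(logits)
    }
}
```

Generating n tokens this way costs n prefill passes over a growing context, which is why a runner with a cached decode path would want to override the default.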

Source §

fn supports_multimodal(&self) -> bool

Whether this runner supports multimodal (image+text) generation.

Source §

fn generate_multimodal( &mut self, _prompt: &str, _rgb: &[u8], _img_w: usize, _img_h: usize, _tokenizer: Option<&Path>, _n_new: usize, _on_token: &mut dyn FnMut(u32) -> bool, ) -> Result<Vec<u32>, Error>

Multimodal generation: prefill with text where image markers are spliced with vision embeddings derived from rgb.
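The signature passes the image as a raw byte slice plus width and height, so callers may want to validate the buffer before handing it over. The sketch below assumes tightly packed 8-bit RGB with three bytes per pixel and no row padding; that layout is an assumption, not something this page states, and check_rgb_len is a hypothetical helper.

```rust
/// Checks that `rgb` holds exactly `img_w * img_h * 3` bytes.
/// Assumption: packed 8-bit RGB, 3 bytes per pixel, no row padding.
/// (Hypothetical helper; not part of the crate.)
pub fn check_rgb_len(rgb: &[u8], img_w: usize, img_h: usize) -> Result<(), String> {
    // Use checked arithmetic so absurd dimensions fail cleanly instead of wrapping.
    let expected = img_w
        .checked_mul(img_h)
        .and_then(|px| px.checked_mul(3))
        .ok_or_else(|| "image dimensions overflow usize".to_string())?;
    if rgb.len() == expected {
        Ok(())
    } else {
        Err(format!(
            "expected {} bytes for a {}x{} RGB image, got {}",
            expected,
            img_w,
            img_h,
            rgb.len()
        ))
    }
}
```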

Auto Trait Implementations§

§

impl !RefUnwindSafe for Qwen3Runner

§

impl !Sync for Qwen3Runner

§

impl !UnwindSafe for Qwen3Runner

§

impl Freeze for Qwen3Runner

§

impl Send for Qwen3Runner

§

impl Unpin for Qwen3Runner

§

impl UnsafeUnpin for Qwen3Runner

Blanket Implementations§

Source §

impl<T> Any for T
where T: 'static + ?Sized,

Source §

fn type_id(&self) -> TypeId

Gets the TypeId of self.

Source §

impl<T> Borrow<T> for T
where T: ?Sized,

Source §

fn borrow(&self) -> &T

Immutably borrows from an owned value.

Source §

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source §

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

Source §

impl<ST, DT> CastableFrom<ST, Initialized, Initialized> for DT
where ST: ?Sized, DT: ?Sized,

Source §

impl<ST, DT> CastableFrom<ST, Uninit, Uninit> for DT
where ST: ?Sized, DT: ?Sized,

Source §

impl<T> From<T> for T

Source §

fn from(t: T) -> T

Returns the argument unchanged.

Source §

impl<T, U> Into<U> for T
where U: From<T>,

Source §

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source §

impl<T> IntoEither for T

Source §

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

Source §

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

Source §